usds / justice40-tool

A tool to identify disadvantaged communities due to environmental, socioeconomic and health burdens
https://screeningtool.geoplatform.gov/
Creative Commons Zero v1.0 Universal

Investigate using filetype other than CSV for intermediate outputs of the backend process, to potentially speed up score generation process #1680

Open lucasmbrown-usds opened 2 years ago

lucasmbrown-usds commented 2 years ago

Is your feature request related to a problem? Please describe.

This was originally mentioned in a design document for The Big Refactor: Change filetype of intermediate outputs from CSV to another filetype.

Switching to Feather or Parquet would be a really simple code change but could speed up I/O.

For instance, every ETL class loads data from an external data source, and then writes a big output file to CSV. The score generation process loads 24-ish of these big CSV files and combines them, generates a bunch of scores, then writes them to CSV. Then the tile generation process loads the big score CSV and turns it into tiles and downloads. Then the comparison tool loads the big score CSV and runs a bunch of reporting on it and writes Excel files.

Every "CSV" in the paragraph above could be replaced with Feather, Parquet, or something similar to dramatically reduce the time it takes to read and write the files.
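
As a rough illustration of how small the change could be, here is a minimal sketch that picks the pandas reader or writer from the file extension, so an ETL step would only need a different output filename. The helper names `write_frame` and `read_frame` and the `tract_id` column are hypothetical and not taken from this codebase; the Parquet and Feather branches assume `pyarrow` (or `fastparquet` for Parquet) is installed.

```python
from pathlib import Path

import pandas as pd


# Hypothetical helpers, not part of the justice40-tool codebase.
def write_frame(df: pd.DataFrame, path) -> None:
    """Write df to path, choosing the format from the file extension."""
    path = Path(path)
    if path.suffix == ".csv":
        df.to_csv(path, index=False)
    elif path.suffix == ".parquet":
        # Requires pyarrow or fastparquet.
        df.to_parquet(path, index=False)
    elif path.suffix == ".feather":
        # Requires pyarrow; pandas only accepts a default RangeIndex here.
        df.to_feather(path)
    else:
        raise ValueError(f"Unsupported file type: {path.suffix}")


def read_frame(path, dtype=None) -> pd.DataFrame:
    """Read a frame back, choosing the format from the file extension."""
    path = Path(path)
    if path.suffix == ".csv":
        # CSV stores no column types, so callers must pass dtype
        # (for example, to keep leading zeros in ID columns).
        return pd.read_csv(path, dtype=dtype)
    if path.suffix == ".parquet":
        return pd.read_parquet(path)
    if path.suffix == ".feather":
        return pd.read_feather(path)
    raise ValueError(f"Unsupported file type: {path.suffix}")
```

With helpers like these, `write_frame(df, "usa.parquet")` followed by `read_frame("usa.parquet")` would replace the CSV round trip. A side benefit is that Parquet and Feather store column types, so string ID columns with leading zeros come back as strings without a `dtype` argument.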

lucasmbrown-usds commented 2 years ago

This comment seems to indicate the speedup of some of these approaches would be substantial:

feather with "zstd" compression (for I/O speed): compared to csv, feather exporting has 20x faster exporting and about 6x times faster importing. The storage is around 32% from the original file size, which is 10% worse than parquet "gzip" and csv zipped but still decent.
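
Numbers like these depend heavily on the shape of the data, so it may be worth measuring on something closer to our own files. Below is a minimal timing sketch on synthetic numeric data; the function names, row and column counts, and column names are all assumptions for illustration, and the compression settings simply mirror the ones mentioned in the quote. Formats whose engine (`pyarrow` or `fastparquet`) is not installed are skipped.

```python
import tempfile
import time
from pathlib import Path

import numpy as np
import pandas as pd


def time_round_trip(df, path, writer, reader):
    """Time one write and one read of df at path, and record the file size."""
    start = time.perf_counter()
    writer(df, path)
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    reader(path)
    read_s = time.perf_counter() - start

    return {"write_s": write_s, "read_s": read_s, "bytes": path.stat().st_size}


def benchmark(n_rows=100_000, n_cols=10):
    """Compare CSV, Feather (zstd), and Parquet (gzip) on random float data."""
    rng = np.random.default_rng(0)
    df = pd.DataFrame(
        rng.random((n_rows, n_cols)),
        columns=[f"col_{i}" for i in range(n_cols)],
    )
    formats = {
        "csv": (lambda d, p: d.to_csv(p, index=False), pd.read_csv),
        "feather": (lambda d, p: d.to_feather(p, compression="zstd"), pd.read_feather),
        "parquet": (lambda d, p: d.to_parquet(p, compression="gzip"), pd.read_parquet),
    }
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name, (writer, reader) in formats.items():
            try:
                results[name] = time_round_trip(
                    df, Path(tmp) / f"data.{name}", writer, reader
                )
            except ImportError:
                # The optional engine for this format is not installed; skip it.
                continue
    return results
```

Calling `benchmark()` returns a dict keyed by format with write time, read time, and file size. Random floats compress poorly and our real outputs mix strings and numbers, so this is only a starting point; pointing the same timing function at one of the actual intermediate files would give a more trustworthy comparison.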

lucasmbrown-usds commented 2 years ago

Wow, this post implies a 150x speed improvement.

That’s a drastic difference — native Feather is around 150 times faster than CSV.

Overall that post is really good: it has a pretty compelling title 😂 "Stop Using CSVs for Storage — This File Format Is 150 Times Faster. CSV’s are costing you time, disk space, and money. It’s time to end it."