Wanted to see if there is interest in a patch that helps speed up our workflows significantly, or if there are any further ideas for improving on such a feature. If this is out of scope for this project, I'm happy to continue maintaining my fork of this project.
Use Case
We currently maintain a folder of >200 CSV files totaling a few hundred megabytes, and have a CI step that builds these CSVs into a SQLite database. The CSV files are updated 2-3 times a day, but only with small changes. Currently, running csvs-to-sqlite with the --replace-tables flag takes roughly 6-7 minutes, which is too long for our use case.
Solution
Add a --update-tables flag that maintains a checksum of each CSV file in a table called .csvs-meta (happy to change this name or make it configurable), and only reads the CSV and loads the dataframe if the checksum has changed.
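For illustration, here is a minimal sketch of how the checksum bookkeeping could work. The .csvs-meta schema, the choice of SHA-256, and the function names needs_update / record_checksum are my own assumptions for this sketch, not necessarily what the forked version does:

```python
import hashlib
import sqlite3


def file_checksum(path, chunk_size=65536):
    """SHA-256 digest of a file, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def needs_update(conn, csv_path):
    """True if csv_path has no recorded checksum or its checksum changed."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS ".csvs-meta" '
        "(path TEXT PRIMARY KEY, checksum TEXT)"
    )
    row = conn.execute(
        'SELECT checksum FROM ".csvs-meta" WHERE path = ?', (csv_path,)
    ).fetchone()
    return row is None or row[0] != file_checksum(csv_path)


def record_checksum(conn, csv_path):
    """Store the current checksum; call only after a successful import,
    so a failed build does not mark the file as up to date."""
    conn.execute(
        'INSERT OR REPLACE INTO ".csvs-meta" (path, checksum) VALUES (?, ?)',
        (csv_path, file_checksum(csv_path)),
    )
    conn.commit()
```

The main loop would then skip reading the CSV and rebuilding its table entirely whenever needs_update returns False, which is where the time savings come from when only a handful of the >200 files have changed.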
Forked Version Here