simonw / datasette.io

The official project website for Datasette
https://datasette.io

Tutorial or how-to on importing large CSVs #118

Open simonw opened 1 year ago

simonw commented 1 year ago

This came up on Discord: https://discord.com/channels/823971286308356157/823971286941302908/1017006129831739432

If you have a large (e.g. 9.5GB) CSV file, it's not obvious how best to import it.

csvs-to-sqlite tries to load the whole thing into RAM, which isn't ideal. sqlite-utils insert can stream it, which is better, but it's still quite slow. The best option is actually to create the table manually and then use the sqlite3 .import command to load the CSV, as described here: https://til.simonwillison.net/sqlite/import-csv - but it's not exactly obvious!
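
The pattern from that TIL is roughly this (a minimal sketch with placeholder file, table and column names; the CREATE TABLE needs to match your CSV):

sqlite3 mydata.db <<EOS
/* Create the table by hand so you control the column types */
CREATE TABLE mytable (id TEXT, name TEXT, value INTEGER);
.mode csv
/* The table already exists, so skip the CSV header row */
.import --skip 1 mydata.csv mytable
EOS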

Documentation could help here.

This is actually more of a "how-to" in the https://diataxis.fr/ framework as opposed to a tutorial.

simonw commented 1 year ago

https://datasette.io/tutorials could grow a "how-to" or "recipes" section.

CharlesNepote commented 1 year ago

Great idea. I think there are 3 steps.

1. Build the schema.

What I did (there are probably better options): I imported my CSV "as is", then used Datasette to copy the CREATE TABLE statement shown at the end of the table page, and edited it by hand to set the right type for each field. I have tried different tools to generate a good schema but none of them worked well... For example, our code field is often detected as an INTEGER when it isn't one (some codes begin with 0).
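
For example, for the Open Food Facts CSV used in the script below, the hand-edited statement might start like this (only a few columns shown; the names other than code and countries_en are just illustrative):

CREATE TABLE [all] (
    [code] TEXT,            /* forced to TEXT so leading zeros are kept */
    [product_name] TEXT,
    [countries_en] TEXT,
    [energy_100g] REAL
);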

2. Build the indexes

Create a file with one CREATE INDEX statement per line, e.g. CREATE INDEX ["all_countries_en"] ON [all]("countries_en");
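
So index.schema (the file read at the end of the script below) could look like this, one statement per line (the second index is just an illustrative extra):

CREATE INDEX ["all_countries_en"] ON [all]("countries_en");
CREATE INDEX ["all_code"] ON [all]("code");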

3. Then run the import in one command:

time sqlite3 products_new.db <<EOS
/* Here we could add examples of PRAGMA optimisations */
/* [...] */
/* Import schema */
.read create.schema
/* Options to be tuned */
/* .mode ascii or .mode csv */
.mode ascii
/* .separator is not necessary when using .mode csv */
.separator "\t" "\n"
/* --skip 1 skips the header row: the table already exists, so the header would otherwise be imported as data */
.import --skip 1 en.openfoodfacts.org.products.csv all
/* At the end, create all the indexes */
.read index.schema
EOS
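
Once it finishes, a quick sanity check along these lines (assuming the table is named all, as above) confirms the row count and the final schema:

sqlite3 products_new.db "SELECT COUNT(*) FROM [all];"
sqlite3 products_new.db ".schema all"
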
CharlesNepote commented 1 year ago

About optimisation, the only thing that works for me with a big CSV (6.2GB, 2.5 million rows, 185+ fields) is: PRAGMA page_size = 32768;

Before: CSV 6.2GB => time 3m2 => DB 9.5GB
After:  CSV 6.2GB => time 2m3 => DB 6.6GB
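
For reference, the fast-inserts blog post linked in the script below suggests PRAGMAs along these lines, though as noted above only page_size made a measurable difference on this dataset (and journal_mode/synchronous trade crash safety for speed):

PRAGMA page_size = 32768;
PRAGMA journal_mode = OFF;
PRAGMA synchronous = 0;
PRAGMA cache_size = 1000000;
PRAGMA locking_mode = EXCLUSIVE;
PRAGMA temp_store = MEMORY;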

It's open data, so anyone should be able to reproduce my test (needs ~18GB):

wget https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv
CREATE=`curl -s -L https://gist.githubusercontent.com/CharlesNepote/80fb813a416ad445fdd6e4738b4c8156/raw/032af70de631ff1c4dd09d55360f242949dcc24f/create.sql`
INDEX=`curl -s -L https://gist.githubusercontent.com/CharlesNepote/80fb813a416ad445fdd6e4738b4c8156/raw/032af70de631ff1c4dd09d55360f242949dcc24f/index.sql`
time sqlite3 products_new.db <<EOS
/* Optimisations. See: https://avi.im/blag/2021/fast-sqlite-inserts/ */;
PRAGMA page_size = 32768;
$CREATE
.mode ascii
.separator "\t" "\n"
.import --skip 1 en.openfoodfacts.org.products.csv all
$INDEX
EOS
simonw commented 1 year ago

Using the 2.5GB CSV from this post may be a good fit here: https://simonwillison.net/2022/Sep/29/webvid/