simonw opened this issue 1 year ago
https://datasette.io/tutorials could grow a "how-to" or "recipes" section.
Great idea. I think there are 3 steps.
What I did (though there are probably better options): I imported my CSV as-is, then used Datasette to copy and paste the "CREATE TABLE" statement shown at the end of the page, and edited it by hand to choose the right field types. I have tried different tools to generate a good schema, but none of them worked well... For example, our code field
is often detected as an INTEGER while it's not (some codes begin with 0).
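A minimal sketch of why the hand edit matters (the demo.db file and table/column names here are made up for illustration, and assume the sqlite3 CLI is installed): declaring the code column as TEXT keeps leading zeros that an INTEGER column would drop.

```shell
# Hypothetical demo: a TEXT column preserves codes with leading zeros.
sqlite3 demo.db "CREATE TABLE products (code TEXT, name TEXT);"
sqlite3 demo.db "INSERT INTO products VALUES ('0012345', 'Example');"
code=$(sqlite3 demo.db "SELECT code FROM products;")
echo "$code"   # prints 0012345, leading zero intact
rm demo.db
```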
Create a file with one index per line, e.g.:
CREATE INDEX ["all_countries_en"] ON [all]("countries_en");
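Each line of that file is a complete statement, so sqlite3's .read can execute it after the import. A tiny self-contained sketch (demo.db, the index name and the single-column table are made up for illustration):

```shell
# One CREATE INDEX statement per line, executed with .read.
printf 'CREATE INDEX idx_countries_en ON [all](countries_en);\n' > index.schema
sqlite3 demo.db <<'EOS'
CREATE TABLE [all] (countries_en TEXT);
.read index.schema
EOS
n=$(sqlite3 demo.db "SELECT count(*) FROM sqlite_master WHERE type='index';")
echo "$n"   # prints 1
rm demo.db index.schema
```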
time sqlite3 products_new.db <<EOS
/* Here we could add examples of PRAGMA optimisations */
/* [...] */
/* Import schema */
.read create.schema
/* Options to be tuned */
/* .mode ascii or .mode csv */
.mode ascii
/* .separator is not necessary when .mode csv */
.separator "\t" "\n"
/* --skip 1 is only useful with .mode ascii, if there is a header */
.import --skip 1 en.openfoodfacts.org.products.csv all
/* At the end, create all the indexes */
.read index.schema
EOS
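A self-contained miniature of the same pipeline, to show the moving parts together (mini.tsv, mini.db and the column names are hypothetical; --skip needs sqlite3 3.32+):

```shell
# Tiny tab-separated input with a header row.
printf 'code\tproduct_name\n0001\tTea\n0002\tCoffee' > mini.tsv
sqlite3 mini.db <<'EOS'
/* Schema created by hand, as in step 1 */
CREATE TABLE products (code TEXT, product_name TEXT);
.mode ascii
.separator "\t" "\n"
.import --skip 1 mini.tsv products
EOS
rows=$(sqlite3 mini.db "SELECT count(*) FROM products;")
echo "$rows"   # prints 2
rm mini.tsv mini.db
```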
About optimisation, the only thing that works for me for a big CSV (6.2GB, 2.5 million rows, 185+ fields) is:
PRAGMA page_size = 32768;
Before: CSV 6.2GB => time 3m2 => DB 9.5GB
After: CSV 6.2GB => time 2m3 => DB 6.6GB
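The PRAGMA has to run before the database file is initialised, which is why it sits at the top of the script. A quick way to check it stuck (big_pages.db is a throwaway name; assumes the sqlite3 CLI):

```shell
# page_size must be set before the first write to a new database.
sqlite3 big_pages.db <<'EOS'
PRAGMA page_size = 32768;
CREATE TABLE t (x TEXT);
EOS
ps=$(sqlite3 big_pages.db "PRAGMA page_size;")
echo "$ps"   # prints 32768
rm big_pages.db
```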
It's open data, anyone should be able to reproduce my test (needs ~18GB):
wget https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv
CREATE=`curl -s -L https://gist.githubusercontent.com/CharlesNepote/80fb813a416ad445fdd6e4738b4c8156/raw/032af70de631ff1c4dd09d55360f242949dcc24f/create.sql`
INDEX=`curl -s -L https://gist.githubusercontent.com/CharlesNepote/80fb813a416ad445fdd6e4738b4c8156/raw/032af70de631ff1c4dd09d55360f242949dcc24f/index.sql`
time sqlite3 products_new.db <<EOS
/* Optimisations. See: https://avi.im/blag/2021/fast-sqlite-inserts/ */
PRAGMA page_size = 32768;
$CREATE
.mode ascii
.separator "\t" "\n"
.import --skip 1 en.openfoodfacts.org.products.csv all
$INDEX
EOS
Using the 2.5GB CSV from this may be good here: https://simonwillison.net/2022/Sep/29/webvid/
This came up on Discord: https://discord.com/channels/823971286308356157/823971286941302908/1017006129831739432
If you have a large (e.g. 9.5GB) CSV file it's not obvious how best to import it.
csvs-to-sqlite tries to load the whole thing into RAM, which isn't ideal. sqlite-utils insert can stream it, which is better, but it's still quite slow. The best option is actually to create the table manually and then use sqlite3 .import to import the CSV, as described here: https://til.simonwillison.net/sqlite/import-csv - but it's not exactly obvious! Documentation could help here.
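A minimal sketch of that recommended pattern, create the table by hand and then stream the file in with .import (file and table names here are hypothetical; --skip needs sqlite3 3.32+):

```shell
# Tiny stand-in for the large CSV.
printf 'id,name\n1,alpha\n2,beta\n' > big.csv
sqlite3 big.db <<'EOS'
/* Manual schema, so types are under our control */
CREATE TABLE rows (id TEXT, name TEXT);
.mode csv
.import --skip 1 big.csv rows
EOS
n=$(sqlite3 big.db "SELECT count(*) FROM rows;")
echo "$n"   # prints 2
rm big.csv big.db
```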
This is actually more of a "how-to" in the https://diataxis.fr/ framework as opposed to a tutorial.