CharlesNepote opened 1 year ago
As a Parquet file can be manipulated with SQL queries thanks to duckdb, it should be possible to merge new data with older data with a single SQL query (tests to be done).
@CharlesNepote thanks a lot for following up on our conversation on this topic, and for investigating the gains!
The gains in file size and query speed are impressive.
Two questions:
Your latest query is slow, but I expect the execution time to increase very slowly whether you select 2, 100, 1000, 10000, 1000000 or all products (i.e. the execution time is clearly dominated by the number of columns). Would you be able to confirm or correct my assumption?
Parquet makes it possible to split the dataset into files corresponding to subsets defined by criteria (e.g. days, for time series with sub-hourly data). I was thinking that the barcode could be a good key to split the dataset: e.g. one file for products with an EAN-8 code, another for products with EAN-13, and another for all other codes; or more fine-grained splits based on the GS1 code prefixes. I assume there are natural subsets that are frequently queried separately; e.g. 99% of scans in France would be on products within 1 to 5 GS1 prefixes, which could be used to cache results if relevant, or you could apply different processing routines or define different quality checks on these subsets.
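The barcode-based split above could be sketched like this (a hypothetical helper, not part of any Open Food Facts code; the partition names are made up):

```python
def partition_for(code: str) -> str:
    """Route a product barcode to a partition name (illustrative only)."""
    if len(code) == 8 and code.isdigit():
        return "ean8"
    if len(code) == 13 and code.isdigit():
        # Finer-grained split on the GS1 prefix (first 3 digits);
        # prefixes 300-379 are allocated to GS1 France.
        if 300 <= int(code[:3]) <= 379:
            return "ean13-fr"
        return "ean13-other"
    return "other"

print(partition_for("12345670"))       # ean8
print(partition_for("3017620422003"))  # ean13-fr (French GS1 prefix)
print(partition_for("0737628064502"))  # ean13-other
```

Each partition would then map to its own parquet file (or Hive-style directory).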
This issue is stale because it has been open 90 days with no activity.
A parquet version would be very welcomed indeed! Currently, parsing en.openfoodfacts.org.products.csv is difficult due to, apparently, 15 lines whose ingredients_text contains \" (for instance: \"ce produit contient des sulfites\").
Anyway, this DuckDB query seems to do the job: converting to a parquet version (840 MB).
COPY (
FROM read_csv_auto('C:/.../en.openfoodfacts.org.products.csv', quote='\\"')
) TO 'C:/.../en.openfoodfacts.org.products.parquet';
And then, this sample query
SELECT last_modified_by, count(*) FROM 'C:/.../en.openfoodfacts.org.products.parquet'
GROUP BY 1
ORDER BY 2 DESC LIMIT 10;
will run in 200 ms
Hi @CharlesNepote and other usual partners in (open data) crime!
This is a very encouraging discussion!
I am 100% in favor of having a parquet file alongside the CSV (since parquet is not yet universally known).
We aim to use parquet more and more in the French statistical system:
In Insee's innovation teams, we advocate a lot for a more general use of parquet files.
Tad is quite nice and its long-standing issue about proxies is now solved. I think there might be some variables that can be quite challenging if we want a small parquet file that can be handled easily. Among them I see: categories and ingredients.
They look like sparse columns, but when information is present they stack together very long strings.
This is, I think, the potential limit of parquet
for your use case: you don't always have flat information in your database system.
Maybe these columns could live in a separate parquet file, alongside a file that stores basic nutritional facts and basic product information (name, EAN, etc.). This could give a lightweight file allowing most users to handle basic nutritional data. A star schema allowing multiple variables in different files to be joined together could then do the trick.
I don't know if this idea is a good one. However, in my opinion, if you want to foster external web apps or dataviz connected to your dataset, this could help.
I have made some tests to evaluate:
(@ericemc3: the import seems to be OK if you provide quote=''.)
$ time ./duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.2023-11-24.csv', quote='')
) TO 'en.openfoodfacts.org.products.2023-11-24.parquet';
EOF
real 2m4,673s
$ ls -lh en.openfoodfacts.org.products.2023-11-24.parquet
-rw-r--r-- 1 charles charles 839M 25 nov. 10:45 en.openfoodfacts.org.products.2023-11-24.parquet
Does a smaller ROW_GROUP_SIZE than the default 122880 change things? With 50000, the size is 17 MB bigger (2%) and the time is 11 s shorter (9%).
$ time ./duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.2023-11-24.csv', quote='')
) TO 'en.openfoodfacts.org.products.2023-11-24.gps50000.parquet' (FORMAT PARQUET, ROW_GROUP_SIZE 50000);
EOF
real 1m53,181s
$ ls -lh en.openfoodfacts.org.products.2023-11-24.gps50000.parquet
-rw-r--r-- 1 charles charles 856M 25 nov. 11:30 en.openfoodfacts.org.products.2023-11-24.gps50000.parquet
Does a bigger ROW_GROUP_SIZE than the default 122880 change things? With 300000, the size is 13 MB smaller (1.5%) and the time is 15 s longer (12%).
$ time ./duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.2023-11-24.csv', quote='')
) TO 'en.openfoodfacts.org.products.2023-11-24.gps300000.parquet' (FORMAT PARQUET, ROW_GROUP_SIZE 300000);
EOF
real 2m19,405s
$ ls -lh en.openfoodfacts.org.products.2023-11-24.gps300000.parquet
-rw-r--r-- 1 charles charles 826M 25 nov. 11:22 en.openfoodfacts.org.products.2023-11-24.gps300000.parquet
Is compression useful? Clearly yes: size is 313 MB smaller (37%), with a small impact on speed (13s, 10%).
$ time ./duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.2023-11-24.csv', quote='')
) TO 'en.openfoodfacts.org.products.2023-11-24.zstd.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');
EOF
real 2m17,476s
$ ls -lh en.openfoodfacts.org.products.2023-11-24.zstd.parquet
-rw-r--r-- 1 charles charles 527M 25 nov. 11:18 en.openfoodfacts.org.products.2023-11-24.zstd.parquet
Compression + ROW_GROUP_SIZE 50000 does not make much of a difference: about the same time as with no options, and a size of 541 MB. In our context, priority should be given to file size, to lower the barrier to usage. Time savings during parquet creation are low (~10%) whatever the duckdb options. In conclusion, I would tend to use zstd compression without any other option.
In my next tests:
If we add the products that have been modified to the parquet file, we can query only the last version of each product with: select *, max(last_modified_date) from "products.parquet" group by code;
See: http://sqlfiddle.com/#!5/5be4d/1/0
TODO: see if it's possible to create views or some mechanisms to:
Is this useful?
SELECT code, arg_max(id, ldate::date) FROM tbl GROUP BY code;
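The "latest revision per code" pattern discussed here can be sketched with stdlib sqlite3 (schema and values are illustrative). SQLite has no arg_max, so this uses the portable correlated-MAX form, the same shape as query 7 further down:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE products (code TEXT, last_modified_datetime TEXT, product_name TEXT);
INSERT INTO products VALUES
  ('0001', '2023-01-01 10:00:00', 'Cookies v1'),
  ('0001', '2023-06-01 10:00:00', 'Cookies v2'),
  ('0002', '2023-03-01 10:00:00', 'Juice');
""")
# Keep, for each code, only the row with the newest last_modified_datetime.
rows = con.execute("""
    SELECT * FROM products t1
    WHERE last_modified_datetime =
      (SELECT max(last_modified_datetime) FROM products t2 WHERE t1.code = t2.code)
    ORDER BY code
""").fetchall()
print(rows)  # only 'Cookies v2' and 'Juice' survive
```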
Side note: for the Open Food Facts CSV it is very important to use sample_size = 3000000 to allow proper column type detection. Otherwise, many *_100g fields are detected as BIGINT or VARCHAR instead of DOUBLE. All *_100g fields should be detected as DOUBLE except nutrition-score-fr_100g and nutrition-score-uk_100g, which are BIGINT. With sample_size = 3000000 the import is longer, but type detection is impressive.
$ ./duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.csv', quote='', sample_size = 3000000)
) TO 'en.openfoodfacts.org.products.zstd.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');
EOF
$ ./duckdb x.db "DESCRIBE SELECT * FROM 'en.openfoodfacts.org.products.zstd.parquet';" -box
┌─────────────────────────────┬─────────────┬──────┬─────┬─────────┬───────┐
│ column_name                 │ column_type │ null │ key │ default │ extra │
├─────────────────────────────┼─────────────┼──────┼─────┼─────────┼───────┤
│ code                        │ VARCHAR     │ YES  │     │         │       │
│ url                         │ VARCHAR     │ YES  │     │         │       │
│ creator                     │ VARCHAR     │ YES  │     │         │       │
│ created_t                   │ BIGINT      │ YES  │     │         │       │
│ created_datetime            │ TIMESTAMP   │ YES  │     │         │       │
│ last_modified_t             │ BIGINT      │ YES  │     │         │       │
│ last_modified_datetime      │ TIMESTAMP   │ YES  │     │         │       │
│ last_modified_by            │ VARCHAR     │ YES  │     │         │       │
│ product_name                │ VARCHAR     │ YES  │     │         │       │
│ abbreviated_product_name    │ VARCHAR     │ YES  │     │         │       │
│ generic_name                │ VARCHAR     │ YES  │     │         │       │
│ quantity                    │ VARCHAR     │ YES  │     │         │       │
│ packaging                   │ VARCHAR     │ YES  │     │         │       │
│ packaging_tags              │ VARCHAR     │ YES  │     │         │       │
│ packaging_en                │ VARCHAR     │ YES  │     │         │       │
│ packaging_text              │ VARCHAR     │ YES  │     │         │       │
│ brands                      │ VARCHAR     │ YES  │     │         │       │
│ brands_tags                 │ VARCHAR     │ YES  │     │         │       │
│ categories                  │ VARCHAR     │ YES  │     │         │       │
│ categories_tags             │ VARCHAR     │ YES  │     │         │       │
│ categories_en               │ VARCHAR     │ YES  │     │         │       │
│ origins                     │ VARCHAR     │ YES  │     │         │       │
│ origins_tags                │ VARCHAR     │ YES  │     │         │       │
│ origins_en                  │ VARCHAR     │ YES  │     │         │       │
│ manufacturing_places        │ VARCHAR     │ YES  │     │         │       │
│ manufacturing_places_tags   │ VARCHAR     │ YES  │     │         │       │
│ labels                      │ VARCHAR     │ YES  │     │         │       │
│ labels_tags                 │ VARCHAR     │ YES  │     │         │       │
│ labels_en                   │ VARCHAR     │ YES  │     │         │       │
│ emb_codes                   │ VARCHAR     │ YES  │     │         │       │
│ emb_codes_tags              │ VARCHAR     │ YES  │     │         │       │
│ first_packaging_code_geo    │ VARCHAR     │ YES  │     │         │       │
│ cities                      │ VARCHAR     │ YES  │     │         │       │
│ cities_tags                 │ VARCHAR     │ YES  │     │         │       │
│ purchase_places             │ VARCHAR     │ YES  │     │         │       │
│ stores                      │ VARCHAR     │ YES  │     │         │       │
│ countries                   │ VARCHAR     │ YES  │     │         │       │
│ countries_tags              │ VARCHAR     │ YES  │     │         │       │
│ countries_en                │ VARCHAR     │ YES  │     │         │       │
│ ingredients_text            │ VARCHAR     │ YES  │     │         │       │
│ ingredients_tags            │ VARCHAR     │ YES  │     │         │       │
│ ingredients_analysis_tags   │ VARCHAR     │ YES  │     │         │       │
│ allergens                   │ VARCHAR     │ YES  │     │         │       │
│ allergens_en                │ VARCHAR     │ YES  │     │         │       │
│ traces                      │ VARCHAR     │ YES  │     │         │       │
│ traces_tags                 │ VARCHAR     │ YES  │     │         │       │
│ traces_en                   │ VARCHAR     │ YES  │     │         │       │
│ serving_size                │ VARCHAR     │ YES  │     │         │       │
│ serving_quantity            │ DOUBLE      │ YES  │     │         │       │
│ no_nutrition_data           │ VARCHAR     │ YES  │     │         │       │
│ additives_n                 │ BIGINT      │ YES  │     │         │       │
│ additives                   │ VARCHAR     │ YES  │     │         │       │
│ additives_tags              │ VARCHAR     │ YES  │     │         │       │
│ additives_en                │ VARCHAR     │ YES  │     │         │       │
│ nutriscore_score            │ BIGINT      │ YES  │     │         │       │
│ nutriscore_grade            │ VARCHAR     │ YES  │     │         │       │
│ nova_group                  │ BIGINT      │ YES  │     │         │       │
│ pnns_groups_1               │ VARCHAR     │ YES  │     │         │       │
│ pnns_groups_2               │ VARCHAR     │ YES  │     │         │       │
│ food_groups                 │ VARCHAR     │ YES  │     │         │       │
│ food_groups_tags            │ VARCHAR     │ YES  │     │         │       │
│ food_groups_en              │ VARCHAR     │ YES  │     │         │       │
│ states                      │ VARCHAR     │ YES  │     │         │       │
│ states_tags                 │ VARCHAR     │ YES  │     │         │       │
│ states_en                   │ VARCHAR     │ YES  │     │         │       │
│ brand_owner                 │ VARCHAR     │ YES  │     │         │       │
│ ecoscore_score              │ DOUBLE      │ YES  │     │         │       │
│ ecoscore_grade              │ VARCHAR     │ YES  │     │         │       │
│ nutrient_levels_tags        │ VARCHAR     │ YES  │     │         │       │
│ product_quantity            │ DOUBLE      │ YES  │     │         │       │
│ owner                       │ VARCHAR     │ YES  │     │         │       │
│ data_quality_errors_tags    │ VARCHAR     │ YES  │     │         │       │
│ unique_scans_n              │ BIGINT      │ YES  │     │         │       │
│ popularity_tags             │ VARCHAR     │ YES  │     │         │       │
│ completeness                │ DOUBLE      │ YES  │     │         │       │
│ last_image_t                │ BIGINT      │ YES  │     │         │       │
│ last_image_datetime         │ TIMESTAMP   │ YES  │     │         │       │
│ main_category               │ VARCHAR     │ YES  │     │         │       │
│ main_category_en            │ VARCHAR     │ YES  │     │         │       │
│ image_url                   │ VARCHAR     │ YES  │     │         │       │
│ image_small_url             │ VARCHAR     │ YES  │     │         │       │
│ image_ingredients_url       │ VARCHAR     │ YES  │     │         │       │
│ image_ingredients_small_url │ VARCHAR     │ YES  │     │         │       │
│ image_nutrition_url         │ VARCHAR     │ YES  │     │         │       │
│ image_nutrition_small_url   │ VARCHAR     │ YES  │     │         │       │
│ energy-kj_100g              │ DOUBLE      │ YES  │     │         │       │
│ energy-kcal_100g            │ DOUBLE      │ YES  │     │         │       │
│ energy_100g                 │ DOUBLE      │ YES  │     │         │       │
│ energy-from-fat_100g        │ DOUBLE      │ YES  │     │         │       │
│ fat_100g                    │ DOUBLE      │ YES  │     │         │       │
[...]
│ nutrition-score-fr_100g     │ BIGINT      │ YES  │     │         │       │
│ nutrition-score-uk_100g     │ BIGINT      │ YES  │     │         │       │
[...]
│ sulphate_100g               │ DOUBLE      │ YES  │     │         │       │
│ nitrate_100g                │ DOUBLE      │ YES  │     │         │       │
└─────────────────────────────┴─────────────┴──────┴─────┴─────────┴───────┘
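Why sample_size matters can be shown in miniature: a naive sniffer that only looks at the first N values will type a column from unrepresentative early rows (purely illustrative logic, not DuckDB's actual sniffer):

```python
def infer_type(values, sample_size):
    """Guess a column type from the first sample_size non-empty values."""
    sample = [v for v in values[:sample_size] if v != ""]
    if all(v.lstrip("-").isdigit() for v in sample):
        return "BIGINT"
    try:
        for v in sample:
            float(v)
        return "DOUBLE"
    except ValueError:
        return "VARCHAR"

# A *_100g-like column: early rows hold integers, a fractional value shows up late.
col = ["1", "3", "2"] * 1000 + ["2.5"]
print(infer_type(col, 100))       # BIGINT  (misdetected from a small sample)
print(infer_type(col, len(col)))  # DOUBLE
```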
I have started to build the bash script that creates the parquet conversion AND updates the parquet file with new products. It's below for those who are curious (simple bash with many comments; you can try it, and it should work out of the box if you're using Linux).
I'm still facing three issues.
I'm facing errors from DuckDB complaining about invalid UTF-8, but I'm not sure I can reproduce them reliably. The issue, and its solution below, seems to be well known: https://til.simonwillison.net/linux/iconv That said, I find it annoying to convert the CSV, wasting time and 8+ GB of disk for this.
I'm adding updated products to the previous parquet file, building a completely historized file (see #9355). But the SQL to query only the relevant data is not so easy. What works with SQLite doesn't work with DuckDB. I made it work with another query (see request 7 below), but it's more complex (and maybe bad for performance).
Last but not least, historization is based on last_modified_datetime, which does not take into account "silent" updates made by Product Opener. See #8860.
#!/usr/bin/env bash
# duckdb executable path (without trailing slash)
DP=~
# TODO: gather stats and save them a log file; save info about main operations in the log file
# 0. find whether there is an old parquet export in the directory
[[ -f "en.openfoodfacts.org.products.parquet" ]] && OLD=1 || OLD=0
# 1. download latest CSV
wget -c https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv # get CSV file
# Discard invalid characters
# duckdb doesn't like invalid UTF-8. It refused to read a parquet file built from such data, with the following error:
# Error: near line 1: Invalid Input Error: Invalid string encoding found in Parquet file: value "........."
# (occurring notably on this product: https://world.openfoodfacts.org/product/9900109008673?rev=4 )
# The issue, and its solution below, seems to be well-known: https://til.simonwillison.net/linux/iconv
iconv -f utf-8 -t utf-8 -c en.openfoodfacts.org.products.csv -o en.openfoodfacts.org.products.converted.csv
# 2. Create new temporary parquet file. From 2 to 5 minutes, depending on your machine
[[ -f "en.openfoodfacts.org.products.tmp.parquet" ]] && rm en.openfoodfacts.org.products.tmp.parquet
$DP/duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.converted.csv', quote='', sample_size=3000000, delim='\t')
) TO 'en.openfoodfacts.org.products.tmp.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');
EOF
# 3. If a parquet file is already present, then merge the new data
if [[ $OLD = 1 ]]; then
# Find the latest last_modified_datetime in the old parquet file
LATEST_PRODUCT_MODIFICATION=$(
$DP/duckdb :memory: -noheader -ascii -newline '' <<EOF
SELECT last_modified_datetime FROM read_parquet('en.openfoodfacts.org.products.parquet')
ORDER BY last_modified_datetime DESC LIMIT 1;
EOF
)
# TODO: if the last_modified_date is from today, do not update?
# Create a temporary duckdb DB to merge current parquet file with the new data
# (duckdb is not able to merge parquet files directly)
[[ -f "tempo.db" ]] && rm tempo.db
$DP/duckdb tempo.db <<EOF
CREATE TABLE products AS
SELECT * FROM read_parquet('en.openfoodfacts.org.products.parquet');
EOF
# Find all the products that have been modified, and insert them in the temporary duckdb DB
$DP/duckdb tempo.db <<EOF
INSERT INTO products
SELECT * FROM read_parquet('en.openfoodfacts.org.products.tmp.parquet')
WHERE last_modified_datetime > strptime('$LATEST_PRODUCT_MODIFICATION', '%Y-%m-%d %H:%M:%S'); -- %H (24-hour clock); %I would misparse afternoon timestamps
EOF
# Create the new parquet file based on the temporary duckdb DB
$DP/duckdb tempo.db <<EOF
COPY (SELECT * FROM products) TO 'en.openfoodfacts.org.products.new.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD')
EOF
else # if there isn't a previous parquet file, the temporary parquet file becomes the target result
mv en.openfoodfacts.org.products.tmp.parquet en.openfoodfacts.org.products.parquet
fi
# TODO: gather some stats (nb of products, last modified product, nb of new products, etc.)
<<COMMENTS
# Requests to verify all is OK: you can try them manually by copy/pasting
# Variable to find duckdb paths:
DP=~
# 0. Request to verify the total number of products (including updated ones)
$DP/duckdb :memory: -noheader -ascii -newline '' "
select count(*) from read_parquet('en.openfoodfacts.org.products.new.parquet');"
# 1. Request to verify the parquet file is coherent with good data in the right columns
# vd is Visidata, an awesome tool you have to know if you're dealing with serious data: https://www.visidata.org/
$DP/duckdb :memory: -csv "
select * from read_parquet('en.openfoodfacts.org.products.new.parquet')
where completeness > 0.99 -- products with a good level of completeness
order by last_modified_datetime limit 10;
" | vd -f csv
# 2. Request to verify the parquet file is coherent with good data in the right columns
# (too long!!! due to the random() sort order) but useful to verify the data are OK
$DP/duckdb :memory: -csv "
select * from read_parquet('en.openfoodfacts.org.products.new.parquet')
where completeness > 1.2 -- products with a good level of completeness
order by random() limit 10;
" | vd -f csv
# Previous tests before using iconv
# 3. simple request
$DP/duckdb :memory: <<EOF
select * from read_parquet('en.openfoodfacts.org.products.new.parquet') order by last_modified_datetime limit 5;
EOF
=> KO! Error: near line 1: Invalid Input Error: Invalid string encoding found in Parquet file: value "........."
=> apparently caused by this product: https://world.openfoodfacts.org/cgi/product.pl?type=edit&code=9900109008673
I tried again adding delim='\t' to see => same error.
# 4. request with a "where" clause, to check performance
$DP/duckdb :memory: <<EOF
select * from read_parquet('en.openfoodfacts.org.products.new.parquet') where countries_en like '%Germany' limit 5;
EOF
=> KO! Error: near line 1: Invalid Input Error: Invalid string encoding found in Parquet file: value "........."
=> apparently caused by this product: https://world.openfoodfacts.org/cgi/product.pl?type=edit&code=9900109008673
I tried again adding delim='\t' to see => same error.
# Requests to be tested again
# 5. multiple select + aggregation to see performance
time $DP/duckdb :memory: "select * FROM
(select count(data_quality_errors_tags) as products_with_issues
from read_parquet('en.openfoodfacts.org.products.new.parquet') where data_quality_errors_tags != ''),
(select count(data_quality_errors_tags) as products_with_issues_but_without_images
from read_parquet('en.openfoodfacts.org.products.new.parquet') where data_quality_errors_tags != '' and last_image_datetime is null);"
# 6. request to read the last version (ie the current) version of the database
$DP/duckdb :memory: <<EOF
select *, max(last_modified_datetime) from read_parquet('en.openfoodfacts.org.products.new.parquet')
group by code
order by last_modified_datetime desc
limit 5
;
EOF
=> KO! Error: near line 1: Binder Error: column "creator" must appear in the GROUP BY clause or must be part of an aggregate function.
Either add it to the GROUP BY list, or use "ANY_VALUE(creator)" if the exact value of "creator" is not important.
# 7. request to read the last version (ie the current) version of the database (tried another way)
# TODO: create a view?
$DP/duckdb :memory: <<EOF
select * from read_parquet('en.openfoodfacts.org.products.new.parquet') t1
where last_modified_datetime =
(select max(last_modified_datetime) from read_parquet('en.openfoodfacts.org.products.new.parquet') t2
where t1.code = t2.code)
order by last_modified_datetime desc
limit 5
;
EOF
# 8. same request as 7, but counting the number of products
$DP/duckdb :memory: <<EOF
select count(*) from read_parquet('en.openfoodfacts.org.products.new.parquet') t1
where last_modified_datetime =
(select max(last_modified_datetime) from read_parquet('en.openfoodfacts.org.products.new.parquet') t2
where t1.code = t2.code)
;
EOF
# 9. idem 7 but another way
$DP/duckdb :memory: <<EOF
SELECT DISTINCT ON (code)
code, *
FROM read_parquet('en.openfoodfacts.org.products.new.parquet')
ORDER BY code, last_modified_datetime DESC;
EOF
=> KO! Uses an enormous amount of memory
COMMENTS
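The merge step of the script above (append only the rows newer than the newest last_modified_datetime already present) can be sketched with stdlib sqlite3 standing in for the two parquet files (schema and values illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE products (code TEXT, last_modified_datetime TEXT);
CREATE TABLE fresh_dump (code TEXT, last_modified_datetime TEXT);
INSERT INTO products VALUES ('0001', '2023-11-01 08:00:00');
INSERT INTO fresh_dump VALUES
  ('0001', '2023-11-01 08:00:00'),  -- unchanged, must not be re-inserted
  ('0001', '2023-11-20 09:30:00'),  -- new revision
  ('0002', '2023-11-25 12:00:00');  -- new product
""")
# Like LATEST_PRODUCT_MODIFICATION in the script: compute the threshold first,
# then append only strictly newer rows.
threshold = con.execute(
    "SELECT max(last_modified_datetime) FROM products").fetchone()[0]
con.execute(
    "INSERT INTO products SELECT * FROM fresh_dump WHERE last_modified_datetime > ?",
    (threshold,))
count = con.execute("SELECT count(*) FROM products").fetchone()[0]
print(count)  # 3: the original row plus the two newer rows
```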
For query 7, I can suggest:
SET threads = 6; -- lowering the default thread count speeds things up a bit
FROM 'c:\...\en.openfoodfacts.org.products.parquet'
QUALIFY rank() OVER(PARTITION BY code ORDER BY last_modified_datetime DESC) = 1;
50 seconds on my Windows laptop.
Ideally, the parquet file would have been built pre-sorted on code and last_modified_datetime.
And for query 8:
WITH last_modified AS (
SELECT code FROM 'en.openfoodfacts.org.products.new.parquet'
QUALIFY max(last_modified_datetime) OVER(PARTITION BY code) = last_modified_datetime
)
SELECT count(*) FROM last_modified ;
1 second
Or, more simply, especially if last_modified_datetime is sometimes duplicated for a given code:
SELECT count(DISTINCT code) from 'en.openfoodfacts.org.products.new.parquet'
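DuckDB's QUALIFY filters directly on a window function; in standard SQL (and SQLite 3.25+) the same "latest revision per code" result needs a subquery around the window. A minimal sketch with stdlib sqlite3 (schema illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE products (code TEXT, last_modified_datetime TEXT, product_name TEXT);
INSERT INTO products VALUES
  ('0001', '2023-01-01 10:00:00', 'Cookies v1'),
  ('0001', '2023-06-01 10:00:00', 'Cookies v2'),
  ('0002', '2023-03-01 10:00:00', 'Juice');
""")
# Equivalent of: ... QUALIFY rank() OVER (PARTITION BY code
#                        ORDER BY last_modified_datetime DESC) = 1
rows = con.execute("""
    SELECT code, product_name FROM (
      SELECT *, rank() OVER (PARTITION BY code
                             ORDER BY last_modified_datetime DESC) AS rk
      FROM products
    ) WHERE rk = 1 ORDER BY code
""").fetchall()
print(rows)  # [('0001', 'Cookies v2'), ('0002', 'Juice')]
```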
Apache Parquet is a file format that aims to make data easier to manipulate.
Our current CSV (2023-03): 7.5 GB. A Parquet file generated from our current CSV (thanks to csv2parquet): 643 MB with internal zstd compression. Here are the few steps to reproduce:
wget -c https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv # get CSV file
csvclean -t en.openfoodfacts.org.products.csv # clean up the CSV and convert it from TSV to CSV -- ~6'30s on my laptop
./csv2parquet --header true -n en.openfoodfacts.org.products_out.csv products.pqt > parquet.schema
Then edit parquet.schema to fix two fields:
code field => "data_type": "Utf8" instead of "data_type": "Int64"
serving_quantity field => "data_type": "Float64" instead of "data_type": "Utf8"
time ./csv2parquet --header true -c zstd -s parquet.schema en.openfoodfacts.org.products_out.csv products_zstd.pqt # ~1'45s on my laptop
It's a "young" format (born in 2013), but some tools can already read and write Parquet files. The easiest way to read/write Parquet files is to use duckdb (as easy to install as sqlite), which reads and writes them natively (without import). To give a simple example:
The same query on a SQLite DB build from the same CSV export takes more than 10s:
You can reproduce this query thanks to our Datasette instance.
Due to the nature of the format (column-based), there is an overhead for some queries. select *, for example, needs to iterate over all the columns (dozens if not hundreds in our case). Here is an example extracting all columns of the first two products. That said, it should be a far better format than CSV for all operations not involving all the columns, and even for those when the queries are complex.
(Thanks @moreymat for suggesting that I explore it.)
[EDIT] Repeated the last test with 10,000 and 100,000 products: