openfoodfacts / openfoodfacts-server

Open Food Facts database, API server and web interface - πŸͺπŸ¦‹ Perl, CSS and JS coders welcome 😊 For helping in Python, see Robotoff or taxonomy-editor

Investigate database export as parquet file #7660


CharlesNepote commented 1 year ago

Apache Parquet is a columnar file format that aims to make data easier to manipulate.

Our current CSV (2023-03): 7.5 GB. The Parquet file generated from this CSV (thanks to csv2parquet): 643 MB with internal zstd compression. Here are the few steps to reproduce:
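A plausible invocation (assuming the csv2parquet CLI from arrow-tools; the --compression flag name should be double-checked against csv2parquet --help) would be:

csv2parquet --compression zstd en.openfoodfacts.org.products.csv products_zstd.pqt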

It's a "young" format (born in 2013). Some tools are already reading/writing Parquet files. The easiest way to read/write Parquet files is to use duckdb (as easy to install as sqlite), natively reading or writing it (without import). To give a simple example:

time ./duckdb test-duck.db "select * FROM
  (select count(data_quality_errors_tags) as products_with_issues 
    from read_parquet('products_zstd.pqt') where data_quality_errors_tags != ''),
  (select count(data_quality_errors_tags) as products_with_issues_but_without_images 
    from read_parquet('products_zstd.pqt') where data_quality_errors_tags != '' and last_image_datetime == '');"
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ products_with_issues β”‚ products_with_issues_but_without_images β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 156427               β”‚ 13974                                   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

real    0m0,250s
user    0m0,695s
sys 0m0,140s

The same query on a SQLite DB built from the same CSV export takes more than 10 s:

time sqlite3 products.db "select * FROM 
  (select count(data_quality_errors_tags) as products_with_issues
     from [all] where data_quality_errors_tags != ''),
  (select count(data_quality_errors_tags) as products_with_issues_but_without_images
     from [all] where data_quality_errors_tags != '' and last_image_datetime == '');"
156427|13974

real    0m11.038s
user    0m1.016s
sys 0m10.010s

You can reproduce this query using our Datasette instance.

Due to the nature of the format (column based), there is an overhead for some queries. select *, for example, needs to iterate over all the columns (dozens if not hundreds in our case). Here is an example extracting all columns of the first two products.

time ./duckdb test-duck.db "select * FROM read_parquet('products_zstd.pqt') limit 2;" > /dev/null

real    0m4,377s
user    0m7,068s
sys 0m3,642s

That said, it should be a far better format than CSV for all operations that don't involve every column, and even for those that do when the queries are complex.

(Thanks @moreymat for suggesting that I explore it.)

[EDIT] Repeated the last test with 10,000 and 100,000 products:

time ./duckdb test-duck.db "select * FROM read_parquet('products_zstd.pqt') limit 10000;" > /dev/null

real    0m5,876s
user    0m8,830s
sys 0m3,670s

time ./duckdb test-duck.db "select * FROM read_parquet('products_zstd.pqt') limit 100000;" > /dev/null

real    0m21,756s
user    0m25,963s
sys 0m4,618s
CharlesNepote commented 1 year ago

As a Parquet file can be manipulated with SQL queries thanks to duckdb, it should be possible to merge new data with older data with a single SQL query (tests to be done).
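A minimal sketch of such a merge (untested, file names assumed): load the previous Parquet file into a table, insert only the rows of the new export that are more recent, and write the result back.

./duckdb merge.db <<EOF
-- Sketch only: merge a newer export into an existing Parquet file
CREATE TABLE products AS SELECT * FROM read_parquet('products_old.parquet');
INSERT INTO products
  SELECT * FROM read_parquet('products_new.parquet')
  WHERE last_modified_datetime > (SELECT max(last_modified_datetime) FROM products);
COPY (SELECT * FROM products) TO 'products_merged.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');
EOF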

moreymat commented 1 year ago

@CharlesNepote thanks a lot for following up on our conversation on this topic, and investigating the gains!

The gains in file size and query speed are impressive.

Two questions:

  1. Your latest query is slow, but I expect the execution time to increase very slowly whether you select 2, 100, 1000, 10000, 1000000 or all products (i.e. the execution time is clearly dominated by the number of columns). Would you be able to confirm this, or correct my assumption?

  2. Parquet makes it possible to split the dataset into files corresponding to subsets defined by criteria (e.g. days for time series with sub-hourly data). I was thinking that the barcode could be a good key to split the dataset, e.g. one file for products with an EAN-8 code, another for products with an EAN-13 code, and another for all other codes; or more fine-grained splits based on the GS1 code prefixes. I assume these are natural subsets that are frequently queried separately, e.g. 99% of scans in France would be on products within 1 to 5 GS1 prefixes, which could be used to cache results if relevant, or you could apply different processing routines or define different quality checks on these subsets. (A rough sketch of such a partitioned write follows below.)
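As an illustration of the second point, DuckDB can write Hive-partitioned Parquet directories with PARTITION_BY; here is a rough, untested sketch splitting on barcode length (the derived code_type column is made up for the example):

./duckdb <<EOF
-- Sketch only: split the dataset into one Parquet file per barcode type
COPY (
  SELECT *,
         CASE WHEN length(code) = 8  THEN 'ean8'
              WHEN length(code) = 13 THEN 'ean13'
              ELSE 'other' END AS code_type
  FROM read_parquet('products_zstd.pqt')
) TO 'products_by_code_type' (FORMAT PARQUET, PARTITION_BY (code_type));
EOF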

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 90 days with no activity.

ericemc3 commented 10 months ago

A Parquet version would be very welcome indeed! Currently, parsing en.openfoodfacts.org.products.csv is difficult due to, apparently, 15 lines whose 'ingredients_text' contains \" (for instance: \"ce produit contient des sulfites\").

Anyway, this DuckDB query seems to do the job: converting to a Parquet version (840 MB).

COPY (
FROM read_csv_auto('C:/.../en.openfoodfacts.org.products.csv', quote='\\"') 
) TO 'C:/.../en.openfoodfacts.org.products.parquet';

And then, this sample query

SELECT last_modified_by, count(*) FROM 'C:/.../en.openfoodfacts.org.products.parquet' 
GROUP BY 1
ORDER BY 2 DESC LIMIT 10;

will run in 200 ms.

linogaliana commented 9 months ago

Hi @CharlesNepote and other usual partners in (open data) crime!

This is a very encouraging discussion!

I am πŸ’― % in favor of having a Parquet file alongside the CSV (since Parquet is not yet universally known).

Parquet files at Insee

We aim to use Parquet more and more in the French statistical system:

In Insee's innovation teams, we advocate a lot for a more general use of Parquet files.

Other sources of inspiration:

Challenge

I think some variables might be quite challenging if the goal is a small Parquet file that can be handled easily. Among them: categories and ingredients.

These columns are sparse but, when information is present, they stack together very long strings.
This is, I think, the potential limit of Parquet for your use case: you don't always have flat information in your database system.

Maybe these long-text columns could go into a separate Parquet file, while another file stores basic nutritional facts alongside basic product information (name, EAN, etc.). This would give a lightweight file allowing most users to handle basic nutritional data. A star schema allowing the variables in the different files to be joined back together could then do the trick.

I don't know if this idea is a good one. However, in my opinion, if you want to foster external web apps or dataviz connected to your dataset, this could help.
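A minimal sketch of this split (illustrative column and file names, untested): a lightweight "core" file with basic product and nutrition columns, a second file with the heavy text columns, joined back on code only when needed.

./duckdb <<EOF
-- Sketch only: star-schema-like split into a core file and a heavy text file
COPY (SELECT code, product_name, brands, quantity, energy_100g, fat_100g
      FROM 'en.openfoodfacts.org.products.parquet')
  TO 'products_core.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');
COPY (SELECT code, categories, ingredients_text
      FROM 'en.openfoodfacts.org.products.parquet')
  TO 'products_text.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');
-- Join the two files back together only when the long text columns are needed
SELECT c.code, c.product_name, t.ingredients_text
  FROM 'products_core.parquet' c
  JOIN 'products_text.parquet' t USING (code)
  LIMIT 5;
EOF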

CharlesNepote commented 7 months ago

I have made some tests to evaluate:

(@ericemc3 the import seems to be OK if you provide quote=''.)

$ time ./duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.2023-11-24.csv', quote='') 
) TO 'en.openfoodfacts.org.products.2023-11-24.parquet';
EOF

real 2m4,673s

$ ls -lh en.openfoodfacts.org.products.2023-11-24.parquet

-rw-r--r-- 1 charles charles 839M 25 nov.  10:45 en.openfoodfacts.org.products.2023-11-24.parquet

Does a ROW_GROUP_SIZE smaller than the 122880 default change things? With 50000, the size is 17 MB bigger (2%) and the time is 11 s shorter (9%).

$ time ./duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.2023-11-24.csv', quote='') 
) TO 'en.openfoodfacts.org.products.2023-11-24.gps50000.parquet' (FORMAT PARQUET, ROW_GROUP_SIZE 50000);
EOF

real    1m53,181s

$ ls -lh en.openfoodfacts.org.products.2023-11-24.gps50000.parquet

-rw-r--r-- 1 charles charles 856M 25 nov.  11:30 en.openfoodfacts.org.products.2023-11-24.gps50000.parquet

Does a ROW_GROUP_SIZE bigger than the 122880 default change things? With 300000, the size is 13 MB smaller (1.5%) and the time is 15 s longer (12%).

$ time ./duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.2023-11-24.csv', quote='') 
) TO 'en.openfoodfacts.org.products.2023-11-24.gps300000.parquet' (FORMAT PARQUET, ROW_GROUP_SIZE 300000);
EOF

real    2m19,405s

$ ls -lh en.openfoodfacts.org.products.2023-11-24.gps300000.parquet

-rw-r--r-- 1 charles charles 826M 25 nov.  11:22 en.openfoodfacts.org.products.2023-11-24.gps300000.parquet

Is compression useful? Clearly yes: the size is 313 MB smaller (37%), with a small impact on speed (13 s, 10%).

$ time ./duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.2023-11-24.csv', quote='') 
) TO 'en.openfoodfacts.org.products.2023-11-24.zstd.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');
EOF

real    2m17,476s

$  ls -lh en.openfoodfacts.org.products.2023-11-24.zstd.parquet

-rw-r--r-- 1 charles charles 527M 25 nov.  11:18 en.openfoodfacts.org.products.2023-11-24.zstd.parquet

Compression + ROW_GROUP_SIZE 50000 doesn't make much of a difference: roughly the same time as with no options, and a size of 541 MB. In our context, priority should be given to file size, to lower the barrier to usage. Time savings during Parquet creation are low (~10%) whichever duckdb options are used. In conclusion, I would tend to use zstd compression without any other option.
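To check how a given file actually turned out (row group sizes, per-column compression codec), DuckDB's parquet_metadata() function can be used; the column names below are taken from the DuckDB docs from memory, so double-check them:

$ ./duckdb <<EOF
-- Inspect row groups and per-column compression of the generated file
SELECT row_group_id, row_group_num_rows, path_in_schema, compression
FROM parquet_metadata('en.openfoodfacts.org.products.2023-11-24.zstd.parquet')
LIMIT 10;
EOF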

In my next tests:

CharlesNepote commented 7 months ago

If we add the products that have been modified to the Parquet file, we can query only the latest version of each product with: select *, max(last_modified_date) from "products.parquet" group by code;

See: http://sqlfiddle.com/#!5/5be4d/1/0

TODO: see if it's possible to create views or some mechanisms to:

ericemc3 commented 7 months ago

Is this useful? SELECT code, arg_max(id, ldate::date) FROM tbl GROUP BY code;

CharlesNepote commented 7 months ago

Side note: for the Open Food Facts CSV it is very important to use sample_size = 3000000 to allow column type detection. Otherwise, many *_100g fields are detected as BIGINT or VARCHAR instead of DOUBLE.

All *_100g fields should be detected as DOUBLE except nutrition-score-fr_100g and nutrition-score-uk_100g, which are BIGINT. Using sample_size = 3000000, the import takes longer but the type detection is impressive.

$ ./duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.csv', quote='', sample_size = 3000000) 
) TO 'en.openfoodfacts.org.products.zstd.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');
EOF

$ ./duckdb x.db "DESCRIBE SELECT * FROM 'en.openfoodfacts.org.products.zstd.parquet';" -box
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      column_name                      β”‚ column_type β”‚ null β”‚ key β”‚ default β”‚ extra β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€
β”‚ code                                                  β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ url                                                   β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ creator                                               β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ created_t                                             β”‚ BIGINT      β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ created_datetime                                      β”‚ TIMESTAMP   β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ last_modified_t                                       β”‚ BIGINT      β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ last_modified_datetime                                β”‚ TIMESTAMP   β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ last_modified_by                                      β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ product_name                                          β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ abbreviated_product_name                              β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ generic_name                                          β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ quantity                                              β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ packaging                                             β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ packaging_tags                                        β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ packaging_en                                          β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ packaging_text                                        β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ brands                                                β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ brands_tags                                           β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ categories                                            β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ categories_tags                                       β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ categories_en                                         β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ origins                                               β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ origins_tags                                          β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ origins_en                                            β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ manufacturing_places                                  β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ manufacturing_places_tags                             β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ labels                                                β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ labels_tags                                           β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ labels_en                                             β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ emb_codes                                             β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ emb_codes_tags                                        β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ first_packaging_code_geo                              β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ cities                                                β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ cities_tags                                           β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ purchase_places                                       β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ stores                                                β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ countries                                             β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ countries_tags                                        β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ countries_en                                          β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ ingredients_text                                      β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ ingredients_tags                                      β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ ingredients_analysis_tags                             β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ allergens                                             β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ allergens_en                                          β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ traces                                                β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ traces_tags                                           β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ traces_en                                             β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ serving_size                                          β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ serving_quantity                                      β”‚ DOUBLE      β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ no_nutrition_data                                     β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ additives_n                                           β”‚ BIGINT      β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ additives                                             β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ additives_tags                                        β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ additives_en                                          β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ nutriscore_score                                      β”‚ BIGINT      β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ nutriscore_grade                                      β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ nova_group                                            β”‚ BIGINT      β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ pnns_groups_1                                         β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ pnns_groups_2                                         β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ food_groups                                           β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ food_groups_tags                                      β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ food_groups_en                                        β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ states                                                β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ states_tags                                           β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ states_en                                             β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ brand_owner                                           β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ ecoscore_score                                        β”‚ DOUBLE      β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ ecoscore_grade                                        β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ nutrient_levels_tags                                  β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ product_quantity                                      β”‚ DOUBLE      β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ owner                                                 β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ data_quality_errors_tags                              β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ unique_scans_n                                        β”‚ BIGINT      β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ popularity_tags                                       β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ completeness                                          β”‚ DOUBLE      β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ last_image_t                                          β”‚ BIGINT      β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ last_image_datetime                                   β”‚ TIMESTAMP   β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ main_category                                         β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ main_category_en                                      β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ image_url                                             β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ image_small_url                                       β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ image_ingredients_url                                 β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ image_ingredients_small_url                           β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ image_nutrition_url                                   β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ image_nutrition_small_url                             β”‚ VARCHAR     β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ energy-kj_100g                                        β”‚ DOUBLE      β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ energy-kcal_100g                                      β”‚ DOUBLE      β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ energy_100g                                           β”‚ DOUBLE      β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ energy-from-fat_100g                                  β”‚ DOUBLE      β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ fat_100g                                              β”‚ DOUBLE      β”‚ YES  β”‚     β”‚         β”‚       β”‚
[...]
β”‚ nutrition-score-fr_100g                               β”‚ BIGINT      β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ nutrition-score-uk_100g                               β”‚ BIGINT      β”‚ YES  β”‚     β”‚         β”‚       β”‚
[...]
β”‚ sulphate_100g                                         β”‚ DOUBLE      β”‚ YES  β”‚     β”‚         β”‚       β”‚
β”‚ nitrate_100g                                          β”‚ DOUBLE      β”‚ YES  β”‚     β”‚         β”‚       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜
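An alternative to a very large sample_size worth testing would be to let DuckDB sniff types over the whole file with sample_size=-1, or to pin only the problematic columns; the types parameter name is recalled from the DuckDB read_csv documentation and should be double-checked:

$ ./duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.csv', quote='',
                   sample_size=-1, types={'energy_100g': 'DOUBLE'})
) TO 'en.openfoodfacts.org.products.zstd.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');
EOF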
CharlesNepote commented 7 months ago

I have started to build the bash script that creates the Parquet conversion AND updates the Parquet file with new products. It is below for those who are curious (it's simple bash with many comments; you can try it, and it should work out of the box if you're using Linux).

I'm still facing three issues.

  1. I'm getting errors from DuckDB complaining about invalid UTF-8, but I'm not sure I can reproduce them reliably. The issue, and its solution below, seem to be well known: https://til.simonwillison.net/linux/iconv Still, I find it annoying to convert the CSV and waste time and 8 GB+ for this.

  2. I'm adding updated products to the previous parquet file, building a completely historized file (see #9355). But the SQL to query only the relevant data is not so easy. What works with SQLite doesn't work with DuckDB. I made it work with another query (see request 7 below), but it's more complex (and maybe bad for performance).

  3. Last, but not least, historization is based on last_modified_datetime, which does not take into account "silent" updates made by Product Opener. See #8860.

#!/usr/bin/env bash

# duckdb executable path (without trailing slash)
DP=~

# TODO: gather stats and save them a log file; save info about main operations in the log file

# 0. check whether there is an old parquet export in the directory
[[ -f "en.openfoodfacts.org.products.parquet" ]] && OLD=1 || OLD=0

# 1. download latest CSV
wget -c https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv # get CSV file

# Discard invalid characters
# duckdb doesn't like invalid UTF-8. It refused to read a parquet file built from such data, with the following error:
# Error: near line 1: Invalid Input Error: Invalid string encoding found in Parquet file: value "........."
# (occurring namely on this product: https://world.openfoodfacts.org/product/9900109008673?rev=4 )
# The issue, and its solution below, seems to be well-known: https://til.simonwillison.net/linux/iconv
iconv -f utf-8 -t utf-8 -c en.openfoodfacts.org.products.csv -o en.openfoodfacts.org.products.converted.csv

# 2. Create new temporary parquet file. From 2 to 5 minutes, depending on your machine
[[ -f "en.openfoodfacts.org.products.tmp.parquet" ]] && rm en.openfoodfacts.org.products.tmp.parquet
$DP/duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.converted.csv', quote='', sample_size=3000000, delim='\t') 
) TO 'en.openfoodfacts.org.products.tmp.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');
EOF

# 3. If a parquet file is already present, then merge the new data
if [[ $OLD = 1 ]]; then
  # Find the latest last_modified_datetime in the old parquet file
  LATEST_PRODUCT_MODIFICATION=$(
  $DP/duckdb :memory: -noheader -ascii -newline '' <<EOF
  SELECT last_modified_datetime FROM read_parquet('en.openfoodfacts.org.products.parquet')
    ORDER BY last_modified_datetime DESC LIMIT 1;
EOF
  )
  # TODO: if the last_modified_date is from today, do not update?

  # Create a temporary duckdb DB to merge current parquet file with the new data
  # (duckdb is not able to merge parquet files directly)
  [[ -f "tempo.db" ]] && rm tempo.db
  $DP/duckdb tempo.db <<EOF
  CREATE TABLE products AS
    SELECT * FROM read_parquet('en.openfoodfacts.org.products.parquet');
EOF

  # Find all the products that have been modified, and insert them in the temporary duckdb DB
  $DP/duckdb tempo.db <<EOF
  INSERT INTO products
    SELECT * FROM read_parquet('en.openfoodfacts.org.products.tmp.parquet')
    -- %H (24-hour clock) is needed here; %I would misparse afternoon timestamps
    WHERE last_modified_datetime > strptime('$LATEST_PRODUCT_MODIFICATION', '%Y-%m-%d %H:%M:%S');
EOF

  # Create the new parquet file based on the temporary duckdb DB
  $DP/duckdb tempo.db <<EOF
  COPY (SELECT * FROM products) TO 'en.openfoodfacts.org.products.new.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD')
EOF

else # if there isn't a previous parquet file, the temporary parquet file becomes the target result
  mv en.openfoodfacts.org.products.tmp.parquet en.openfoodfacts.org.products.parquet
fi

# TODO: gather some stats (nb of products, last modified product, nb of new products, etc.)

<<COMMENTS

# Requests to verify all is OK: you can try them manually by copy/pasting

# Variable to find duckdb paths:
DP=~

# 0. Request to verify the total number of products (including updated ones)
$DP/duckdb :memory: -noheader -ascii -newline '' "
select count(*) from read_parquet('en.openfoodfacts.org.products.new.parquet');"

# 1. Request to verify the parquet file is coherent with good data in the right columns
# vd is Visidata, an awesome tool you have to know if you're dealing with serious data: https://www.visidata.org/
$DP/duckdb :memory: -csv "
select * from read_parquet('en.openfoodfacts.org.products.new.parquet')
where completeness > 0.99 -- products with a good level of completeness
order by last_modified_datetime limit 10;
" | vd -f csv

# 2. Request to verify the parquet file is coherent with good data in the right columns
# (too long!!!) request (due to random() sort order), but useful to verify data are ok
$DP/duckdb :memory: -csv "
select * from read_parquet('en.openfoodfacts.org.products.new.parquet')
where completeness > 1.2 -- products with a good level of completeness
order by random() limit 10;
" | vd -f csv

# Previous tests before using iconv
  # 3. simple request
  $DP/duckdb :memory: <<EOF
  select * from read_parquet('en.openfoodfacts.org.products.new.parquet') order by last_modified_datetime limit 5;
  EOF
  => KO! Error: near line 1: Invalid Input Error: Invalid string encoding found in Parquet file: value "........."
  => apparently because of this product: https://world.openfoodfacts.org/cgi/product.pl?type=edit&code=9900109008673
  I tried again adding delim='\t' to see => same error.

  # 4. request with a "where" clause, to check performance
  $DP/duckdb :memory: <<EOF
  select * from read_parquet('en.openfoodfacts.org.products.new.parquet') where countries_en like '%Germany' limit 5;
  EOF
  => KO! Error: near line 1: Invalid Input Error: Invalid string encoding found in Parquet file: value "........."
  => apparently because of this product: https://world.openfoodfacts.org/cgi/product.pl?type=edit&code=9900109008673
  I tried again adding delim='\t' to see => same error.

# Requests to be tested again

# 5. multiple select + aggregation to see performance
time $DP/duckdb :memory: "select * FROM
  (select count(data_quality_errors_tags) as products_with_issues 
    from read_parquet('en.openfoodfacts.org.products.new.parquet') where data_quality_errors_tags != ''),
  (select count(data_quality_errors_tags) as products_with_issues_but_without_images 
    from read_parquet('en.openfoodfacts.org.products.new.parquet') where data_quality_errors_tags != '' and last_image_datetime is null);"

# 6. request to read the last (ie the current) version of the database
$DP/duckdb :memory: <<EOF
select *, max(last_modified_datetime) from read_parquet('en.openfoodfacts.org.products.new.parquet')
  group by code
  order by last_modified_datetime desc
  limit 5
  ;
EOF
=> KO! Error: near line 1: Binder Error: column "creator" must appear in the GROUP BY clause or must be part of an aggregate function.
Either add it to the GROUP BY list, or use "ANY_VALUE(creator)" if the exact value of "creator" is not important.

# 7. request to read the last (ie the current) version of the database (tried another way)
# TODO: create a view?
$DP/duckdb :memory: <<EOF
select * from read_parquet('en.openfoodfacts.org.products.new.parquet') t1
  where last_modified_datetime = 
    (select max(last_modified_datetime) from read_parquet('en.openfoodfacts.org.products.new.parquet') t2
     where t1.code = t2.code)
  order by last_modified_datetime desc
  limit 5
  ;
EOF

# 8. same request as 7, but counting the number of products
$DP/duckdb :memory: <<EOF
select count(*) from read_parquet('en.openfoodfacts.org.products.new.parquet') t1
  where last_modified_datetime = 
    (select max(last_modified_datetime) from read_parquet('en.openfoodfacts.org.products.new.parquet') t2
     where t1.code = t2.code)
  ;
EOF

# 9. same as 7 but another way
$DP/duckdb :memory: <<EOF
SELECT DISTINCT ON (code)
code, * 
FROM read_parquet('en.openfoodfacts.org.products.new.parquet')
ORDER BY code, last_modified_datetime DESC;
EOF
=> KO! Uses a huge amount of memory

COMMENTS
ericemc3 commented 6 months ago

For request 7, I can suggest:

SET threads = 6; -- I reduce the default number of threads, it speeds things up a bit

FROM 'c:\...\en.openfoodfacts.org.products.parquet'
QUALIFY rank() OVER(PARTITION BY code ORDER BY last_modified_datetime DESC) = 1;

50 seconds on my Windows laptop.

Ideally, the Parquet file would have been built by sorting it beforehand on code and last_modified_datetime.
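Building the Parquet file pre-sorted like this could look roughly as follows (untested sketch, output file name made up):

-- Sketch only: write the Parquet file sorted by code and last_modified_datetime
COPY (
  SELECT * FROM 'en.openfoodfacts.org.products.new.parquet'
  ORDER BY code, last_modified_datetime
) TO 'en.openfoodfacts.org.products.sorted.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');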

ericemc3 commented 6 months ago

And for request 8:

WITH last_modified AS (
     SELECT code FROM 'en.openfoodfacts.org.products.new.parquet'
     QUALIFY max(last_modified_datetime) OVER(PARTITION BY code) = last_modified_datetime
)
SELECT count(*) FROM last_modified ;

1 second

Or more simply, especially if last_modified_datetime is sometimes duplicated for the same code:

SELECT count(DISTINCT code) from 'en.openfoodfacts.org.products.new.parquet'