
Generate and update CSV and RDF exports through a PostgreSQL table #10326

Open stephanegigandet opened 2 months ago

stephanegigandet commented 2 months ago

It currently takes 4 hours every day (with 1 CPU fully used the whole time) to generate the 4 CSV and RDF exports:

-rw-r--r--   1 off  off    913482534 May 20 08:31 en.openfoodfacts.org.products.csv.gz
-rw-r--r--   1 off  off    925173613 May 20 08:33 fr.openfoodfacts.org.products.csv.gz
-rw-r--r--   1 off  off    457025800 May 20 08:35 en.openfoodfacts.org.products.rdf.gz
-rw-r--r--   1 off  off    457007947 May 20 08:36 fr.openfoodfacts.org.products.rdf.gz

(+ a long query to MongoDB to go through all products)

The export is done with this script: https://github.com/openfoodfacts/openfoodfacts-server/blob/main/scripts/export_database.pl

The script goes through products one by one, copies some fields, converts others (e.g. UNIX timestamps to ISO 8601), adds French and English translations for a few tag fields (categories, labels, etc.), flattens some arrays into strings, and that's it.

I'm thinking that instead of going through all products in MongoDB every day, we could maintain a separate database similar to the one used in https://github.com/openfoodfacts/openfoodfacts-query, with the same columns we export in the CSV. We would then update only changed products, using Redis.
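As a rough sketch of what such a table could look like (table and column names below are illustrative only, not an actual off-query schema):

-- Hypothetical table mirroring the CSV export columns, updated per product
-- from Redis events instead of being rebuilt from MongoDB every day.
CREATE TABLE IF NOT EXISTS product_export (
    code                    text PRIMARY KEY,
    url                     text,
    created_t               bigint,
    created_datetime        timestamptz,
    last_modified_t         bigint,
    last_modified_datetime  timestamptz,
    product_name            text,
    packaging_tags          text,
    brands_tags             text,
    categories_tags         text,
    ingredients_text        text,
    ecoscore_score          numeric,
    ecoscore_grade          text
    -- ... remaining CSV columns
);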

There is also an RDF export that we generate at the same time as the CSV; one solution could be to generate and store the RDF for each product in a separate column.
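If we store the RDF next to the CSV columns, a minimal sketch (again with made-up names) would be:

-- Pre-rendered RDF fragment for each product, regenerated on every
-- Redis-triggered update; the nightly RDF dump then only concatenates the
-- fragments inside the document header and footer.
ALTER TABLE product_export ADD COLUMN rdf_fragment text;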

john-gom commented 2 months ago

I did a bit of an experiment using the Postgres database that contains all the product data, as previously mentioned in #8620

You can get a lot of the data needed using the standard Postgres COPY TO command, e.g.

copy (
  select
     code
    ,'http://world-en.openfoodfacts.org/product/' || code url
    ,data->'created_t' created_t
    ,to_char(to_timestamp((data->>'created_t')::int), 'YYYY-MM-DD"T"HH24:MI:SS"Z"') created_datetime
    ,data->'last_modified_t' last_modified_t
    ,to_char(to_timestamp((data->>'last_modified_t')::int), 'YYYY-MM-DD"T"HH24:MI:SS"Z"') last_modified_datetime
    ,data->>'product_name' product_name
    ,array_to_string(ARRAY(SELECT json_array_elements_text(data->'packaging_tags')), ', ') packaging_tags
    ,array_to_string(ARRAY(SELECT json_array_elements_text(data->'brands_tags')), ', ') brands_tags
    ,array_to_string(ARRAY(SELECT json_array_elements_text(data->'categories_tags')), ', ') categories_tags
    ,data->>'ingredients_text' ingredients_text
    ,data->'ecoscore_data'->'score' ecoscore_score
    ,data->'ecoscore_data'->>'grade' ecoscore_grade
    ,array_to_string(ARRAY(SELECT json_array_elements_text(data->'data_quality_errors_tags')), ', ') data_quality_errors_tags
    ,data->'nutriments'->'energy-kj_100g' "energy-kj_100g"
    ,data->'nutriments'->'energy-kcal_100g' "energy-kcal_100g"
    ,data->'nutriments'->'energy_100g' "energy_100g"
  from product
  limit 1000
) to '/mnt/e/products.tsv' DELIMITER E'\t' NULL AS ' ' CSV HEADER;

However, there are a number of issues:

  1. Even that limited set of columns would take about 4 hours to run over the full database locally (it took about 5 seconds for 1,000 records)
  2. Without the taxonomies loaded it isn't possible to generate the English and French translations of the tags
  3. Special processing on things like the URL (appending an escaped version of the product name) would be tricky
  4. Null handling isn't ideal
  5. It's more difficult to trim out unwanted characters

From my previous experience with off-query I think that storing the data in a relational format rather than JSON will fix the performance issue (I did a small experiment and 1000 records took about 100ms using this approach).
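For comparison, a sketch of the same export once the fields are plain relational columns (using the hypothetical product_export table above), with no per-row JSONB extraction:

copy (
  select code, url, created_t, created_datetime, last_modified_t,
         last_modified_datetime, product_name, packaging_tags, brands_tags,
         categories_tags, ingredients_text, ecoscore_score, ecoscore_grade
  from product_export
) to '/mnt/e/products.tsv' DELIMITER E'\t' NULL AS ' ' CSV HEADER;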

My suggestion would be that we pre-process the data before loading it into the off-query database to resolve issues 2, 3 and 5.

In addition, in order to address 2, we would need the taxonomies loaded into off-query (I have some code to do that), and we would need to canonicalize all of the product's tag values before saving so that we can do a direct SQL join to the taxonomy translations table.
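For illustration, that join could look roughly like this, assuming a taxonomy translation table and a per-product tag table along these lines (all names are hypothetical, not the actual off-query schema):

-- taxonomy_translation(taxonomy, tag_id, lc, name): one row per tag and language
-- product_categories_tag(product_id, tag_id, ordinal): canonicalized tags per product
select p.code,
       string_agg(t_en.name, ',' order by pc.ordinal) as categories_en,
       string_agg(t_fr.name, ',' order by pc.ordinal) as categories_fr
from product p
join product_categories_tag pc on pc.product_id = p.id
left join taxonomy_translation t_en
       on t_en.taxonomy = 'categories' and t_en.tag_id = pc.tag_id and t_en.lc = 'en'
left join taxonomy_translation t_fr
       on t_fr.taxonomy = 'categories' and t_fr.tag_id = pc.tag_id and t_fr.lc = 'fr'
group by p.code;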

Adding this extra processing in the off-query import code will slow down the import a bit, but shouldn't have a big impact on the incremental sync from Redis.

In summary, the work items to address this using an off-query approach would be: