
Generate and update CSV and RDF exports through a PostgreSQL table #10326

Open stephanegigandet opened 2 months ago

stephanegigandet commented 2 months ago

It currently takes 4 hours every day (with 1 CPU fully used the whole time) to generate the 4 CSV and RDF exports:

-rw-r--r--   1 off  off    913482534 May 20 08:31 en.openfoodfacts.org.products.csv.gz
-rw-r--r--   1 off  off    925173613 May 20 08:33 fr.openfoodfacts.org.products.csv.gz
-rw-r--r--   1 off  off    457025800 May 20 08:35 en.openfoodfacts.org.products.rdf.gz
-rw-r--r--   1 off  off    457007947 May 20 08:36 fr.openfoodfacts.org.products.rdf.gz

(+ a long query to MongoDB to go through all products)

The export is done with this script: https://github.com/openfoodfacts/openfoodfacts-server/blob/main/scripts/export_database.pl

The script goes through products one by one, copies some fields, converts others (e.g. UNIX timestamps to ISO 8601), adds French and English translations for a few tag fields (categories, labels, etc.), flattens some arrays into strings, and that's it.

I'm thinking that instead of going through all products in MongoDB every day, we could maintain a separate database similar to the one used in https://github.com/openfoodfacts/openfoodfacts-query, with the same columns we export in the CSV. We would then update only changed products, using Redis.
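As a rough sketch of what such a table could look like (table and column names below are illustrative only, not an actual off-query schema):

-- Hypothetical table mirroring the CSV export columns, updated per product
-- from Redis events instead of being rebuilt from MongoDB every day.
CREATE TABLE IF NOT EXISTS product_export (
    code                    text PRIMARY KEY,
    url                     text,
    created_t               bigint,
    created_datetime        timestamptz,
    last_modified_t         bigint,
    last_modified_datetime  timestamptz,
    product_name            text,
    packaging_tags          text,
    brands_tags             text,
    categories_tags         text,
    ingredients_text        text,
    ecoscore_score          numeric,
    ecoscore_grade          text
    -- ... remaining CSV columns
);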

There is also an RDF export that we generate at the same time as the CSV; one solution could be to generate and store the RDF for each product in a separate column.
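If we store the RDF next to the CSV columns, a minimal sketch (again with made-up names) would be:

-- Pre-rendered RDF fragment for each product, regenerated on every
-- Redis-triggered update; the nightly RDF dump then only concatenates the
-- fragments inside the document header and footer.
ALTER TABLE product_export ADD COLUMN rdf_fragment text;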

john-gom commented 2 months ago

I did a bit of an experiment using the Postgres database that contains all the product data, as previously mentioned in #8620

You can get a lot of the data needed using the standard Postgres COPY TO command, e.g.

copy (
  select
     code
    ,'http://world-en.openfoodfacts.org/product/' || code url
    ,data->'created_t' created_t
    ,to_char(to_timestamp((data->>'created_t')::int), 'YYYY-MM-DD"T"HH24:MI:SS"Z"') created_datetime
    ,data->'last_modified_t' last_modified_t
    ,to_char(to_timestamp((data->>'last_modified_t')::int), 'YYYY-MM-DD"T"HH24:MI:SS"Z"') last_modified_datetime
    ,data->>'product_name' product_name
    ,array_to_string(ARRAY(SELECT json_array_elements_text(data->'packaging_tags')), ', ') packaging_tags
    ,array_to_string(ARRAY(SELECT json_array_elements_text(data->'brands_tags')), ', ') brands_tags
    ,array_to_string(ARRAY(SELECT json_array_elements_text(data->'categories_tags')), ', ') categories_tags
    ,data->>'ingredients_text' ingredients_text
    ,data->'ecoscore_data'->'score' ecoscore_score
    ,data->'ecoscore_data'->>'grade' ecoscore_grade
    ,array_to_string(ARRAY(SELECT json_array_elements_text(data->'data_quality_errors_tags')), ', ') data_quality_errors_tags
    ,data->'nutriments'->'energy-kj_100g' "energy-kj_100g"
    ,data->'nutriments'->'energy-kcal_100g' "energy-kcal_100g"
    ,data->'nutriments'->'energy_100g' "energy_100g"
  from product
  limit 1000
) to '/mnt/e/products.tsv' DELIMITER E'\t' NULL AS ' ' CSV HEADER;

However, there are a number of issues:

  1. Even that limited set of columns would take about 4 hours to run over the full database locally (it took about 5 seconds for 1,000 records)
  2. Without the taxonomies loaded it isn't possible to generate the English and French translations of the tags
  3. Special processing on things like the URL (appending an escaped version of the product name) would be tricky
  4. Null handling isn't ideal
  5. It's more difficult to trim out unwanted characters

From my previous experience with off-query I think that storing the data in a relational format rather than JSON will fix the performance issue (I did a small experiment and 1000 records took about 100ms using this approach).
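For comparison, a sketch of the same export once the fields are plain relational columns (using the hypothetical product_export table above), with no per-row JSONB extraction:

copy (
  select code, url, created_t, created_datetime, last_modified_t,
         last_modified_datetime, product_name, packaging_tags, brands_tags,
         categories_tags, ingredients_text, ecoscore_score, ecoscore_grade
  from product_export
) to '/mnt/e/products.tsv' DELIMITER E'\t' NULL AS ' ' CSV HEADER;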

My suggestion would be that we pre-process the data before loading it into the off-query database to resolve issues 2, 3 and 5.

In addition, in order to address 2, we would need the taxonomies loaded into off-query (I have some code to do that), and we would need to canonicalize all of the product's tag values before saving so that we can do a direct SQL join to the taxonomy translations table.
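For illustration, that join could look roughly like this, assuming a taxonomy translation table and a per-product tag table along these lines (all names are hypothetical, not the actual off-query schema):

-- taxonomy_translation(taxonomy, tag_id, lc, name): one row per tag and language
-- product_categories_tag(product_id, tag_id, ordinal): canonicalized tags per product
select p.code,
       string_agg(t_en.name, ',' order by pc.ordinal) as categories_en,
       string_agg(t_fr.name, ',' order by pc.ordinal) as categories_fr
from product p
join product_categories_tag pc on pc.product_id = p.id
left join taxonomy_translation t_en
       on t_en.taxonomy = 'categories' and t_en.tag_id = pc.tag_id and t_en.lc = 'en'
left join taxonomy_translation t_fr
       on t_fr.taxonomy = 'categories' and t_fr.tag_id = pc.tag_id and t_fr.lc = 'fr'
group by p.code;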

Adding this extra processing in the off-query import code will slow down the import a bit, but shouldn't have a big impact on the incremental sync from Redis.

In summary, the work items to address this using an off-query approach would be: