stephanegigandet opened this issue 6 months ago
I did a bit of an experiment here using the Postgres database that contains all the product data, as previously mentioned here: #8620
You can get a lot of the data needed using the standard Postgres COPY TO command, e.g.
```sql
copy (
    select
        code
        -- product page URL
        ,'http://world-en.openfoodfacts.org/product/' || code url
        -- UNIX timestamps, plus ISO 8601 versions of the same dates
        ,data->'created_t' created_t
        ,to_char(to_timestamp((data->>'created_t')::int), 'YYYY-MM-DD"T"HH24:MI:SS"Z"') created_datetime
        ,data->'last_modified_t' last_modified_t
        ,to_char(to_timestamp((data->>'last_modified_t')::int), 'YYYY-MM-DD"T"HH24:MI:SS"Z"') last_modified_datetime
        ,data->>'product_name' product_name
        -- tag arrays flattened to comma-separated strings
        ,array_to_string(ARRAY(SELECT json_array_elements_text(data->'packaging_tags')), ', ') packaging_tags
        ,array_to_string(ARRAY(SELECT json_array_elements_text(data->'brands_tags')), ', ') brands_tags
        ,array_to_string(ARRAY(SELECT json_array_elements_text(data->'categories_tags')), ', ') categories_tags
        ,data->>'ingredients_text' ingredients_text
        ,data->'ecoscore_data'->'score' ecoscore_score
        ,data->'ecoscore_data'->>'grade' ecoscore_grade
        ,array_to_string(ARRAY(SELECT json_array_elements_text(data->'data_quality_errors_tags')), ', ') data_quality_errors_tags
        -- selected nutriments per 100g
        ,data->'nutriments'->'energy-kj_100g' "energy-kj_100g"
        ,data->'nutriments'->'energy-kcal_100g' "energy-kcal_100g"
        ,data->'nutriments'->'energy_100g' "energy_100g"
    from product
    limit 1000
) to '/mnt/e/products.tsv' DELIMITER E'\t' null as ' ' CSV HEADER;
```
However, there are a number of issues:
From my previous experience with off-query, I think that storing the data in a relational format rather than JSON would fix the performance issue (I did a small experiment and 1000 records took about 100 ms using this approach).
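For illustration, a minimal sketch of what I mean by a relational format: a flattened table with one real column per exported field, so the export becomes a plain sequential scan instead of repeated JSON lookups. The product_export name and column list are made up for this example, not the actual off-query schema:

```sql
-- Illustrative only: a flattened export table (not the real off-query schema).
create table product_export (
    code               text primary key,
    created_t          bigint,
    last_modified_t    bigint,
    product_name       text,
    packaging_tags     text,
    brands_tags        text,
    categories_tags    text,
    ingredients_text   text,
    ecoscore_score     numeric,
    ecoscore_grade     text,
    "energy-kj_100g"   numeric,
    "energy-kcal_100g" numeric,
    energy_100g        numeric
);
```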
My suggestion would be to pre-process the data before loading it into the off-query database to resolve issues 2, 3 and 5.
In addition, to address issue 2, we would need the taxonomies to be loaded into off-query (I have some code to do that), and we would need to canonicalize all of the product's tag values before saving, so that we can do a direct SQL join to the taxonomy translations table.
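Roughly the shape of the join I have in mind once the tags are canonicalized; product_categories_tag and taxonomy_translation are hypothetical names, just to show that the translation step becomes a plain SQL join rather than a per-product lookup:

```sql
-- Hypothetical tables, for illustration only:
--   product_categories_tag(product_code, tag_id)     -- canonicalized tag ids
--   taxonomy_translation(taxonomy, tag_id, lang, name)
select p.code,
       string_agg(t_en.name, ',' order by t_en.name) categories_en,
       string_agg(t_fr.name, ',' order by t_fr.name) categories_fr
from product p
join product_categories_tag pc on pc.product_code = p.code
left join taxonomy_translation t_en
       on t_en.taxonomy = 'categories' and t_en.tag_id = pc.tag_id and t_en.lang = 'en'
left join taxonomy_translation t_fr
       on t_fr.taxonomy = 'categories' and t_fr.tag_id = pc.tag_id and t_fr.lang = 'fr'
group by p.code;
```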
Adding this extra processing in the off-query import code will slow down the import a bit, but shouldn't have a big impact on the incremental sync from Redis.
In summary, the work items to address this using an off-query approach would be:
It currently takes 4 hours every day (with 1 CPU busy the whole time) to generate the 4 CSV and RDF exports:
(plus a long query to MongoDB to go through all products)
The export is done with this script: https://github.com/openfoodfacts/openfoodfacts-server/blob/main/scripts/export_database.pl
The script goes through products one by one, copies some fields, converts others (e.g. UNIX timestamps to ISO 8601), adds French and English translations for a few tag fields (categories, labels, etc.), flattens some arrays into strings, and that's it.
I'm thinking that, instead of going through all products in MongoDB every day, we could maintain a separate database similar to the one used in https://github.com/openfoodfacts/openfoodfacts-query, with the same columns we export in CSV. We would then update only changed products using Redis.
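For the incremental part, the consumer of the Redis stream would only need to upsert the changed product into that flattened table. A sketch, using an illustrative product_export table and placeholder parameters for the values taken from the change event (or a single-product MongoDB fetch):

```sql
-- Sketch: upsert one changed product; $1..$4 come from the Redis change
-- event or a lookup of that single product, not from a full collection scan.
insert into product_export (code, created_t, last_modified_t, product_name)
values ($1, $2, $3, $4)
on conflict (code) do update
set created_t       = excluded.created_t,
    last_modified_t = excluded.last_modified_t,
    product_name    = excluded.product_name;
```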
There is also an RDF export that we generate at the same time as the CSV; one solution could be to generate and store the RDF for each product in a separate column.
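For example, if each product row carried a pre-computed rdf_xml fragment (a hypothetical column name), the nightly RDF export would reduce to concatenating those fragments and wrapping them in the fixed RDF header and footer:

```sql
-- Sketch, assuming a pre-computed per-product rdf_xml text column:
-- the export job writes the fixed RDF header, this aggregated string,
-- then the footer, instead of regenerating each product's RDF every day.
select string_agg(rdf_xml, E'\n' order by code)
from product_export
where rdf_xml is not null;
```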