openfoodfacts / openfoodfacts-server

Open Food Facts database, API server and web interface - 🐪🦋 Perl, CSS and JS coders welcome 😊 For helping in Python, see Robotoff or taxonomy-editor
GNU Affero General Public License v3.0

CSV export: add the main language of each product, and delete useless fields #9563

Open CharlesNepote opened 8 months ago

CharlesNepote commented 8 months ago

This issue continues the conversation from #2325 (which I think we could close).

Fields to add

lang to address the ingredients_text issue

The field ingredients_text is interesting because, as it corresponds to the main language of the product, it is the most likely to be filled. It would not be useful to export ingredients_en, ingredients_fr, etc. because many of them would be empty. So in the past we have chosen:

That said, there is no way to know what IS the main language for each product. Should we add either: lc, ingredients_lc or lang?

obsolete_since_date field

Many producers are sending us information when products are obsolete. We should add it to the CSV for many reasons:

rev field

This field represents the number of revisions of a product. As it is short, it's not very costly. It would allow us to:

unknown_ingredients_n

This would allow us to better investigate how to improve and prioritize ingredient quality.

The number of photos

It is a good proxy for a product's popularity. It can also be a way to know whether a product has a good chance of being fixed. It also allows monitoring products with new photos, for example those whose photos have not been selected. As the field is just a number, it isn't too costly.

Useless fields

On the other hand, we should try not to modify the CSV too often. So I would be in favor of deleting useless fields at the same time:

Exceptions

Some fields in the CSV don't have an _en equivalent.

Curiously, we have main_category and main_category_en, but not main_category_tag.
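
To spot this kind of inconsistency systematically, a small script could compare the CSV header against the expected suffix variants. This is only a sketch: the header below is made up for illustration, and the suffix list ("_tags", "_en") is an assumption about the naming pattern, not taken from the actual export code.

```python
def missing_variants(header, suffixes=("_tags", "_en")):
    """For each base column that has at least one suffixed variant,
    list the suffixed variants that are absent from the header."""
    columns = set(header)
    report = {}
    for base in header:
        present = [s for s in suffixes if base + s in columns]
        # Only report bases that have some variants but not all of them.
        if present and len(present) < len(suffixes):
            report[base] = [s for s in suffixes if base + s not in columns]
    return report


# Hypothetical header, for illustration only.
header = [
    "categories", "categories_tags", "categories_en",
    "main_category", "main_category_en",
    "brands", "brands_tags",
]

print(missing_variants(header))
```

Running it against the first line of the real export would give the actual list of exceptions.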

Redundant date fields

Should we also remove all the fields ending with _t (Unix epoch format), which are redundant with the _datetime fields? 60 MB are lost due to this redundancy.
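
To illustrate why the _t fields are redundant, here is a minimal sketch showing that a _datetime value can be recomputed from the epoch value, and how the redundant columns could be detected from a header. The column names in the example and the exact datetime format (ISO 8601, UTC, with a trailing "Z") are assumptions for illustration.

```python
from datetime import datetime, timezone


def epoch_to_iso(epoch_seconds):
    """Convert a Unix epoch (seconds) to an ISO 8601 UTC string."""
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")


def redundant_epoch_columns(header):
    """Return the columns ending in _t that have a matching _datetime column."""
    columns = set(header)
    return [
        col for col in header
        if col.endswith("_t") and col[:-2] + "_datetime" in columns
    ]


# Hypothetical header, for illustration only.
header = [
    "code", "created_t", "created_datetime",
    "last_modified_t", "last_modified_datetime", "product_name",
]

print(redundant_epoch_columns(header))
print(epoch_to_iso(0))
```

Reusers who need the epoch value could do the reverse conversion on their side, so dropping the _t columns would not lose information.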

benbenben2 commented 8 months ago

I see the @export_fields variable in Config_off.pm, which contains the first fields that you mentioned.

For the number of photos, in the api result we have something like this:

"images":{"1":{"sizes":{"100":{"h":40,"w":100},"400":{"h":161,"w":400},"full":{"h":1200,"w":2984}},"uploaded_t":1535370936,"uploader":"kiliweb"},"10":{"sizes":{"100":{"h":22,"w":100},"400":{"h":87,"w":400},"full":{"h":658,"w":3024}},"uploaded_t":1610373128,"uploader":"kiliweb"},"11":{"sizes":{"100":{"h":75,"w":100},"400":{"h":299,"w":400},"full":{"h":2992,"w":4000}},"uploaded_t":1660335243,"uploader":"moon-rabbit"},"12":{"sizes":{"100":{"h":75,"w":100},"400":{"h":299,"w":400},"full":{"h":2992,"w":4000}},"uploaded_t":1660335252,"uploader":"openfoodfacts-contributors"},"13":{"sizes":{"100":{"h":75,"w":100},"400":{"h":299,"w":400},"full":{"h":2992,"w":4000}},"uploaded_t":1660335260,"uploader":"openfoodfacts-contributors"},"14":{"sizes":{"100":{"h":100,"w":56},"400":{"h":400,"w":225},"full":{"h":1280,"w":720}},"uploaded_t":1685811820,"uploader":"insectproductadd"},"15":{"sizes":{"100":{"h":17,"w":100},"400":{"h":68,"w":400},"full":{"h":330,"w":1949}},"uploaded_t":1693411520,"uploader":"mismer"},"16":{"sizes":{"100":{"h":100,"w":89},"400":{"h":400,"w":356},"full":{"h":671,"w":597}},"uploaded_t":1693411555,"uploader":"mismer"},"2":{"sizes":{"100":{"h":22,"w":100},"400":{"h":87,"w":400},"full":{"h":562,"w":2583}},"uploaded_t":1535370948,"uploader":"kiliweb"},"3":{"sizes":{"100":{"h":50,"w":100},"400":{"h":199,"w":400},"full":{"h":2431,"w":4896}},"uploaded_t":1538851611,"uploader":"anticultist"},"4":{"sizes":{"100":{"h":36,"w":100},"400":{"h":146,"w":400},"full":{"h":1582,"w":4338}},"uploaded_t":1538851798,"uploader":"anticultist"},"5":{"sizes":{"100":{"h":22,"w":100},"400":{"h":88,"w":400},"full":{"h":924,"w":4213}},"uploaded_t":1538851824,"uploader":"anticultist"},"6":{"sizes":{"100":{"h":40,"w":100},"400":{"h":160,"w":400},"full":{"h":481,"w":1200}},"uploaded_t":1547153415,"uploader":"twoflower"},"7":{"sizes":{"100":{"h":42,"w":100},"400":{"h":169,"w":400},"full":{"h":508,"w":1200}},"uploaded_t":1547153419,"uploader":"twoflower"},"8":{"sizes":{"100":{"h":22,"w":100},"400
":{"h":88,"w":400},"full":{"h":264,"w":1200}},"uploaded_t":1547153424,"uploader":"twoflower"},"9":{"sizes":{"100":{"h":40,"w":100},"400":{"h":161,"w":400},"full":{"h":1200,"w":2984}},"uploaded_t":1610373126,"uploader":"kiliweb"},

Not sure if we can do a count of this.
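
For what it's worth, counting the entries looks feasible on the consumer side. Here is a minimal sketch in Python (not the server's Perl), assuming that uploaded photos are the entries of "images" with a purely numeric key, and that any other key (the "front_fr" entry below is a made-up example) is not an uploaded photo. The sample product is a trimmed-down, invented one in the same shape as the excerpt above.

```python
import json


def count_uploaded_photos(product):
    """Count the entries of product["images"] whose key is purely numeric."""
    images = product.get("images") or {}
    return sum(1 for key in images if key.isdigit())


# Invented sample in the shape of the API excerpt above.
sample = json.loads("""
{"images": {
  "1": {"uploaded_t": 1535370936, "uploader": "kiliweb"},
  "2": {"uploaded_t": 1535370948, "uploader": "kiliweb"},
  "front_fr": {"imgid": "1"}
}}
""")

print(count_uploaded_photos(sample))
```

Whether the export code can compute the same count when generating the CSV is a separate question.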

Not everything is in this Config_off.pm variable. For example, if we export a CSV, we have some columns like "packaging_1_number_of_units" or "packaging_1_shape". It is not clear to me where these are defined.

Also, I do not see all these "countries_en", "categories_en", etc.

Maybe I am not looking at the same csv...