Open CharlesNepote opened 11 months ago
I see the @export_fields in Config_off.pm, which contains the first fields that you mentioned.
For the number of photos, the API result contains something like this:
"images":{"1":{"sizes":{"100":{"h":40,"w":100},"400":{"h":161,"w":400},"full":{"h":1200,"w":2984}},"uploaded_t":1535370936,"uploader":"kiliweb"},"10":{"sizes":{"100":{"h":22,"w":100},"400":{"h":87,"w":400},"full":{"h":658,"w":3024}},"uploaded_t":1610373128,"uploader":"kiliweb"},"11":{"sizes":{"100":{"h":75,"w":100},"400":{"h":299,"w":400},"full":{"h":2992,"w":4000}},"uploaded_t":1660335243,"uploader":"moon-rabbit"},"12":{"sizes":{"100":{"h":75,"w":100},"400":{"h":299,"w":400},"full":{"h":2992,"w":4000}},"uploaded_t":1660335252,"uploader":"openfoodfacts-contributors"},"13":{"sizes":{"100":{"h":75,"w":100},"400":{"h":299,"w":400},"full":{"h":2992,"w":4000}},"uploaded_t":1660335260,"uploader":"openfoodfacts-contributors"},"14":{"sizes":{"100":{"h":100,"w":56},"400":{"h":400,"w":225},"full":{"h":1280,"w":720}},"uploaded_t":1685811820,"uploader":"insectproductadd"},"15":{"sizes":{"100":{"h":17,"w":100},"400":{"h":68,"w":400},"full":{"h":330,"w":1949}},"uploaded_t":1693411520,"uploader":"mismer"},"16":{"sizes":{"100":{"h":100,"w":89},"400":{"h":400,"w":356},"full":{"h":671,"w":597}},"uploaded_t":1693411555,"uploader":"mismer"},"2":{"sizes":{"100":{"h":22,"w":100},"400":{"h":87,"w":400},"full":{"h":562,"w":2583}},"uploaded_t":1535370948,"uploader":"kiliweb"},"3":{"sizes":{"100":{"h":50,"w":100},"400":{"h":199,"w":400},"full":{"h":2431,"w":4896}},"uploaded_t":1538851611,"uploader":"anticultist"},"4":{"sizes":{"100":{"h":36,"w":100},"400":{"h":146,"w":400},"full":{"h":1582,"w":4338}},"uploaded_t":1538851798,"uploader":"anticultist"},"5":{"sizes":{"100":{"h":22,"w":100},"400":{"h":88,"w":400},"full":{"h":924,"w":4213}},"uploaded_t":1538851824,"uploader":"anticultist"},"6":{"sizes":{"100":{"h":40,"w":100},"400":{"h":160,"w":400},"full":{"h":481,"w":1200}},"uploaded_t":1547153415,"uploader":"twoflower"},"7":{"sizes":{"100":{"h":42,"w":100},"400":{"h":169,"w":400},"full":{"h":508,"w":1200}},"uploaded_t":1547153419,"uploader":"twoflower"},"8":{"sizes":{"100":{"h":22,"w":100},"400
":{"h":88,"w":400},"full":{"h":264,"w":1200}},"uploaded_t":1547153424,"uploader":"twoflower"},"9":{"sizes":{"100":{"h":40,"w":100},"400":{"h":161,"w":400},"full":{"h":1200,"w":2984}},"uploaded_t":1610373126,"uploader":"kiliweb"},
Not sure if we can do a count of this
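One possible way to count them, sketched in Python. It assumes that each uploaded photo is an entry of `images` whose key is a plain number, as in the excerpt above, and it ignores any other key. `count_uploaded_photos` is a hypothetical helper, not an existing function in the codebase, and the sample is a trimmed-down copy of the excerpt.

```python
import json


def count_uploaded_photos(product: dict) -> int:
    """Count the entries of product["images"] whose key is a plain number.

    Assumption: numeric keys ("1", "2", ...) are uploaded photos; any
    other key, if present, is ignored.
    """
    images = product.get("images") or {}
    return sum(1 for key in images if key.isdigit())


# Trimmed-down sample shaped like the API result above.
sample = json.loads("""
{"images": {
  "1": {"uploaded_t": 1535370936, "uploader": "kiliweb"},
  "2": {"uploaded_t": 1535370948, "uploader": "kiliweb"},
  "10": {"uploaded_t": 1610373128, "uploader": "kiliweb"}
}}
""")

print(count_uploaded_photos(sample))  # 3
```

The result is a single integer per product, so it would fit in one CSV column.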
Not everything is in this Config_off.pm variable. For example, if we export a CSV, we have some columns like "packaging_1_number_of_units" or "packaging_1_shape". It is not clear to me where these are defined.
Also, I do not see all these "countries_en", "categories_en", etc.
Maybe I am not looking at the same csv...
This issue is continuing the conversation of #2325 (I think we could close it).
Fields to add
`lang` to address the `ingredients_text` issue
The field `ingredients_text` is interesting because, corresponding to the main language of the product, it is the most likely to be filled. It would not be useful to export `ingredients_en`, `ingredients_fr`, etc. because many of them would be empty. So in the past we have chosen:
- `ingredients_text`
- `ingredients_tags`, e.g. `en:brown-sugar`.
That said, there is no way to know what IS the main language for each product. Should we add either `lc`, `ingredients_lc` or `lang`?
`obsolete_since_date` field
Many producers are sending us information when products are obsolete. We should add it to the CSV for many reasons:
`rev` field
This field represents the number of revisions of a product. As it is short, it's not very costly. It would allow to:
`unknown_ingredients_n`
This would allow us to better investigate how to improve/prioritize ingredients' quality.
The number of photos
It is a good proxy for a product's popularity. It can also be a way to tell whether the product has a good chance of being fixed. It also allows monitoring products with new photos, for example those where photos are not selected. As the field is just a number, it isn't too costly.
Useless fields
On the other hand, we should try not to modify the CSV too often. So I would be in favor of deleting useless fields at the same time:
- `countries`: this field can mix data in different languages; it's better to rely on `countries_tags` (e.g. `en:united-kingdom`), which is a normalized version of the countries. `countries_en` (e.g. `United Kingdom`) is here for comfort. But we could also remove it.
- `categories` and `categories_en`
- `labels` and `labels_en`
- `packaging` and `packaging_en`
- `origins` and `origins_en`
- `traces` and `traces_en`
- `additives` and `additives_en`
- `food_groups` and `food_groups_en`: is `food_groups` always normalized?
- `states` and `states_en`: same remark as `food_groups`
For `food_groups` and `states`, at least, `states` and `states_tags` are almost identical; the only difference is that `states` contains spaces.
Exceptions
Some fields in the CSV don't have an `_en` equivalent.
- `manufacturing_places`: we have `manufacturing_places_tags` but we don't have `manufacturing_places_en`
- `emb_codes`, `cities`, `allergens`.
Should we only keep the `_tags` fields?
Curiously, we have `main_category` and `main_category_en`, but not `main_category_tag`.
Redundant date fields
Should we also remove all the fields ending with `_t` (Unix epoch format), which are redundant with the `_datetime` fields? 60 MB are lost due to this redundancy.
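To illustrate the redundancy, here is a minimal Python sketch. It assumes each `_datetime` column holds the same instant as its `_t` sibling, expressed in UTC. `epoch_to_datetime` and `drop_redundant_epoch_columns` are hypothetical helpers, and the header is a short illustrative sample rather than the full export.

```python
from datetime import datetime, timezone


def epoch_to_datetime(epoch_seconds: int) -> str:
    """Render a Unix epoch value as an ISO 8601 UTC timestamp."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


def drop_redundant_epoch_columns(header: list[str]) -> list[str]:
    """Drop each `<name>_t` column that has a `<name>_datetime` sibling."""
    names = set(header)
    return [
        col for col in header
        if not (col.endswith("_t") and col[:-2] + "_datetime" in names)
    ]


# Illustrative header only; the real export has many more columns.
header = ["code", "created_t", "created_datetime", "product_name"]
print(drop_redundant_epoch_columns(header))
# ['code', 'created_datetime', 'product_name']
print(epoch_to_datetime(1535370936))
# 2018-08-27T11:55:36Z
```

Anyone who still needs the epoch value could recompute it from the `_datetime` column, so nothing would be lost by dropping the `_t` fields under this assumption.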