Open CharlesNepote opened 1 year ago
Whatever the solution chosen, we should store the result of the score computations in DB, as it would be costly to recompute them for each client request. The only difference is whether we store the resulting data and computations in {nutriscore,ecoscore}_{tag,grade,data} (suggestion 1) or in a new field (suggestion 2).
I would be in favor of suggestion 2 (keeping predicted data separated), as it's the safest solution. It also makes a clear distinction between gold and predicted data, which is really important.
Instead of the ai_ prefix, I would prefer the predicted_ prefix:
predicted_categories_tags
predicted_nutriscore_tag
predicted_nutriscore_grade
predicted_nutriscore_data
predicted_ecoscore_tag
predicted_ecoscore_grade
predicted_ecoscore_data
If we choose solution 2, I don't think we should export these new fields in the CSV export (except maybe predicted categories?), the MongoDB dump + JSONL is sufficient, as it's quite an advanced use case.
Technically, there are some issues about how we should implement this. Product data are both stored in sto revision files at every update and in the MongoDB. Category prediction can take a bit of time (up to a few seconds, depending on whether the image embeddings are cached in DB or not). It doesn't sound feasible to me for Product Opener to wait to get the prediction response from Robotoff in order to save the product. What I would suggest:
I'm willing to hear your opinions on this @alexgarel @stephanegigandet @teolemon
@raphael0202
I would also be in favor of keeping predicted data separated.
Yes, predicted_categories_tags sounds better than ai_categories_tags.
But I don't feel very comfortable with predicted_nutriscore_grade and the other ones because it's not very clear how this field is computed. I would be more comfortable with nutriscore_grade_from_predicted_categories. In the future, we could have other predicted Nutri-Scores based on other predictions.
Can you use a _predicted suffix instead? We return JSON in alphabetical order, so it's great to see some_field and some_field_predicted next to each other. That's also what we do for nutrients: nutriments + nutriments_estimated
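To illustrate the ordering argument, here is a minimal Python sketch. The product document and its values are made up for illustration; only the field names come from this discussion. It shows that, with keys sorted alphabetically, a `_predicted` suffix keeps the field next to its base field, while a `predicted_` prefix moves it away.

```python
import json

# Hypothetical product document: values are invented for illustration only.
product = {
    "predicted_categories_tags": ["en:biscuits"],   # prefix variant
    "categories_tags_predicted": ["en:biscuits"],   # suffix variant
    "categories_tags": [],
    "nutriments": {},
    "nutriments_estimated": {},
    "code": "0000000000000",
}

# Keys in alphabetical order, as described for the JSON output.
sorted_keys = sorted(product)

# Serialize with sorted keys to see the resulting layout.
print(json.dumps(product, sort_keys=True, indent=2))
```

With the suffix, `categories_tags` and `categories_tags_predicted` end up adjacent, just like `nutriments` and `nutriments_estimated`; the prefixed name sorts after all of them.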
For the nutriscore fields, I don't think we should create new fields if we used a predicted category, but instead we can use other fields to indicate how the nutriscore was computed (e.g. with missing fibers, with estimated ingredients etc. as we already do today). Currently it's in the misc tags.
So for Nutri-Score, I very much favor solution 1.
For the Eco-Score, as the score currently strongly depends on the category (and not mostly on the distinction between food and beverages), I would not compute any estimated Eco-Score based on a predicted category that has not been verified.
Thanks @stephanegigandet.
To summarize, I rewrote suggestion 1 below.
[ ] When Robotoff detects a category, it stores it in a new dedicated field (categories_tags_predicted), and leaves categories empty.
[ ] When Robotoff has detected a category (categories_tags_predicted) AND categories_tags is empty
[ ] When a user opens a web or mobile page where there is no categories_tags but categories_tags_predicted
[ ] When an API call asks for the Nutri-Score, it should be aware that the category is detected by an AI: the field categories_tags_predicted could be automatically returned (even if the user does not ask for it)
[ ] The field categories_tags_predicted should be exported in the CSV file and other kinds of exports, to allow reusers to use (or not) the Nutri-Score based on the categories_tags_predicted values
[ ] The documentation clearly states that:
  - if categories_tags is empty and categories_tags_predicted is not, nutriscore_grade is computed with the help of categories_tags_predicted
  - if categories_tags is not empty, nutriscore_grade is based on it, ignoring whether categories_tags_predicted is completed or not
[ ] Data quality errors should make a difference between regular errors and errors where the category could play a role
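The documentation rule above (confirmed categories win, predicted categories are only a fallback) could be sketched as follows. This is an illustrative Python sketch, not Product Opener code: the function name `categories_for_nutriscore` and the returned source labels are assumptions made for the example.

```python
from typing import List, Tuple


def categories_for_nutriscore(product: dict) -> Tuple[List[str], str]:
    """Return the categories to use for nutriscore_grade and their source.

    Hypothetical helper: categories_tags takes precedence whenever it is
    not empty; categories_tags_predicted is used only as a fallback.
    """
    confirmed = product.get("categories_tags") or []
    predicted = product.get("categories_tags_predicted") or []
    if confirmed:
        # Predicted categories are ignored, whether they are filled in or not.
        return confirmed, "categories_tags"
    if predicted:
        return predicted, "categories_tags_predicted"
    return [], "none"


# Usage: a product with only a predicted category falls back to it.
only_predicted = {"categories_tags": [], "categories_tags_predicted": ["en:biscuits"]}
both = {"categories_tags": ["en:cakes"], "categories_tags_predicted": ["en:biscuits"]}
print(categories_for_nutriscore(only_predicted))
print(categories_for_nutriscore(both))
```

Returning the source alongside the categories is one way to let API consumers and data quality checks tell the two cases apart.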
Do we agree with this?
@raphael0202: is Robotoff currently predicting categories or categories_tags?
I was also wondering if there are other use cases where the categories_tags_predicted could be used.
Yes I agree @CharlesNepote.
I was also wondering if there are other use cases where the categories_tags_predicted could be used.
Maybe displaying all products that have a specific predicted category through a facet?
Maybe displaying all products that have a specific predicted category through a facet?
We can do that, but we are running out of MongoDB indexes, so it's likely queries will fail (unless the number of products is reduced by another indexed facet)
But I don't feel very comfortable with predicted_nutriscore_grade and the other ones because it's not very clear how this field is computed. I would be more comfortable with nutriscore_grade_from_predicted_categories. In the future, we could have other predicted Nutri-Scores based on other predictions.
Big +1
My only feedback:
@teolemon
- We should not create a loss of chance to get the actual category.
Sure. This is why I wrote: "the user is asked to confirm what the AI has found" and the user can still edit the product and add the "real" category.
- I think we should avoid jeopardizing the entry points to the Road to Scores (or at least put specific nudges when the colored Nutri-Score KP is displayed instead of the gray knowledge panel, which has a nudge to complete the category).
+1
@teolemon I rewrote my sentence: "the user is asked to confirm what the AI has found and/or the user is asked to enter the category he/she's seeing".
I think we have reached a consensus for suggestion 1, except for the nutriscore_grade data field.
@teolemon and (initially) I suggested it should be a different field if it is computed thanks to categories_tags_predicted. I have changed my mind and now think we can simplify this: we just have to clearly write in the documentation that:
- if categories_tags is empty and categories_tags_predicted is not, nutriscore_grade is computed with the help of categories_tags_predicted
- if categories_tags is not empty, nutriscore_grade is based on it, ignoring whether categories_tags_predicted is completed or not
It looks good to me. About the technical implementation, I would like to know what @stephanegigandet thinks about the proposal I made above:
Technically, there are some issues about how we should implement this. Product data are both stored in sto revision files at every update and in the MongoDB. Category prediction can take a bit of time (up to a few seconds, depending on whether the image embeddings are cached in DB or not). It doesn't sound feasible to me for Product Opener to wait to get the prediction response from Robotoff in order to save the product. What I would suggest:
- Product Opener pings Robotoff when a product update is performed (this is already implemented)
- Robotoff computes category predictions and saves them in DB (this is already done as well)
- If the product doesn't have any category:
  - Robotoff calls a new API of Product Opener, with the product barcode and the predicted categories as parameters. Product Opener updates the MongoDB and the sto file, without increasing the rev ID.
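The proposed flow can be sketched in Python with an in-memory stand-in for the product storage. Everything here is hypothetical: `ProductStore`, `save_user_edit` and `save_predicted_categories` are illustrative names, not existing Product Opener or Robotoff APIs, and the dict stands in for both the MongoDB document and the sto file.

```python
from dataclasses import dataclass, field


@dataclass
class ProductStore:
    """Hypothetical stand-in for the MongoDB document plus the sto file."""

    products: dict = field(default_factory=dict)

    def save_user_edit(self, barcode: str, changes: dict) -> dict:
        """A regular product update: creates a new revision."""
        product = self.products.setdefault(barcode, {"rev": 0})
        product.update(changes)
        product["rev"] += 1
        return product

    def save_predicted_categories(self, barcode: str, predicted: list) -> bool:
        """What the new API called by Robotoff might do (assumption).

        Stores the prediction only if the product has no category,
        and leaves the rev ID untouched.
        """
        product = self.products[barcode]
        if product.get("categories_tags"):
            return False  # a confirmed category exists, nothing to store
        product["categories_tags_predicted"] = list(predicted)
        return True


# Usage: a user edit bumps the revision, the prediction callback does not.
store = ProductStore()
store.save_user_edit("123", {"product_name": "Cookies"})
stored = store.save_predicted_categories("123", ["en:biscuits"])
print(stored, store.products["123"]["rev"])
```

Because the prediction arrives through a separate call, saving the product never has to wait for Robotoff.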
If the product doesn't have any category:
- Robotoff calls a new API of Product Opener, with the product barcode and the predicted categories as parameters. Product Opener updates the MongoDB and the sto file, without increasing the rev ID.
@raphael0202 What's the reason for not updating the rev id? Just because of the volume of predictions we think we will store?
Sorry for the delay, I missed your message, Stephane.
What's the reason for not updating the rev id? Just because of the volume of predictions we think we will store?
Every time we update a field that is one of the model inputs, the predictions can change (and the prediction confidence will change anyway), so it would create a new revision. Many common updates (new image, updated ingredients, updated nutritional values...) would therefore trigger an additional revision. That's why I was more in favor of updating the sto directly (and it makes more sense in my opinion, as we're not adding additional information here).
Gathering categories is a huge challenge.
As of today (2023-04-21), ~1,670,000 of 2,885,092 products don't have any category, representing ~60%.
Robotoff successfully detects hundreds of thousands of new categories, but we still find false positives in these detections. Even if these false positives are ~2%, it represents tens of thousands of products leading to bad data quality.
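The orders of magnitude above can be checked with a few lines of Python. The product counts and the ~2% rate are the figures quoted in this issue; applying the rate to every uncategorized product is an assumption made here to get an upper bound, since the exact number of predictions is not given.

```python
# Figures quoted in this issue (2023-04-21); the first one is approximate.
without_category = 1_670_000
total_products = 2_885_092

# Share of products without any category (a bit under 60%).
share = without_category / total_products
print(f"{share:.1%}")

# Assumption: every uncategorized product gets a predicted category,
# with the ~2% false positive rate mentioned above.
false_positive_rate = 0.02
wrong_categories = round(without_category * false_positive_rate)
print(wrong_categories)
```

Under that assumption, the number of wrongly categorized products lands in the tens of thousands, which is the scale of the data quality concern raised here.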
I think we could find a compromise.
Suggestion 1: data based on the category from AI is computed for everyone
categories empty; e.g. ia_categories_tags
categories_tags but ia_categories_tags
ia_categories_tags could be automatically returned
ia_categories_tags should be exported in the CSV file and other kinds of exports, to allow reusers to use (or not) the data computed based on the ia_categories_tags values
Suggestion 2: data based on the category from AI is only displayed for web and mobile users
(Differences from suggestion 1 are bolded.)
categories empty; e.g. ia_categories_tags
categories_tags but ia_categories_tags
ia_categories_tags
ia_categories_tags
ia_categories_tags should be exported in the CSV file
ia_categories_tags could play a role