openfoodfacts / openfoodfacts-server

Open Food Facts database, API server and web interface - 🐪🦋 Perl, CSS and JS coders welcome 😊 For helping in Python, see Robotoff or taxonomy-editor
GNU Affero General Public License v3.0

Always use robotoff category detection (with warnings) #8346

Open CharlesNepote opened 1 year ago

CharlesNepote commented 1 year ago

Gathering categories is a huge challenge.

As of today (2023-04-21), ~1,670,000 of 2,885,092 products don't have any category, representing ~60%.

Robotoff successfully detects hundreds of thousands of new categories, but we still find false positives in these detections. Even if these false positives are only ~2%, that represents tens of thousands of products with bad data quality.
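
As a rough check of the orders of magnitude above, here is a small sketch. It uses only the figures quoted in this comment, and it assumes (this is not stated in the thread) that Robotoff would predict a category for every uncategorized product.

```python
# Figures quoted in the comment above (2023-04-21).
total_products = 2_885_092
without_category = 1_670_000

share_without_category = without_category / total_products
print(f"{share_without_category:.1%}")  # 57.9%, i.e. roughly 60%

# Assumption: one prediction per uncategorized product, 2% of them wrong.
false_positive_rate = 0.02
wrong_predictions = round(without_category * false_positive_rate)
print(wrong_predictions)  # 33400, i.e. tens of thousands of products
```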

I think we could find a compromise.

Suggestion 1: data based on the category from AI is computed for everyone

Suggestion 2: data based on the category from AI is only displayed for web and mobile users

(Differences from suggestion 1 are bolded.)

raphael0202 commented 1 year ago

Whatever solution is chosen, we should store the results of the score computations in the DB, as it would be costly to recompute them for each client request. The only difference is whether we store the resulting data and computations in {nutriscore,ecoscore}_{tag,grade,data} (suggestion 1) or in a new field (suggestion 2). I would be in favor of suggestion 2 (keeping predicted data separate), as it's the safest solution. It also makes a clear distinction between gold and predicted data, which is really important. Instead of the ai_ prefix, I would prefer the predicted_ prefix:

If we choose solution 2, I don't think we should export these new fields in the CSV export (except maybe predicted categories?), the MongoDB dump + JSONL is sufficient, as it's quite an advanced use case.

Technically, there are some issues about how we should implement this. Product data are both stored in sto revision files at every update and in the MongoDB. Category prediction can take a bit of time (up to a few seconds, depending on whether the image embeddings are cached in DB or not). It doesn't sound feasible to me for Product Opener to wait to get the prediction response from Robotoff in order to save the product. What I would suggest:

I'm willing to hear your opinions on this @alexgarel @stephanegigandet @teolemon

CharlesNepote commented 1 year ago

@raphael0202

I would also be in favor of keeping predicted data separated.

Yes, predicted_categories_tags sounds better than ai_categories_tags.

But I don't feel very comfortable with predicted_nutriscore_grade and the other ones because it's not very clear how this field is computed. I would be more comfortable with nutriscore_grade_from_predicted_categories. In the future, we could have other predicted Nutri-Scores based on other predictions.

stephanegigandet commented 1 year ago

Can you use a _predicted suffix instead? We return JSON in alphabetical order, so it's great to see some_field and some_field_predicted next to each other. That's also what we do for nutrients: nutriments + nutriments_estimated
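
To illustrate the ordering argument, here is a minimal Python sketch (Product Opener itself is written in Perl, so this is not its serialization code). The product document and its values are made up; only the field names categories_tags, categories_tags_predicted, nutriments and nutriments_estimated come from this thread.

```python
import json

# Hypothetical product document; values are placeholders for illustration.
product = {
    "code": "0000000000000",
    "nutriments": {},
    "categories_tags_predicted": ["en:biscuits"],
    "product_name": "Example",
    "categories_tags": [],
    "nutriments_estimated": {},
}

# Serialize with keys in alphabetical order, then read the key order back.
serialized = json.dumps(product, sort_keys=True, indent=2)
keys_in_order = list(json.loads(serialized))
print(keys_in_order)
```

With a suffix, each predicted or estimated field lands right after its base field. With a predicted_ prefix, the same field would sort under "p", far from categories_tags.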

For the nutriscore fields, I don't think we should create new fields if we used a predicted category, but instead we can use other fields to indicate how the nutriscore was computed (e.g. with missing fibers, with estimated ingredients etc. as we already do today). Currently it's in the misc tags.

So for Nutri-Score, I very much favor solution 1.

For the Eco-Score, as the score currently depends strongly on the category (and not mostly on the distinction between food and beverages), I would not compute any estimated Eco-Score based on a predicted category that has not been verified.

CharlesNepote commented 1 year ago

Thanks @stephanegigandet.

To summarize, I rewrote suggestion 1 below.

Suggestion 1: data based on the category from AI is computed for everyone

Do we agree with this?

@raphael0202: is Robotoff currently predicting categories or categories_tags?

I was also wondering if there are other use cases where the categories_tags_predicted could be used.

raphael0202 commented 1 year ago

Yes I agree @CharlesNepote.

I was also wondering if there are other use cases where the categories_tags_predicted could be used.

Maybe displaying all products that have a specific predicted category through a facet?

stephanegigandet commented 1 year ago

Maybe displaying all products that have a specific predicted category through a facet?

We can do that, but we are running out of MongoDB indexes, so it's likely queries will fail (unless the number of products is reduced by another indexed facet)

teolemon commented 1 year ago

But I don't feel very comfortable with predicted_nutriscore_grade and the other ones because it's not very clear how this field is computed. I would be more comfortable with nutriscore_grade_from_predicted_categories. In the future, we could have other predicted Nutri-Scores based on other predictions.

Big +1

teolemon commented 1 year ago

My only feedback:

CharlesNepote commented 1 year ago

@teolemon

  • We should not create a loss of chance to get the actual category.

Sure. This is why I wrote: "the user is asked to confirm what the AI has found" and the user can still edit the product and add the "real" category.

  • I think we should avoid jeopardizing the entry points to the Road to Scores (or at least put specific nudges when the colored Nutri-Score KP is displayed instead of the gray knowledge panels, which have a nudge to complete the category).

+1

@teolemon I rewrote my sentence: "the user is asked to confirm what the AI has found and/or the user is asked to enter the category he/she's seeing".

CharlesNepote commented 1 year ago

I think we have reached a consensus on Suggestion 1, except for the nutriscore_grade data field. @teolemon and (initially) I suggested it should be a different field when it is computed from categories_tags_predicted. I have changed my mind and now think we can simplify this: we just have to clearly write in the documentation that:

raphael0202 commented 1 year ago

It looks good to me. About the technical implementation, I would like to know what @stephanegigandet thinks about the proposal I made above:

Technically, there are some issues about how we should implement this. Product data are both stored in sto revision files at every update and in the MongoDB. Category prediction can take a bit of time (up to a few seconds, depending on whether the image embeddings are cached in DB or not). It doesn't sound feasible to me for Product Opener to wait to get the prediction response from Robotoff in order to save the product. What I would suggest:

  • Product Opener pings Robotoff when a product update is performed (this is already implemented)
  • Robotoff computes category predictions and saves them in DB (this is already done as well)
  • If the product doesn't have any category:
    • Robotoff calls a new API of Product Opener, with the product barcode and the predicted categories as parameters. Product Opener updates the MongoDB and the sto file, without increasing the rev ID.
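
The last step of the list above could look roughly like the following in-memory sketch. It is not Product Opener code: the function name apply_predicted_categories and the rev key are hypothetical, and the real update would also write to MongoDB and the sto file. Only the field name categories_tags_predicted comes from this thread.

```python
def apply_predicted_categories(product: dict, predicted: list[str]) -> bool:
    """Store predicted categories on a product that has no category yet.

    Returns True if the product was updated. The revision number is
    deliberately left untouched, as proposed in the list above.
    """
    if product.get("categories_tags"):
        # A category was already entered: do not store the prediction.
        return False
    product["categories_tags_predicted"] = list(predicted)
    return True


# Hypothetical product without any category, currently at revision 7.
product = {"code": "0000000000000", "rev": 7, "categories_tags": []}
updated = apply_predicted_categories(product, ["en:biscuits"])
print(updated, product["rev"], product["categories_tags_predicted"])
```

Keeping the prediction in its own field means a later human edit of categories_tags never has to be merged with, or distinguished from, the predicted value.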

stephanegigandet commented 1 year ago

If the product doesn't have any category:

  • Robotoff calls a new API of Product Opener, with the product barcode and the predicted categories as parameters. Product Opener updates the MongoDB and the sto file, without increasing the rev ID.

@raphael0202 What's the reason for not updating the rev id? Just because of the volume of predictions we think we will store?

raphael0202 commented 12 months ago

Sorry for the delay, I missed your message, Stephane.

What's the reason for not updating the rev id? Just because of the volume of predictions we think we will store?

Every time we update a field that is one of the model inputs, the predictions can change (and the prediction confidence will change anyway), so it would create a new revision. Many common updates (new image, updated ingredients, updated nutritional values, ...) would therefore trigger an additional revision. That's why I was more in favor of updating the sto directly (and it makes more sense in my opinion, since we're not adding any new information here).