openfoodfacts / openfoodfacts-ai

This is a tracking repo for all our AI projects. 🍕 🤖🍼

Predict an image crop of nutrition tables #311

Open raphael0202 opened 1 year ago

raphael0202 commented 1 year ago

This issue is meant to track the progress and previous work made on nutrition table detection and cropping.

Previous work

As part of Google Summer of Code 2018, a student (Sagar) trained an object detection model to detect nutrition tables based on an annotated dataset of nutrition table images. This model was never integrated into Robotoff.

In December 2019, using Sagar's training dataset as a baseline, I ran a new annotation campaign to enrich the dataset and fix some errors, raising the number of samples to ~1k. A new model was trained on this dataset using the Tensorflow Object Detection API: https://github.com/openfoodfacts/robotoff-models/releases/tag/tf-nutrition-table-1.0 This is the object detection model currently used in production.

Until very recently, we didn't use this model's predictions. As of May 26th, 2023, we detect nutrition images based on nutrient mentions ("salt", "energy", "saturated fat", ...) and values ("15g", "255 kcal", ...). The object detection model is only used to predict an image crop when its confidence is very high (>= 0.9); its predictions are not reliable enough to use a lower threshold. As a result, we noticed that most nutrition_image predictions don't have a predicted crop, which means we use the full image as the selected image. This is something we would like to change, to move to fully automated nutrition image selection and cropping.
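A minimal sketch of this mention/value detection heuristic (the regexes here are illustrative only, not Robotoff's actual, more extensive and multilingual matchers):

```python
import re

# Illustrative patterns only; the real matchers cover many more
# nutrient names, languages and unit spellings.
NUTRIENT_MENTIONS = re.compile(
    r"\b(energy|salt|sugars?|saturated fat|proteins?|carbohydrates?)\b",
    re.IGNORECASE,
)
NUTRIENT_VALUES = re.compile(r"\b\d+(?:[.,]\d+)?\s?(?:g|mg|kj|kcal)\b", re.IGNORECASE)


def looks_like_nutrition_text(ocr_text: str) -> bool:
    """Heuristic: the OCR text contains several nutrient names and values."""
    mentions = NUTRIENT_MENTIONS.findall(ocr_text)
    values = NUTRIENT_VALUES.findall(ocr_text)
    return len(mentions) >= 2 and len(values) >= 2
```

A real detector would work on the word-level OCR output rather than the full text, so that each match keeps its bounding box for the cropping step described below.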

Proposal

I started implementing a simple crop-prediction algorithm in Robotoff based on nutrient mentions and values: the idea was to select the minimal bounding box that includes all detected nutrient mentions/values. It improved results over uncropped images. However, we still had some issues:

  1. words that were ingredients but were detected as nutrient mentions ("sugar", "salt", ...), or product weights detected as nutrient values ("25g"), were included in the crop
  2. recommended daily intake percentages of nutrients were not necessarily included in the crop, as they are not something we detect as nutrient mentions.
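The minimal-bounding-box step can be sketched as follows (the `(x_min, y_min, x_max, y_max)` box format is a simplification of the OCR output):

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def minimal_crop(word_boxes: Iterable[Box]) -> Box:
    """Smallest box containing every detected nutrient mention/value box."""
    boxes = list(word_boxes)
    if not boxes:
        raise ValueError("no nutrient words detected")
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )
```

This also makes issue (1) concrete: a single false-positive word far from the table (e.g. "salt" in the ingredient list) is enough to inflate the crop, since the union box must contain it.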

I started thinking about using clustering to filter out outliers and solve issue (1), when I realized we could instead use a supervised machine learning model to detect which words belong to the nutrition table.

Using the annotated dataset + JSON OCRs, we train a graph model that predicts whether each word is part of the nutrition table, based on the word's content and its neighbors. The object detection model only uses the raw image as input and has no access to the text content, which explains why it doesn't perform as well (it probably relies mainly on table shapes to detect nutrition tables).

I expect this model to perform much better than the object detector. It would also detect nutrition information displayed as plain text, which is something the object detector struggled with (unsurprisingly).
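A sketch of the kind of per-word input such a model could consume, combining word content with the spatial neighborhood (the `Word` structure, regexes, and feature set are hypothetical; a real implementation would feed richer features into a proper graph or sequence model):

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    x: float  # box center, normalized to [0, 1]
    y: float


# Hypothetical, simplified matchers for illustration.
MENTION_RE = re.compile(r"energy|salt|sugar|fat|protein", re.IGNORECASE)
VALUE_RE = re.compile(r"^\d+(?:[.,]\d+)?(?:g|mg|kj|kcal)?$", re.IGNORECASE)


def word_features(words: List[Word], i: int, radius: float = 0.1) -> List[float]:
    """Features for word i: its own content plus its spatial neighborhood."""
    w = words[i]
    neighbours = [
        o for j, o in enumerate(words)
        if j != i and abs(o.x - w.x) < radius and abs(o.y - w.y) < radius
    ]
    return [
        float(bool(MENTION_RE.search(w.text))),  # looks like a nutrient name
        float(bool(VALUE_RE.match(w.text))),     # looks like a value ("15g")
        float(len(neighbours)),                  # local word density
        float(sum(bool(MENTION_RE.search(n.text)) for n in neighbours)),
    ]
```

These per-word vectors would then be fed to a classifier trained on the annotated dataset, with the predicted in-table words defining the crop.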

ItshMoh commented 1 year ago

Hey @raphael0202, can you guide me on where I should start with this problem? Could you share the dataset of nutrition table images?