openfoodfacts / openfoodfacts-ai

This is a tracking repo for all our AI projects. 🍕 🤖🍼

Extract automatically ingredient list from OCR #242

Open raphael0202 opened 1 year ago

raphael0202 commented 1 year ago

Why is it important

Knowing the ingredients of products is really important in Open Food Facts: the ingredient list is used to compute the NOVA group (transformation score) and to warn users with allergies or intolerances that some products are not suitable for them. It's also likely that the ingredient list will be used in future versions of the Ecoscore, the environmental score used on Open Food Facts.

The current process for ingredient extraction is the following:

This manual approach takes time, and most contributors don't extract the ingredients. As of December 2022, out of 2.7M products, 1.9M don't have a complete ingredient list.

Proposal

We would like to automatically extract the ingredient list from image OCRs. As OCR is performed on all images, we already have the text; what remains is to find the beginning and end of the ingredient list.

A sequence tagger (NER-like model) can be trained to detect the beginning and end of the ingredient list (if any). Open Food Facts is a global food database, so don't expect a single language to be present on the photos: the detector should work on at least the most common languages (FR, EN, ES, DE, IT, NL...).
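To make the tagging idea concrete, here is a minimal sketch of how per-token tagger output could be turned into ingredient list spans. It assumes a BIO tagging scheme with a hypothetical `ING` label (`B-ING`, `I-ING`, `O`) and whitespace tokenization; neither is specified in this issue, and the tags below are hand-written rather than produced by a model.

```python
from typing import List, Tuple


def extract_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Group BIO tags into (start, end) token spans, end exclusive.

    The "B-ING" / "I-ING" / "O" label names are an assumption made
    for this sketch, not an existing Open Food Facts convention.
    """
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B-ING":
            # A new list begins: close any span that is still open.
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "I-ING":
            # Tolerate an I- tag without a preceding B- tag.
            if start is None:
                start = i
        else:
            # Any other tag ends the current span.
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


# Hand-written example standing in for OCR text and model predictions.
tokens = "Nutrition facts Ingredients: wheat flour, sugar, salt. Best before 2024".split()
tags = ["O", "O", "O", "B-ING", "I-ING", "I-ING", "I-ING", "O", "O", "O"]

spans = extract_spans(tags)
ingredient_lists = [" ".join(tokens[s:e]) for s, e in spans]
```

Returning a list of spans rather than a single one leaves room for photos that carry the ingredient list in several languages, or none at all.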

We don't have any labeled dataset for this task.

Google Cloud Vision (the service we use for OCR) doesn't always handle line continuation well (that is, linking detected words to form a sentence). However, in a manual analysis of ingredient list images, this issue occurred in only ~4.1% of cases (9/217). We therefore rely on Cloud Vision block detection, keeping in mind that the ingredient list may occasionally be split into several parts due to incorrect block detection.
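As an illustration of what relying on block detection could look like, the sketch below rebuilds one string per block from a Cloud Vision style `fullTextAnnotation` JSON object (pages, blocks, paragraphs, words, symbols, with an optional `detectedBreak` on each symbol). The `word` helper and the sample data are hypothetical and only mimic that layout; this is not the actual Open Food Facts pipeline.

```python
from typing import List, Optional

# Break types after which a space is inserted. A HYPHEN break marks a
# word split across lines, so no space is added in that case.
SPACE_BREAKS = {"SPACE", "SURE_SPACE", "EOL_SURE_SPACE", "LINE_BREAK"}


def block_texts(annotation: dict) -> List[str]:
    """Return the text of each block, in the order blocks appear."""
    texts = []
    for page in annotation.get("pages", []):
        for block in page.get("blocks", []):
            parts = []
            for paragraph in block.get("paragraphs", []):
                for word_ in paragraph.get("words", []):
                    for symbol in word_.get("symbols", []):
                        parts.append(symbol.get("text", ""))
                        detected = symbol.get("property", {}).get("detectedBreak", {})
                        if detected.get("type") in SPACE_BREAKS:
                            parts.append(" ")
            texts.append("".join(parts).strip())
    return texts


def word(text: str, break_type: Optional[str] = None) -> dict:
    """Hypothetical helper: build a word dict, one symbol per character."""
    symbols = [{"text": ch} for ch in text]
    if break_type is not None:
        symbols[-1]["property"] = {"detectedBreak": {"type": break_type}}
    return {"symbols": symbols}


# Made-up sample where one ingredient list was split into two blocks.
sample = {
    "pages": [
        {
            "blocks": [
                {"paragraphs": [{"words": [
                    word("Ingredients:", "SPACE"),
                    word("flour,", "EOL_SURE_SPACE"),
                ]}]},
                {"paragraphs": [{"words": [
                    word("sugar,", "SPACE"),
                    word("salt.", "LINE_BREAK"),
                ]}]},
            ]
        }
    ]
}

blocks = block_texts(sample)
joined = " ".join(blocks)
```

Here the list ends up spread over two blocks, which is the failure mode described above; a detector working block by block would need to merge adjacent spans to recover the full list.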

Documentation about OCR process: https://wiki.openfoodfacts.org/OCR

Requirements

You can use the framework you like, but the model should be exportable to either ONNX or SavedModel format (we use Triton to serve ML models).

raphael0202 commented 1 year ago

The annotation campaign has started here: https://annotate.openfoodfacts.org/projects/1/data