Why is it important

Knowing the ingredients of products is really important in Open Food Facts, as the ingredient list is used to compute the NOVA group (a food processing score) and to warn users with allergies or intolerances that some products are not suitable for them. It's also likely that the ingredient list will be used in future versions of the Ecoscore, the environmental score used on Open Food Facts.
The current process for ingredient extraction is the following:

1. the user uploads an image, crops it to keep only the ingredient list, and selects it as an ingredient image
2. the user extracts the ingredient list by clicking on a button (OCR is performed)
3. the user fixes any OCR errors in the text and saves the product
This manual approach takes time, and most contributors don't extract the ingredients: as of December 2022, 1.9M of the 2.7M products in the database don't have a completed ingredient list.
Proposal
We would like to automatically extract the ingredient list from image OCRs. As OCR is already performed on all images, we already have the text; what remains to be done is to find the beginning and end of the ingredient list.
A sequence tagger (NER-like model) can be trained to detect the beginning and end of the ingredient list (if any). Open Food Facts is a global food database, so a single language cannot be assumed on the photos: the detector should work on at least the most common languages (FR, EN, ES, DE, IT, NL, ...).
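To make the sequence-tagging idea concrete, here is a minimal sketch of decoding a tagger's per-token predictions back into an ingredient-list span, assuming a simple BIO scheme (`B-ING`/`I-ING`/`O`). The tag names and helper are hypothetical, not an existing Open Food Facts API:

```python
def decode_ingredient_span(tokens, tags):
    """Recover the (start, end) token span tagged as ingredient list.

    `tags` follows a BIO scheme: B-ING marks the first token of the
    ingredient list, I-ING its continuation, O everything else.
    Returns None if no ingredient list was detected.
    """
    start = None
    for i, tag in enumerate(tags):
        if tag == "B-ING" and start is None:
            start = i
        elif tag == "O" and start is not None:
            return (start, i)
    return (start, len(tags)) if start is not None else None


# Hypothetical OCR tokens and model predictions for one image
tokens = ["Net", "weight:", "200g", "Ingredients:", "wheat", "flour,",
          "sugar,", "salt", "Best", "before"]
tags = ["O", "O", "O", "B-ING", "I-ING", "I-ING",
        "I-ING", "I-ING", "O", "O"]

span = decode_ingredient_span(tokens, tags)
# span covers "Ingredients: wheat flour, sugar, salt"
```

A real tagger would emit these tags per subword token; decoding back to a character span in the original OCR text works the same way.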
We don't have any labeled dataset for this task.
Google Cloud Vision (the service we use for OCR) doesn't always detect line continuations well (i.e., how to link detected words into a sentence), but based on a manual analysis of ingredient list images, this issue only occurs in ~4.1% of cases (9/217). We therefore rely on Cloud Vision's block detection, keeping in mind that the ingredient list may occasionally be split into several parts due to incorrect block detection.
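As an illustration of the block-based approach, here is a sketch of assembling one text string per detected block from a Cloud Vision `fullTextAnnotation`, following its pages > blocks > paragraphs > words > symbols hierarchy. The response dict below is a minimal hand-built stand-in for a real API response:

```python
def extract_block_texts(full_text_annotation):
    """Assemble one text string per detected block from a Cloud Vision
    fullTextAnnotation dict (pages > blocks > paragraphs > words > symbols)."""
    block_texts = []
    for page in full_text_annotation.get("pages", []):
        for block in page.get("blocks", []):
            words = []
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    # Each word is a list of single-character (or short) symbols
                    words.append("".join(s["text"] for s in word.get("symbols", [])))
            block_texts.append(" ".join(words))
    return block_texts


# Minimal fake annotation: two blocks, one of them the ingredient list
annotation = {
    "pages": [{"blocks": [
        {"paragraphs": [{"words": [
            {"symbols": [{"text": "Ingredients"}, {"text": ":"}]},
            {"symbols": [{"text": "sugar"}]},
        ]}]},
        {"paragraphs": [{"words": [
            {"symbols": [{"text": "200g"}]},
        ]}]},
    ]}]
}

block_texts = extract_block_texts(annotation)
# → ["Ingredients: sugar", "200g"]
```

Real responses also carry word/symbol break information (spaces, line breaks) that a production implementation would use instead of naively joining with spaces.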
Documentation about the OCR process: https://wiki.openfoodfacts.org/OCR
Requirements
You can use whichever framework you like, but the model should be exportable to either ONNX or SavedModel format (we use Triton to serve ML models).