openfoodfacts / off-category-classification

GNU Affero General Public License v3.0

Use NLP on OCR as a complementary feature for off-category-classification #7

Open alexgarel opened 2 years ago

alexgarel commented 2 years ago

The first idea is to inject the whole OCR text into the features (possibly as an input distinct from the other features) and see what we can do. As OCR text contains a lot of noise, some kind of attention mechanism might be necessary.

Follow the existing approach of keeping preprocessing inside the TensorFlow pipeline, to ease model deployment.

teolemon commented 2 years ago

http://static.openfoodfacts.org/images/products/ocr.jsonl.gz

alexgarel commented 2 years ago

In the above file:

The "source" field gives the source file name, e.g. "source": "/50414727/1.json", and the "content" field contains the OCR data.

The barcode is given by the folder names, and the file name corresponds to the photo number.

For extracting the barcode from the path, see https://github.com/openfoodfacts/robotoff/blob/master/robotoff/off.py#L88
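As a rough illustration of the layout described above, here is a minimal sketch that reads the JSONL dump line by line and splits each "source" path into a barcode and a photo number. It is not the robotoff implementation: the names `parse_source`, `iter_records` and `OcrRecord` are hypothetical, and it assumes that a barcode may be spread over several nested folders (so all folder names are concatenated) and that barcodes are purely numeric. The structure of "content" is left untouched.

```python
import gzip
import json
from pathlib import PurePosixPath
from typing import Any, Iterable, Iterator, NamedTuple, Optional, Tuple


class OcrRecord(NamedTuple):
    barcode: str
    image_id: str
    content: Any  # raw OCR payload; its structure is not assumed here


def parse_source(source: str) -> Optional[Tuple[str, str]]:
    """Split a "source" path into (barcode, image_id), or return None.

    Assumption: the barcode is the concatenation of all folder names,
    and the file name without its extension is the photo number.
    """
    path = PurePosixPath(source)
    parts = [p for p in path.parts if p != "/"]
    if len(parts) < 2:
        return None  # no folder component, so no barcode
    barcode = "".join(parts[:-1])
    if not barcode.isdigit():
        return None  # assumption: barcodes are numeric
    return barcode, path.stem


def iter_records(lines: Iterable[str]) -> Iterator[OcrRecord]:
    """Yield one OcrRecord per non-empty JSONL line with a usable source."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        item = json.loads(line)
        parsed = parse_source(item.get("source", ""))
        if parsed is None:
            continue
        barcode, image_id = parsed
        yield OcrRecord(barcode, image_id, item.get("content"))


def iter_dump(path: str) -> Iterator[OcrRecord]:
    """Stream records from a local copy of ocr.jsonl.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        yield from iter_records(f)
```

For example, `parse_source("/50414727/1.json")` gives the barcode `"50414727"` and the photo number `"1"`. Streaming the file rather than loading it whole keeps memory use flat, which matters if the dump is large.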