Explore clustering of products based on their images

raphael0202 commented 1 year ago

The Open Food Facts database contains about 6 million images of food products. So far, we have never used this dataset to try to cluster or classify products based on their images.

Depending on the results, possible use cases are:

- detect duplicated products
- predict product category
- enable a new (visual) way of exploring Open Food Facts

Steps

Download images for a subset (~200k) of Open Food Facts products.
- Make sure this dataset contains non-French products (France currently still accounts for ~50% of all products).
- We have resized versions of images available (e.g. https://world-fr.openfoodfacts.org/images/products/871/132/737/4171/8.400.jpg); it may be a good idea to use those.
- Be gentle with Open Food Facts servers: limit the number of parallel image downloads and notify us on Slack when you start bulk downloading images.
Generate embeddings of these images using a computer vision model.
- If you're unsure which models to try, here is a list that was used for another project and that you may find relevant: https://openfoodfacts.github.io/robotoff/research/logo-detection/benchmark/
Compute nearest neighbors for each image (on our benchmark, cosine similarity is a good metric).
Build a visualization demo (we've used Streamlit extensively at Open Food Facts and it works well). Possible visualization: select a random image in the dataset and display its nearest neighbors.
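
As a rough illustration of the nearest-neighbor step, here is a minimal sketch using brute-force cosine similarity with NumPy. It assumes the embeddings have already been computed and stacked into a 2D array (one row per image); the function name `cosine_nearest_neighbors` and the parameter `k` are made up for this example and are not part of any Open Food Facts code.

```python
from __future__ import annotations

import numpy as np


def cosine_nearest_neighbors(
    embeddings: np.ndarray, k: int = 5
) -> tuple[np.ndarray, np.ndarray]:
    """Return (indices, similarities) of the k nearest neighbors of every row.

    `embeddings` is assumed to be an (n_images, dim) float array.
    """
    # L2-normalize the rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T
    # Exclude each image from its own neighbor list.
    np.fill_diagonal(sims, -np.inf)
    # Sort each row by decreasing similarity and keep the first k columns.
    order = np.argsort(-sims, axis=1)[:, :k]
    top = np.take_along_axis(sims, order, axis=1)
    return order, top


if __name__ == "__main__":
    # Random vectors stand in for real image embeddings here.
    rng = np.random.default_rng(0)
    demo = rng.normal(size=(100, 32))
    idx, sim = cosine_nearest_neighbors(demo, k=3)
    print(idx[0], sim[0])
```

This builds the full n x n similarity matrix, so it is only practical on a sample. At ~200k images the matrix would not fit in memory, and the computation would have to be chunked or replaced by an approximate nearest-neighbor index.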

aadya940 commented 1 year ago

If our application is to detect duplicated products from images, can we use a Siamese network?

cfrancois7 commented 1 year ago

@aadya940 It depends. A Siamese network is well suited to learning a similarity metric between two images, but that metric has to be learned from a correctly labeled dataset. What we want here is to clean the dataset by dropping the duplicates. The best approach is simply to look at the embeddings produced by a model such as a CNN or ViT used as a feature extractor.

In my view, the bigger challenge is how to reduce dimensionality so that we detect not only exact duplicates but also "almost" duplicates, for example pictures with a small horizontal or vertical shift.
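
To make the idea concrete, here is a minimal sketch of one possible pipeline: reduce the embeddings with PCA, then flag pairs whose cosine similarity in the reduced space exceeds a threshold. PCA is only one choice of dimensionality reduction and is not necessarily the method of the paper linked below; the function names, `n_components`, and the 0.95 threshold are assumptions made for illustration and would need tuning on real data.

```python
from __future__ import annotations

import numpy as np


def pca_project(embeddings: np.ndarray, n_components: int) -> np.ndarray:
    """Project embeddings onto their first `n_components` principal axes."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def near_duplicate_pairs(
    vectors: np.ndarray, threshold: float = 0.95
) -> list[tuple[int, int, float]]:
    """Return (i, j, similarity) for every pair i < j above the threshold."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T
    # Only look at the strict upper triangle so each pair is reported once.
    rows, cols = np.triu_indices(len(sims), k=1)
    keep = sims[rows, cols] >= threshold
    return [
        (int(i), int(j), float(sims[i, j]))
        for i, j in zip(rows[keep], cols[keep])
    ]


if __name__ == "__main__":
    # Toy data: row 1 is a slightly perturbed copy of row 0.
    toy = np.array(
        [
            [4.0, 0.0, 0.0, 0.0],
            [4.0, 0.01, 0.0, 0.0],
            [0.0, 4.0, 0.0, 0.0],
            [0.0, 0.0, 4.0, 0.0],
            [0.0, 0.0, 0.0, 4.0],
        ]
    )
    print(near_duplicate_pairs(pca_project(toy, n_components=3)))
```

Whether embeddings of shifted pictures stay close enough for such a threshold to work depends on the feature extractor, which is exactly what would need to be checked experimentally.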

I'll base my work on this paper: https://www.nature.com/articles/s42003-022-03628-x

@raphael0202 I downloaded the JSON dump but could not find the dump of the small images. How can I retrieve them properly? What should we keep as the "id" to track which images are duplicates? Thanks in advance.

raphael0202 commented 1 year ago

Hi @cfrancois7! How to fetch images is described in the wiki: https://wiki.openfoodfacts.org/Developer-How_To

If you're interested in (near-)duplicate detection, I've found this blog post, which nicely summarizes effective (non-ML) techniques for this task: https://kandepet.com/detecting-similar-and-identical-images-using-perseptual-hashes/

The image path (/252/506/2524/1.jpg for example) is a unique identifier of the image, as it includes both the product barcode and the (auto-incrementing) image ID.

For information, an ML teacher told me last week that he was about to start working on this project (exploring products through image embeddings) with his students this week, but nobody is working on near-duplicate image detection :)
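
For tracking duplicates, the image path can be split back into its parts. The sketch below is a hypothetical helper, not existing Open Food Facts code: it assumes the numeric folder names concatenate to the barcode, and that a file name such as `8.400.jpg` means image ID 8 resized to 400 (inferred from the resized-image URL quoted earlier in this thread).

```python
from __future__ import annotations

from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class ImageRef(NamedTuple):
    barcode: str
    image_id: str
    resolution: Optional[str]  # e.g. "400" for a resized image, else None


def parse_image_path(path: str) -> ImageRef:
    """Split an image path into (barcode, image ID, resolution)."""
    p = PurePosixPath(path)
    # Assumption: the all-digit folder names, joined in order, are the barcode.
    barcode = "".join(part for part in p.parent.parts if part.isdigit())
    # Assumption: the file name is "<id>.jpg" or "<id>.<resolution>.jpg".
    pieces = p.name.split(".")
    image_id = pieces[0]
    resolution = pieces[1] if len(pieces) == 3 else None
    return ImageRef(barcode, image_id, resolution)


if __name__ == "__main__":
    print(parse_image_path("/252/506/2524/1.jpg"))
    print(parse_image_path("/images/products/871/132/737/4171/8.400.jpg"))
```

Keeping the `(barcode, image_id)` pair next to each embedding makes it possible to tell whether two near-duplicate images belong to the same product or to two different products.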

openfoodfacts / openfoodfacts-ai

Explore clustering of products based on their images #203
