openfoodfacts / openfoodfacts-server

Open Food Facts database, API server and web interface - 🐪🦋 Perl, CSS and JS coders welcome 😊 For helping in Python, see Robotoff or taxonomy-editor
http://openfoodfacts.github.io/openfoodfacts-server/
GNU Affero General Public License v3.0

Products' duplicates in the database #7706

Open CharlesNepote opened 1 year ago

CharlesNepote commented 1 year ago

Describe the bug

The database contains a few duplicate products (~30). They can be seen in several places.

It seems to be caused by the _id being stored as a number at first, and later as a string.

To Reproduce

In the JSONL export:

$ zcat openfoodfacts-products.jsonl.gz | jq -c '. | select(.code == "0071923722898")'
# two objects are returned:
# * one containing {"_id":71923722898,"code":"0071923722898" [...]
# * and the other beginning with {"_id":"0071923722898","code":"0071923722898" [...]
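
The same check can be sketched in Python to list every duplicated code together with the type of each `_id`, which makes the number-versus-string mismatch visible. This is a minimal sketch, not project code: the function name `find_duplicate_codes` and the third sample record are made up for illustration, and only the `_id` and `code` fields are assumed to exist.

```python
import json
from collections import defaultdict


def find_duplicate_codes(lines):
    """Group JSONL records by `code` and return the codes seen more than once.

    Each returned value lists the `_id` of every record sharing that code,
    paired with the name of its Python type ('int' or 'str').
    """
    ids_by_code = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        product = json.loads(line)
        ids_by_code[product.get("code")].append(product.get("_id"))
    return {
        code: [(_id, type(_id).__name__) for _id in ids]
        for code, ids in ids_by_code.items()
        if len(ids) > 1
    }


# The first two records mirror the ones quoted in this issue;
# the third is a hypothetical non-duplicated product.
sample = [
    '{"_id": 71923722898, "code": "0071923722898"}',
    '{"_id": "0071923722898", "code": "0071923722898"}',
    '{"_id": "0000000000017", "code": "0000000000017"}',
]
duplicates = find_duplicate_codes(sample)
print(duplicates)
```

To run it against the real export, the lines could be read with `gzip.open("openfoodfacts-products.jsonl.gz", "rt", encoding="utf-8")` and passed to the function directly, since it accepts any iterable of lines.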

In the CSV export:

$ grep "0071923722898" en.openfoodfacts.org.products.csv | cut -c 1-90
0071923722898   http://world-en.openfoodfacts.org/product/0071923722898/frosted-flakes-sweet
0071923722898   http://world-en.openfoodfacts.org/product/0071923722898/frosted-flakes-hospi

In Mirabelle (based on the CSV export): http://mirabelle.openfoodfacts.org/products?sql=--+identify+duplicates%0D%0Aselect+rowid%2C+code%2C+url%2C+count%28*%29+as+%22count%22+from+%5Ball%5D+group+by+code+having+count%28*%29+%3E+1%3B This query lists every code that appears more than once; it returned 33 duplicated products as of 2022-11-16.
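
For readers who do not want to decode the URL, the query it carries can be tried locally with Python's built-in `sqlite3` module. This is only a sketch under assumptions: the table here has just two columns, the URLs are placeholders, and the second product code is hypothetical; the real Mirabelle table has many more columns.

```python
import sqlite3

# In-memory stand-in for the Mirabelle table, which is named "all".
conn = sqlite3.connect(":memory:")
conn.execute("create table [all] (code text, url text)")
conn.executemany(
    "insert into [all] (code, url) values (?, ?)",
    [
        ("0071923722898", "placeholder-url-a"),  # duplicated code from this issue
        ("0071923722898", "placeholder-url-b"),
        ("0000000000017", "placeholder-url-c"),  # hypothetical, not duplicated
    ],
)

# The query from the Mirabelle link, URL-decoded.
query = """
-- identify duplicates
select rowid, code, url, count(*) as "count"
from [all]
group by code
having count(*) > 1;
"""
rows = conn.execute(query).fetchall()
for rowid, code, url, count in rows:
    print(code, count)
```

Because `rowid` and `url` are not aggregated, SQLite returns them from an arbitrary row of each group, so only `code` and `count` should be relied upon.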

Expected behavior

No duplicates.

CharlesNepote commented 1 year ago

I'm afraid there is a bug in the current code because the number of duplicates is increasing: 42 as of 2022-12-02 vs 33 as of 2022-11-16.

https://mirabelle.openfoodfacts.org/products?sql=--+identify+duplicates%0D%0Aselect+rowid%2C+code%2C+url%2C+count%28*%29+as+%22count%22+from+%5Ball%5D+group+by+code+having+count%28*%29+%3E+1%3B

alexgarel commented 1 year ago

Relates to https://github.com/openfoodfacts/openfoodfacts-server/issues/7248, which could monitor and fix this kind of thing.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 90 days with no activity.

teolemon commented 2 months ago

Screenshot_20240817-214449.png

Screenshot_20240817-214441.png

Screenshot_20240817-214613.png

Bumping to P1, as it took a conversation off topic :-/