Closed raphael0202 closed 1 week ago
@raphael0202 related: https://github.com/openfoodfacts/openfoodfacts-server/issues/1025
Hey, is it possible to get a dataset with vandalized fields, to see whether an LLM can filter them? GPT-4o is pretty affordable, so I had in mind to test it. Thanks!
We don't have a dataset of vandalized products, however we have a dataset of all changes (with comments) since 2018: http://static.openfoodfacts.org/data/openfoodfacts_recent_changes.jsonl.gz.
You can use this dataset to spot products that were vandalized, using the "comment" field (contributors who revert vandalism may say so in the comment). Then fetch the historical version of the product before the fix using the rev field of the API, e.g.: https://world.openfoodfacts.org/api/v2/product/20267605?rev=341
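As a rough illustration of the first step, here is a minimal sketch that streams the recent-changes dump and keeps the records whose comment hints at a vandalism fix. The keyword pattern and the helper names (`find_revert_candidates`, `scan_dump`) are assumptions made for this example, not part of any Open Food Facts tooling, and real revert comments may be worded differently or in other languages.

```python
import gzip
import json
import re
from typing import Iterable, Iterator

# Assumption: words a contributor might use in a comment when undoing vandalism.
REVERT_PATTERN = re.compile(r"\b(vandal\w*|revert\w*)\b", re.IGNORECASE)


def find_revert_candidates(lines: Iterable[str]) -> Iterator[dict]:
    """Yield change records whose "comment" field matches REVERT_PATTERN."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            change = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort the scan
        if not isinstance(change, dict):
            continue
        comment = change.get("comment") or ""
        if REVERT_PATTERN.search(comment):
            yield change


def scan_dump(path: str) -> Iterator[dict]:
    """Stream a local copy of the gzipped JSONL dump line by line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        yield from find_revert_candidates(f)
```

Calling `scan_dump("openfoodfacts_recent_changes.jsonl.gz")` on a downloaded copy would then give candidate changes to review by hand; a keyword match only suggests a revert, it does not confirm one.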
Thanks, that's a good starting point to get examples of vandalized products
I am starting to work on a PR
git push --set-upstream origin vandalism
ERROR: Permission to openfoodfacts/openfoodfacts-ai.git denied to baslia.
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
Is there a convention for the branch name that I should use? If not, could I be given access to create a branch and push to it?
I am planning to first find vandalized products in the historical data, then build a model to flag them. I am not sure what the best approach is yet (Machine Learning or LLM), but I will figure this part out later.
Let me know if that works for you.
Thanks! Adel
I am also stuck on calling the API with the rev field: I can identify the rev needed, but I am struggling with the product id that is used in the API URL:
Hello @baslia, I hadn't looked at my GitHub notifications for a long time, sorry for that! If you need some assistance you can always ping me on our Slack (https://slack.openfoodfacts.org, username: raphael)
Is there a convention for the branch name that I should use? If not, could I be given access to create a branch and push to it?
I will give you access to the repo, then you will be able to create a branch, with a new folder for this project.
I am planning to first find vandalized products in the historical data, then build a model to flag them.
Seems like a good plan :+1:
I am also stuck on calling the API with the rev field: I can identify the rev needed, but I am struggling with the product id that is used in the API URL:
The "product id" is simply the code field in your table. So the URL is:
https://world.openfoodfacts.org/api/v2/product/{code}?rev={rev}
Please note that this endpoint is rate-limited to 100 requests per minute; above that limit you get an HTTP 429 response.
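To make the URL pattern and the rate limit concrete, here is a minimal sketch using only the Python standard library. The helper names, the retry count, and the 60-second back-off are assumptions chosen for illustration; only the URL shape, the 100 requests per minute limit, and the HTTP 429 status come from the discussion above.

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request

API_ROOT = "https://world.openfoodfacts.org/api/v2/product"
# Spacing calls by this many seconds keeps a single client at 100 req/min.
MIN_INTERVAL = 60 / 100


def product_revision_url(code: str, rev: int) -> str:
    """Build the URL for a given product code and revision number."""
    return f"{API_ROOT}/{urllib.parse.quote(str(code))}?rev={int(rev)}"


def fetch_revision(code: str, rev: int, retries: int = 3, backoff: float = 60.0) -> dict:
    """Fetch one historical revision, waiting and retrying on HTTP 429."""
    url = product_revision_url(code, rev)
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return json.load(resp)
        except urllib.error.HTTPError as err:
            # Re-raise anything that is not a rate-limit error,
            # and also the last 429 once retries are exhausted.
            if err.code != 429 or attempt == retries - 1:
                raise
            time.sleep(backoff)  # assumed back-off; tune as needed
    raise ValueError("retries must be at least 1")
```

When looping over many (code, rev) pairs, calling `time.sleep(MIN_INTERVAL)` between requests should keep the client under the limit without relying on the 429 retry path.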
Some vandals update textual fields with swearwords or similar content. Investigate the use of NLP models (pre-trained or custom) to find these products.
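Before trying a pre-trained or custom model, a plain keyword lookup can serve as a baseline to compare against. The sketch below is that baseline, not an NLP model: the blocklist entries are hypothetical placeholders, and the default field names are an assumption about which textual fields are worth checking.

```python
import re

# Hypothetical placeholder lexicon; a real one would be per-language and far larger.
BLOCKLIST = {"badword", "swearword"}
TOKEN = re.compile(r"\w+")


def flag_text_fields(product: dict, fields=("product_name", "ingredients_text")) -> dict:
    """Return {field: [matched words]} for fields containing blocklisted tokens."""
    flagged = {}
    for field in fields:
        value = product.get(field)
        if not isinstance(value, str):
            continue  # missing or non-textual field
        hits = sorted({tok for tok in TOKEN.findall(value.lower()) if tok in BLOCKLIST})
        if hits:
            flagged[field] = hits
    return flagged
```

A lookup like this misses misspellings and flags legitimate words that happen to be on the list, which is part of the motivation for investigating NLP models.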