Investigate the use of NLP models to detect vandalism on textual fields

raphael0202 commented 2 months ago

Some vandals update textual fields with swearwords or similar content. Investigate the use of NLP models (pre-trained or custom) to find these products.

teolemon commented 2 months ago

baslia commented 2 months ago

Hey, is there a dataset of vandalized fields available, so that we can check whether an LLM can filter them? ChatGPT-4o is pretty affordable, so I was thinking of testing it. Thanks!

raphael0202 commented 1 month ago

We don't have a dataset of vandalized products, but we do have a dataset of all changes (with comments) since 2018: http://static.openfoodfacts.org/data/openfoodfacts_recent_changes.jsonl.gz. You can use it to spot products that were vandalized through the "comment" field (contributors who revert vandalism may say so in their comment). Then fetch the version of the product from before the fix using the rev parameter of the API, e.g. https://world.openfoodfacts.org/api/v2/product/20267605?rev=341
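A minimal sketch of that first step, assuming the dump has been downloaded locally and that each JSONL line is one change record with a "comment" field. The keyword list is hypothetical, and the "code" and "rev" keys used in the usage note are assumptions about the dump's schema, so check them against a few real lines first.

```python
import gzip
import json
from typing import Iterable, Iterator

# Hypothetical keywords; adjust them to the wording contributors actually use.
REVERT_KEYWORDS = ("vandal", "revert", "spam")


def iter_changes(path: str) -> Iterator[dict]:
    """Yield one change record per non-empty line of the gzipped JSONL dump."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def find_suspected_reverts(changes: Iterable[dict]) -> Iterator[dict]:
    """Keep the changes whose comment mentions one of the keywords."""
    for change in changes:
        # The comment may be missing or null, so fall back to an empty string.
        comment = (change.get("comment") or "").lower()
        if any(keyword in comment for keyword in REVERT_KEYWORDS):
            yield change
```

Something like `for change in find_suspected_reverts(iter_changes("openfoodfacts_recent_changes.jsonl.gz")): print(change.get("code"), change.get("rev"), change.get("comment"))` would then list candidate fixes. A keyword match only finds the revert, so the vandalized content is in an earlier revision of the same product.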

baslia commented 1 month ago

Thanks, that's a good starting point for collecting examples of vandalized products.

baslia commented 1 month ago

I am starting to work on a PR

git push --set-upstream origin vandalism
ERROR: Permission to openfoodfacts/openfoodfacts-ai.git denied to baslia.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

Is there a convention for the branch I should use? If not, could I get permission to create a branch and submit it?

I am planning to first find vandalized products in the historical data, then build a model to flag them. I am not sure yet which approach is best (classic machine learning or an LLM), but I will figure that part out later.

Let me know if that works for you.

Thanks! Adel

baslia commented 1 month ago

I am also stuck on calling the API with the rev field. I can identify the revision I need, but I am struggling with the product id used in the API URL:

raphael0202 commented 1 month ago

Hello @baslia, I hadn't looked at my GitHub notifications for a long time, sorry about that! If you need assistance, you can always ping me on our Slack (https://slack.openfoodfacts.org, username: raphael).

> Is there a convention for the branch I should use? If not, could I get permission to create a branch and submit it?

I will give you access to the repo, then you will be able to create a branch, with a new folder for this project.

> I am planning to first find vandalized products in the historical data, then build a model to flag them.

Seems like a good plan :+1:

> I am also stuck on calling the API with the rev field. I can identify the revision I need, but I am struggling with the product id used in the API URL:

The "product id" is simply the code field in your table. So the URL is:

https://world.openfoodfacts.org/api/v2/product/{code}?rev={rev}

Please note that this endpoint is rate-limited to 100 requests per minute; beyond that, you get an HTTP 429 response.
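For illustration, here is a rough sketch of fetching a revision while staying under that limit, using only the Python standard library. The function names, the pause of 0.6 s between calls, and the 60 s wait after a 429 are assumptions made for this example, not something the API prescribes.

```python
import json
import time
import urllib.error
import urllib.request

API_URL = "https://world.openfoodfacts.org/api/v2/product/{code}?rev={rev}"
MAX_REQUESTS_PER_MINUTE = 100
# Spacing calls evenly keeps a sequential client under the limit.
MIN_INTERVAL = 60.0 / MAX_REQUESTS_PER_MINUTE


def build_revision_url(code, rev):
    """Fill in the product code and revision number."""
    return API_URL.format(code=code, rev=rev)


def http_get_json(url):
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def fetch_revision(code, rev, get=http_get_json, sleep=time.sleep,
                   max_retries=3, retry_delay=60.0):
    """Fetch one product revision, waiting and retrying on HTTP 429."""
    url = build_revision_url(code, rev)
    for attempt in range(max_retries + 1):
        try:
            result = get(url)
        except urllib.error.HTTPError as error:
            # Re-raise anything that is not a rate-limit error,
            # and give up once the retries are exhausted.
            if error.code != 429 or attempt == max_retries:
                raise
            sleep(retry_delay)  # assumed back-off, not a documented value
        else:
            sleep(MIN_INTERVAL)  # pace the next call
            return result
```

Calling `fetch_revision("20267605", 341)` requests the example URL given earlier in the thread. The `get` and `sleep` parameters exist only so the pacing logic can be exercised without hitting the network.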

openfoodfacts / openfoodfacts-ai

Investigate the use of NLP models to detect vandalism on textual fields #346