nlp-with-transformers / notebooks

Jupyter notebooks for the Natural Language Processing with Transformers book
https://transformersbook.com/
Apache License 2.0

Codeparrot dataset flagged by Hugging Face as unsafe #128

Closed pantelis closed 10 months ago

pantelis commented 10 months ago

Information

The question or comment is about chapter:

Question or comment

In https://huggingface.co/datasets/transformersbook/codeparrot there is one file that is flagged as unsafe with the information "Virus: Legacy.Trojan.Agent-37025". Has anyone else verified this, or is it a false positive from the virus scanners used by HF? In any case, does anyone know of, or has anyone used, an alternative dataset?

pantelis commented 10 months ago

Closing this issue after going through some posts in the HF community, e.g. https://discuss.huggingface.co/t/trojan-in-common-voice-dataset/18155, which indicate that these are indeed false positives: the files may contain the source code of viruses or some string that matches a virus signature.
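
For anyone wondering how a plain text file can trip a scanner, here is a minimal sketch of naive signature matching. It is only an illustration: the signature name and byte pattern are made up, `scan_bytes` is a hypothetical helper, and real scanners (including whatever HF runs) use far more elaborate rules than a substring check.

```python
from typing import Dict, List

# Hypothetical signature database: name -> byte pattern.
# Both the name and the pattern are invented for this illustration.
SIGNATURES: Dict[str, bytes] = {
    "Example.Trojan.Demo": b"EVIL_PAYLOAD_MARKER",
}


def scan_bytes(data: bytes) -> List[str]:
    """Return the names of all signatures whose byte pattern occurs in data."""
    return [name for name, pattern in SIGNATURES.items() if pattern in data]


# A harmless source file that merely mentions the marker in a comment.
benign_source = b"# detector for samples containing EVIL_PAYLOAD_MARKER\nprint('hello')\n"
# A source file that does not contain the marker at all.
clean_source = b"print('hello')\n"

print(scan_bytes(benign_source))  # flagged, although the file is only text
print(scan_bytes(clean_source))   # no match
```

A matcher like this cannot tell executable malware from source code or documentation that happens to contain the same bytes, which would explain a flag on a dataset made of scraped code.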