internetarchive / fatcat

Perpetual Access To The Scholarly Record
https://guide.fatcat.wiki

Spam filter #64

Open bnewbold opened 3 years ago

bnewbold commented 3 years ago

It would be useful to have a naive function that looks at release metadata and detects gratuitous spam. In theory, upstream partner sources should be able to catch spam, but, e.g., today Zenodo had more than 25,000 spam DOIs and PDFs registered:

https://fatcat.wiki/release/search?q=doi_prefix%3A10.5281+date%3A2020-11-02

Most of these have terms like [PDF], EPUB, D.O.W.N.L.O.A.D, etc., which seem like something simple statistical spam detection could find. The goal wouldn't be to make something impenetrable, just to prevent large batches from getting imported. If we had such a function in one place, we could add additional patterns over time, and reuse the function in both automated bot imports (e.g., the DataCite DOI metadata import here) and in a review bot for human edits.
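As a rough illustration of the pattern idea, here is a minimal sketch of a title check. The function name `looks_like_spam_title` and the pattern list are hypothetical; the patterns cover only the three terms mentioned above and have not been tuned against real data.

```python
import re
from typing import Optional

# Hypothetical starter patterns, based only on the terms named in this issue.
SPAM_TITLE_PATTERNS = [
    re.compile(r"\[\s*pdf\s*\]", re.IGNORECASE),  # literal "[PDF]" tag
    re.compile(r"\bepub\b", re.IGNORECASE),       # "EPUB" as a standalone word
    # Letters of "download" separated by punctuation or spaces, as in
    # "D.O.W.N.L.O.A.D". Plain "download" is deliberately not matched.
    re.compile(r"\bd\W+o\W+w\W+n\W+l\W+o\W+a\W+d\b", re.IGNORECASE),
]


def looks_like_spam_title(title: Optional[str]) -> bool:
    """Return True if the title matches any known spam pattern."""
    if not title:
        return False
    return any(p.search(title) for p in SPAM_TITLE_PATTERNS)
```

A single match flags the title here, which is probably too aggressive for a term like EPUB on its own; requiring two or more matches would be one way to reduce false positives.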

AniketShahane commented 3 years ago

@bnewbold Hey, I'd like to work on this. I was thinking we could implement something extremely simple, like a naive Bayes classifier. Before that, though, I've scraped all those spam messages and created a CSV file that contains the occurrences of words in the spam messages. Is there a way we can make use of that right away, and then go on to work on the naive Bayes classifier?
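For reference, a naive Bayes classifier over title words could look roughly like the sketch below. Everything here is an assumption for illustration: the class name, the tokenizer, and training directly from labeled titles (the format of the CSV mentioned above is not given, so it is not used).

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(title: str) -> List[str]:
    """Lowercase the title and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


class NaiveBayesTitleClassifier:
    """Hypothetical two-class ("spam" / "ham") multinomial naive Bayes."""

    def __init__(self) -> None:
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, title: str, label: str) -> None:
        self.word_counts[label].update(tokenize(title))
        self.doc_counts[label] += 1

    def _log_score(self, tokens: List[str], label: str) -> float:
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total = sum(self.word_counts[label].values())
        # Class prior with add-one smoothing over the two classes.
        score = math.log(
            (self.doc_counts[label] + 1) / (sum(self.doc_counts.values()) + 2)
        )
        for tok in tokens:
            # Add-one smoothing; the extra +1 reserves mass for unseen words.
            count = self.word_counts[label][tok]
            score += math.log((count + 1) / (total + len(vocab) + 1))
        return score

    def is_spam(self, title: str) -> bool:
        tokens = tokenize(title)
        return self._log_score(tokens, "spam") > self._log_score(tokens, "ham")
```

A classifier like this needs examples of ordinary titles as well as spam titles, since spam-only word counts give it nothing to compare against.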

bnewbold commented 3 years ago

One idea would be to create a function that identifies all (or almost all) of the spam DOIs, and none from a random subset of "real" releases.
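That acceptance criterion can be checked with a small helper like the one below. This is a sketch only: `evaluate_detector` is a hypothetical name, and it assumes the detector takes a title string, whereas the function proposed later in this thread operates on a release object.

```python
from typing import Callable, Iterable, Tuple


def evaluate_detector(
    detector: Callable[[str], bool],
    spam_titles: Iterable[str],
    real_titles: Iterable[str],
) -> Tuple[float, float]:
    """Return (fraction of spam caught, fraction of real titles flagged)."""
    spam = list(spam_titles)
    real = list(real_titles)
    caught = sum(1 for t in spam if detector(t))
    false_positives = sum(1 for t in real if detector(t))
    spam_rate = caught / len(spam) if spam else 0.0
    false_positive_rate = false_positives / len(real) if real else 0.0
    return spam_rate, false_positive_rate
```

The target described above corresponds to a first value close to 1.0 and a second value of 0.0 on the random sample of real releases.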

The best current place to implement this would be in the "importer" "common" file: https://github.com/internetarchive/fatcat/blob/master/python/fatcat_tools/importers/common.py#L115

A new function like is_spam_release(obj: ReleaseEntity) -> bool. It should maybe be a method on the EntityImporter class if state needs to be loaded (e.g., a file of patterns). I think operating on the release object instead of on a string makes the most sense; initially the function could check only the title field, but could get more or less complex in the future. If you implement such a function, I can wire it up with our existing importer code paths in a separate commit/PR.
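A minimal sketch of that shape, under stated assumptions: `ReleaseStub` is a stand-in for ReleaseEntity that models only the title field, and `SpamChecker` is a hypothetical holder for loaded state that could instead live on EntityImporter. The one-regex-per-line pattern format is also an assumption.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class ReleaseStub:
    """Stand-in for ReleaseEntity; only the title field is modeled."""
    title: Optional[str] = None


class SpamChecker:
    """Hypothetical state holder; patterns are loaded once at construction."""

    def __init__(self, pattern_lines: Iterable[str]) -> None:
        self.patterns: List[re.Pattern] = []
        for line in pattern_lines:
            line = line.strip()
            # Skip blank lines and "#" comments in the pattern source.
            if not line or line.startswith("#"):
                continue
            self.patterns.append(re.compile(line, re.IGNORECASE))

    def is_spam_release(self, obj) -> bool:
        """Check only the title for now; other fields could be added later."""
        title = getattr(obj, "title", None)
        if not title:
            return False
        return any(p.search(title) for p in self.patterns)
```

Because the constructor accepts any iterable of strings, an open pattern file can be passed directly, and tests can pass a plain list.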

Implementation should include tests. These can be unit tests of just the behavior of the one function (as opposed to integration tests), and don't need to be exhaustive (e.g., just a few representative example releases are probably enough).
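Unit tests along those lines might look like the following pytest-style sketch. The `is_spam_release` defined here is a placeholder so the tests run on their own, and `SimpleNamespace` stands in for a release object; neither reflects the actual fatcat code.

```python
import re
from types import SimpleNamespace

# Placeholder pattern and implementation, for illustration only.
_SPAM_RE = re.compile(r"\[pdf\]|\bepub\b", re.IGNORECASE)


def is_spam_release(obj) -> bool:
    """Flag a release whose title matches the placeholder spam pattern."""
    title = getattr(obj, "title", None)
    return bool(title and _SPAM_RE.search(title))


def test_spam_title_is_flagged():
    assert is_spam_release(SimpleNamespace(title="[PDF] Free Book EPUB"))


def test_ordinary_title_is_not_flagged():
    release = SimpleNamespace(title="Protein folding dynamics in yeast")
    assert not is_spam_release(release)


def test_missing_title_is_not_flagged():
    assert not is_spam_release(SimpleNamespace(title=None))
```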