Automatically import retraction metadata

internetarchive / fatcat

Perpetual Access To The Scholarly Record

https://guide.fatcat.wiki


Automatically import retraction metadata #62

Open bnewbold opened 4 years ago

bnewbold commented 4 years ago

There is a database of retracted papers at: http://retractiondatabase.org/RetractionSearch.aspx?&AspxAutoDetectCookieSupport=1

It would be good to have a bot which periodically fetches updates, and then updates article metadata in fatcat appropriately.

hs2361 commented 4 years ago

I would like to work on this. Could you provide some more details? What mechanism can be used to fetch the data from their database? Their user guide clearly states that scraping the website is prohibited (https://retractionwatch.com/retraction-watch-database-user-guide/).

bnewbold commented 4 years ago

Ah, I didn't notice that. The services are on different domains, so I didn't realize they were the same project, but now I see the "User Guide" link. I guess the next step would be to find alternative sources of retraction metadata with persistent identifiers (e.g., DOI or PubMed identifier). Some sources I can think of are:

PubMed/MEDLINE itself (we already have a parser for this, and could update the import pipeline to allow "updates" to existing entries when the publication_stage does not match or has changed to "retracted")
publisher-specific corpora, like SciGraph
heuristics, like finding publications with the title "Retraction of TITLE", then finding the prior publication with the given title in the same journal ("container")
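The title heuristic in the last item could be sketched roughly as follows. This is only an illustration: the `Release` class is a hypothetical, simplified stand-in rather than the real fatcat schema, and the regular expression and `normalize` helper are assumptions about what a reasonable match might look like.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Release:
    """Hypothetical, simplified release record (not the real fatcat schema)."""
    ident: str
    title: str
    container: str


# Matches titles such as "Retraction of: Some Title" or 'Retraction of "Some Title"'.
RETRACTION_TITLE = re.compile(
    r"^\s*retraction\s+of\s*[:\-]?\s*(?P<title>.+?)\s*$", re.IGNORECASE
)


def normalize(title: str) -> str:
    """Lowercase and collapse punctuation so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def find_retracted(notice: Release, candidates: Iterable[Release]) -> Optional[Release]:
    """Return the candidate that `notice` appears to retract, or None.

    A candidate matches when it is in the same container as the notice and
    its normalized title equals the title quoted in the notice.
    """
    m = RETRACTION_TITLE.match(notice.title)
    if not m:
        return None
    wanted = normalize(m.group("title"))
    for rel in candidates:
        if rel.ident == notice.ident:
            continue  # never match the notice against itself
        if rel.container == notice.container and normalize(rel.title) == wanted:
            return rel
    return None
```

Exact title matching after normalization is deliberately strict. A real bot would probably want fuzzy matching plus a manual review step, since a false positive here marks a valid paper as retracted.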

bnewbold commented 3 years ago

Here is an open corpus of ~100k retractions: http://openretractions.com/

"we only know about retractions and other updates that publishers have properly reported to CrossRef or PubMed. That's currently 114596 papers."

I see only a couple thousand retracted "releases" in fatcat today. We do import from Crossref and PubMed, so in theory we should have comparable numbers. However, we don't run updates automatically yet, so if most of these retractions are from the past couple of years, we are probably missing them. There might also be bugs in our Crossref and PubMed importers. I don't think we have tests for that code path, so a good first contribution would be adding tests for both Crossref and PubMed retractions.
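As a starting point for such tests, here is a minimal sketch. It does not use fatcat's actual importer classes, which are not shown in this thread: `crossref_retracted_dois` and `pubmed_is_retracted` are hypothetical helpers, and both fixtures are hand-written rather than real records. It assumes that Crossref retraction notices carry an `update-to` list whose entries have `type` and `DOI` keys, and that PubMed marks retracted articles with the `Retracted Publication` publication type.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, List


def crossref_retracted_dois(work: Dict[str, Any]) -> List[str]:
    """Return the DOIs that this Crossref work says it retracts (hypothetical helper)."""
    return [
        u["DOI"].lower()
        for u in work.get("update-to", [])
        if u.get("type", "").lower() == "retraction" and "DOI" in u
    ]


def pubmed_is_retracted(article_xml: str) -> bool:
    """True if the record lists the 'Retracted Publication' type (hypothetical helper)."""
    root = ET.fromstring(article_xml)
    return any(
        (pt.text or "").strip() == "Retracted Publication"
        for pt in root.iter("PublicationType")
    )


# Hand-written fixtures with made-up identifiers; not real records.
CROSSREF_FIXTURE = {
    "DOI": "10.1234/example.notice",
    "title": ["Retraction of: An Example Article"],
    "update-to": [
        {"DOI": "10.1234/EXAMPLE.1", "type": "retraction"},
        {"DOI": "10.1234/example.2", "type": "correction"},
    ],
}

PUBMED_FIXTURE = """
<PubmedArticle>
  <MedlineCitation>
    <Article>
      <ArticleTitle>An Example Article</ArticleTitle>
      <PublicationTypeList>
        <PublicationType>Journal Article</PublicationType>
        <PublicationType>Retracted Publication</PublicationType>
      </PublicationTypeList>
    </Article>
  </MedlineCitation>
</PubmedArticle>
"""


def test_crossref_retraction() -> None:
    # Only the "retraction" update is reported; the correction is ignored.
    assert crossref_retracted_dois(CROSSREF_FIXTURE) == ["10.1234/example.1"]


def test_pubmed_retraction() -> None:
    assert pubmed_is_retracted(PUBMED_FIXTURE)
    # The same record without the retraction marker is not flagged.
    plain = PUBMED_FIXTURE.replace("Retracted Publication", "Review")
    assert not pubmed_is_retracted(plain)
```

The functions are named in the style pytest collects by default. Real tests would instead feed fixtures like these through the existing importers and assert on the resulting release stage.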