internetarchive / fatcat

Perpetual Access To The Scholarly Record
https://guide.fatcat.wiki

duplicates w/ fulltext #38

Open metasj opened 5 years ago

metasj commented 5 years ago

These seem to be 3 identical files with identical metadata: 3-dupes-plastic-factory

bnewbold commented 5 years ago

Thanks for the catch, and for filing an issue! These are all the same version of the same paper and should be merged into a single entity. Even if they were different versions, they would still need to be grouped under the same "work" entity.
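To make that relationship concrete, here is a rough sketch of how the work/release/file hierarchy fits together, as described above. The field names are illustrative only, not the actual fatcat schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FileEntity:
    # One concrete file (e.g. a crawled PDF), identified by its hash
    sha1: str
    url: str

@dataclass
class ReleaseEntity:
    # One published version of a paper (preprint, published version, etc.)
    title: str
    doi: Optional[str] = None          # many long-tail papers have no DOI
    files: List[FileEntity] = field(default_factory=list)

@dataclass
class WorkEntity:
    # Groups all releases (versions) of the same underlying paper
    releases: List[ReleaseEntity] = field(default_factory=list)
```

Under that picture, the three duplicates here are three separate ReleaseEntity records for one and the same version, so they should collapse into a single release; genuinely different versions would instead remain separate releases under one WorkEntity.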

Here are the three release entities and the search query:

Some more background and details:

What happened in this particular case is that I crawled a number of "long-tail" open access journals and inserted about 1.5 million release entities from that crawl without matching to an identifier (like a DOI), because most of these works don't have DOIs or other identifiers. Here's what Semantic Scholar and Google Scholar know about this paper (note: no identifier):

In this case, I crawled 3 near-identical PDFs, and created new release entities for each, so there are three copies.

I wasn't aware of this category of problem from this import, but I am aware of two related problems with the long-tail import: we don't have linked "container" (journal) metadata for these 1.5 million papers, and many of the papers are actually from larger OA publishers (e.g., PLOS) but got mixed in with smaller publishers on repository domains that got crawled. Here's an example of the latter category of error:

There are a few solutions to these categories of problems:

bnewbold commented 5 years ago

For this specific case of three duplicates, I merged the entities in https://fatcat.wiki/editgroup/shf64rgvgreqbm4dqekjx5d4cq
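For anyone curious what "merging" means mechanically: my understanding is that the losing entities end up redirecting to the surviving one, inside an editgroup, so old links keep resolving. A minimal conceptual sketch using plain data structures (not the real fatcat API):

```python
from typing import Dict, List

# Toy catalog: ident -> record; a record either holds data or redirects elsewhere
catalog: Dict[str, dict] = {
    "rel-aaa": {"title": "Plastic factory paper", "redirect": None},
    "rel-bbb": {"title": "Plastic factory paper", "redirect": None},
    "rel-ccc": {"title": "Plastic factory paper", "redirect": None},
}

def merge(duplicate_idents: List[str], target_ident: str) -> None:
    """Point each duplicate at the surviving entity instead of deleting it."""
    for ident in duplicate_idents:
        catalog[ident]["redirect"] = target_ident

def resolve(ident: str) -> dict:
    """Follow redirects until reaching a live record."""
    record = catalog[ident]
    while record["redirect"] is not None:
        record = catalog[record["redirect"]]
    return record

merge(["rel-bbb", "rel-ccc"], target_ident="rel-aaa")
assert resolve("rel-ccc") is catalog["rel-aaa"]
```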

metasj commented 5 years ago

Thanks for the detailed explanation and links! That really helps me visualize how changes propagate. (I still need to figure out grouping other than redirects.) If the PDFs were completely identical, might the duping still have happened?

bnewbold commented 5 years ago

If the PDFs are identical (we usually check with SHA-1), these failure modes shouldn't happen: the import scripts do a lookup before inserting.
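Roughly, that check looks like the sketch below. This is a minimal illustration, not the actual importer code; lookup_file_by_sha1() and create_file() are hypothetical names standing in for the real catalog client:

```python
import hashlib

def sha1_of_file(path: str) -> str:
    """Compute the SHA-1 hex digest of a local file (e.g. a crawled PDF)."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def import_pdf(path: str, catalog) -> str:
    """Insert a file entity only if no entity with the same SHA-1 exists."""
    sha1 = sha1_of_file(path)
    existing = catalog.lookup_file_by_sha1(sha1)   # hypothetical lookup call
    if existing is not None:
        return existing.ident                      # duplicate: reuse existing entity
    return catalog.create_file(sha1=sha1, source_path=path)
```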

As a fine-print detail, there are something like 20 duplicate file entities (duplicates of the same file) that slipped through due to a race condition during early bulk imports, and I haven't cleaned those up (merged the entities) yet.
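For what it's worth, my understanding of how that kind of race happens is the classic check-then-insert gap. A toy simulation (not the actual importer), assuming no uniqueness constraint on the hash:

```python
import threading
import time

catalog = []   # list of (sha1, ident) rows; no unique constraint on sha1

def import_file(sha1: str, ident: str) -> None:
    # Check-then-insert: the gap between the lookup and the append is the race window
    if not any(row[0] == sha1 for row in catalog):
        time.sleep(0.01)              # widen the window so the race is visible
        catalog.append((sha1, ident))

# Two bulk-import workers hit the same file at roughly the same time
workers = [
    threading.Thread(target=import_file, args=("abc123", "entity-1")),
    threading.Thread(target=import_file, args=("abc123", "entity-2")),
]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(catalog)   # usually two rows for the same sha1: a duplicate file entity
```

A unique index on the hash column (or doing the lookup and insert in one transaction) is the usual way to close that window.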