ourresearch / oadoi

The backend code that powers Unpaywall. support@unpaywall.org
http://unpaywall.org
MIT License

Potential false positives concerning updated articles #102

Closed jaanisoe closed 5 years ago

jaanisoe commented 6 years ago

For example, Unpaywall will return the following for the DOI 10.1107/S0907444905005883:

"oa_locations": [
  {
    "evidence": "oa repository (via OAI-PMH title and last author match)",
    "host_type": "repository",
    "is_best": true,
    "license": null,
    "pmh_id": "oai:pubmedcentral.nih.gov:168914",
    "updated": "2018-01-16T13:58:59.621944",
    "url": "http://europepmc.org/articles/pmc168914?pdf=render",
    "url_for_landing_page": "http://europepmc.org/articles/pmc168914",
    "url_for_pdf": "http://europepmc.org/articles/pmc168914?pdf=render",
    "version": "publishedVersion"
  }
],
"published_date": "2005-04-20",
"publisher": "International Union of Crystallography (IUCr)",
"title": "SSEP-2.0: Secondary Structural Elements of Proteins",

The problem is that the only article found in oa_locations, PMC168914, is actually titled "SSEP: secondary structural elements of proteins" and was published in 2003. This older version of the article has the DOI 10.1093/nar/gkg507, for which Unpaywall also (and correctly) returns "http://europepmc.org/articles/pmc168914" as one of the URLs in oa_locations.

I guess whether this is a bug depends on what we expect for a given DOI: exactly the corresponding article in PMC (or some other OA location), or whether it is also OK to return an older or updated version of the article.
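To make the stricter expectation concrete, here is a minimal sketch of an exact-match check on title and publication year, using the two SSEP records above. The helper names (`normalize_title`, `looks_like_same_article`) are hypothetical and this is not how Unpaywall actually matches records; it only illustrates why the two articles would not be treated as the same one under a strict rule.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase the title and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def looks_like_same_article(doi_title, doi_year, repo_title, repo_year):
    """Strict check (an assumption, not Unpaywall's rule): titles and years must both agree."""
    return (
        normalize_title(doi_title) == normalize_title(repo_title)
        and doi_year == repo_year
    )


# Metadata for DOI 10.1107/S0907444905005883 versus the PMC168914 record.
same = looks_like_same_article(
    "SSEP-2.0: Secondary Structural Elements of Proteins", 2005,
    "SSEP: secondary structural elements of proteins", 2003,
)
print(same)
```

Under this rule the "2.0" in the title and the differing year are each enough to reject the match; a looser rule that ignores version suffixes would accept it, which is the trade-off in question.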

Another interesting case is for DOI 10.1093/nar/gkr1080:

{
  "evidence": "oa repository (via OAI-PMH doi match)",
  "host_type": "repository",
  "is_best": false,
  "license": "cc-by-nc",
  "pmh_id": "oai:pubmedcentral.nih.gov:3245066",
  "updated": "2017-10-21T10:47:13.203014",
  "url": "http://europepmc.org/articles/pmc3245066?pdf=render",
  "url_for_landing_page": "http://europepmc.org/articles/pmc3245066",
  "url_for_pdf": "http://europepmc.org/articles/pmc3245066?pdf=render",
  "version": "publishedVersion"
},
{
  "evidence": "oa repository (via OAI-PMH doi match)",
  "host_type": "repository",
  "is_best": false,
  "license": "implied-oa",
  "pmh_id": "oai:pubmedcentral.nih.gov:540057",
  "updated": "2017-10-21T11:40:24.572956",
  "url": "http://europepmc.org/articles/pmc540057?pdf=render",
  "url_for_landing_page": "http://europepmc.org/articles/pmc540057",
  "url_for_pdf": "http://europepmc.org/articles/pmc540057?pdf=render",
  "version": "publishedVersion"
},
...
{
  "evidence": "oa repository (via OAI-PMH doi match)",
  "host_type": "repository",
  "is_best": false,
  "license": null,
  "pmh_id": "oai:CiteSeerX.psu:10.1.1.106.4896",
  "updated": "2017-10-21T10:27:37.633823",
  "url": "http://nar.oxfordjournals.org/cgi/reprint/33/suppl_1/D91.pdf",
  "url_for_landing_page": "http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.4896",
  "url_for_pdf": "http://nar.oxfordjournals.org/cgi/reprint/33/suppl_1/D91.pdf",
  "version": "submittedVersion"
}

We get two different PMC articles for the given DOI, but of course only one can be an exact match. The DOI refers to the "MAPPER2 Database" article from 2011, for which PMC3245066 is the correct match. The other returned article, PMC540057, is about the "MAPPER database" from 2004 and actually corresponds to the older DOI 10.1093/nar/gki103. For this older DOI, Unpaywall correctly returns PMC540057 and does not return PMC3245066, which belongs to the newer DOI. The CiteSeerX entry "http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.4896" returned for the newer DOI 10.1093/nar/gkr1080 is also about the older "MAPPER database" article from 2004/2005.
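A response like this can be spotted mechanically. The following is a rough sketch, assuming only the `pmh_id` format visible in the excerpts above; the function name `pmc_ids` and the trimmed-down `locations` list are made up for illustration.

```python
PMC_PREFIX = "oai:pubmedcentral.nih.gov:"


def pmc_ids(oa_locations):
    """Return the distinct PMC IDs found in a list of oa_locations, in order."""
    ids = []
    for loc in oa_locations:
        pmh_id = loc.get("pmh_id") or ""  # pmh_id may be missing or null
        if pmh_id.startswith(PMC_PREFIX):
            pmcid = "PMC" + pmh_id[len(PMC_PREFIX):]
            if pmcid not in ids:
                ids.append(pmcid)
    return ids


# Only the pmh_id fields from the response for DOI 10.1093/nar/gkr1080.
locations = [
    {"pmh_id": "oai:pubmedcentral.nih.gov:3245066"},
    {"pmh_id": "oai:pubmedcentral.nih.gov:540057"},
    {"pmh_id": "oai:CiteSeerX.psu:10.1.1.106.4896"},
]

ids = pmc_ids(locations)
ambiguous = len(ids) > 1  # more than one PMC article for a single DOI
print(ids, ambiguous)
```

More than one distinct PMC ID for a single DOI does not say which record is wrong, only that at least one of them cannot be an exact match.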

Similarly, Unpaywall returns two different PMC articles for the DOI 10.1093/nar/gkq1237. There is a similar mix-up for the older versions of that article as well: 10.1093/nar/gkl993 and 10.1093/nar/gki031.

Another, somewhat different case: 10.1038/npre.2009.3322.1 and 10.1038/npre.2011.6505.1 are different versions of the same poster (the first from 2009 and the second from 2011). For both DOIs, Unpaywall returns both versions in oa_locations.

A common trait of all the URLs in oa_locations that are not exactly about the queried article seems to be that their evidence comes from an OAI-PMH match: "oa repository (via OAI-PMH title and last author match)", "oa repository (via OAI-PMH title and first author match)" and "oa repository (via OAI-PMH doi match)".

richard-orr commented 5 years ago

Moved to https://support.unpaywall.org/public/tickets/0f2565cf08e6bdf43606359fb37237aeff0cc4ca8da1648956d9be4ee27e88e6