Getting PDF from identifiers

kermitt2 commented 6 years ago

To exploit the annotations, we need to be able to get the same version of the PDF that the annotators used.

If I am not mistaken, the only information available for doing this right now is the identifier provided with the article attributes. For PMC, there is no problem, because we can unambiguously find the corresponding PDF URL and everything is well archived/preserved at NIH for a couple of millennia.

With a DOI, we have several issues:

Using Unpaywall, the Open Access PDF URL that we get can lead to a PDF, but there is no guarantee that this PDF is the same as the one used by the annotator, and no guarantee that the same version will be accessible in the future.

For example, for DOI 10.1007/s00148-011-0355-y, Unpaywall gives the Springer Open Access version: https://link.springer.com/content/pdf/10.1007%2Fs00148-011-0355-y.pdf The preprint on the OA repositories (for instance the version linked via https://econpapers.repec.org/paper/nbrnberwo/14900.htm) is a different version: https://www.nber.org/papers/w14900.pdf

The Open Access PDF identified by Unpaywall for a DOI is not always reliable over time. For instance, this DOI (10.1007/bf00163432) is associated with an Open Access PDF via Unpaywall, but the URL now leads to a paid version: https://link.springer.com/content/pdf/10.1007%2FBF00163432.pdf

Some DOIs are not associated with an Open Access PDF by Unpaywall. In this case, we cannot access the PDF automatically, and it might be copyrighted, so we cannot exploit the annotations (an ML model would be a derived product under copyright too).

Example: 10.1257/089533002320951064 https://api.unpaywall.org/v2/10.1257/089533002320951064?email=patrice.lopez@science-miner.com -> no OA PDF

This is significant: 21 DOIs are currently not open access according to Unpaywall.
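The check described above can be sketched as follows. This is a minimal sketch, not the exact script used for the counts in this thread: the helper names (`unpaywall_url`, `fetch_record`, `oa_pdf_url`) are hypothetical, and it assumes the Unpaywall v2 response exposes `is_oa` and `best_oa_location.url_for_pdf`.

```python
import json
from typing import Optional
from urllib.parse import quote
from urllib.request import urlopen

UNPAYWALL_API = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi: str, email: str) -> str:
    """Build the Unpaywall v2 lookup URL for a DOI (hypothetical helper)."""
    # Keep "/" unescaped because the DOI is part of the URL path.
    return f"{UNPAYWALL_API}{quote(doi, safe='/')}?email={quote(email, safe='@')}"


def fetch_record(doi: str, email: str) -> dict:
    """Fetch the Unpaywall record for a DOI (needs network access)."""
    with urlopen(unpaywall_url(doi, email), timeout=30) as response:
        return json.load(response)


def oa_pdf_url(record: dict) -> Optional[str]:
    """Return the best Open Access PDF URL of a record, or None."""
    # Assumption: `is_oa` and `best_oa_location` follow the v2 data format.
    if not record.get("is_oa"):
        return None
    location = record.get("best_oa_location") or {}
    return location.get("url_for_pdf")
```

Calling `oa_pdf_url(fetch_record(doi, email))` for each DOI and collecting the `None` results gives the list of DOIs without an Open Access PDF. A non-`None` result only says that Unpaywall knows a PDF URL; it does not say that the URL still serves the version the annotators used.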

There are also non-DOI identifiers: a2001-35-NAT_BIOTECHNOL, a2010-05-BMC_MOL_BIOL, a2010-05-BMC_MOL_BIOL, a2010-05-BMC_MOL_BIOL. For these, there is no standard, stable, automatic way to download the PDFs.

kermitt2 commented 6 years ago

To solve the issue, what about:

keeping track of the original URL of the PDF in the dataset
preserving a version in an AWS S3 space?
ensuring the Open Access status of the annotated documents, based on Unpaywall, as a minimal requirement
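The first two points could be combined into a small manifest kept next to the dataset. The following is only a sketch under assumptions: the field names and the helpers `manifest_entry` and `same_version` are hypothetical, and the upload of the bytes to S3 (or any other archive) is not shown.

```python
import hashlib
from datetime import datetime, timezone


def manifest_entry(identifier: str, source_url: str, pdf_bytes: bytes) -> dict:
    """Describe one archived PDF: where it came from and what it contained."""
    return {
        "identifier": identifier,   # DOI, PMC ID, or local identifier
        "source_url": source_url,   # original URL the PDF was fetched from
        "sha256": hashlib.sha256(pdf_bytes).hexdigest(),
        "size_bytes": len(pdf_bytes),
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
    }


def same_version(entry: dict, pdf_bytes: bytes) -> bool:
    """Check whether a later download is byte-identical to the archived PDF."""
    return entry["sha256"] == hashlib.sha256(pdf_bytes).hexdigest()
```

With the checksum stored, anyone re-downloading a PDF from the recorded URL can tell whether they got the file the annotators worked on, which is the guarantee a bare DOI does not give.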

jameshowison commented 6 years ago

Hi Patrice,

Sorry, I didn't understand this question at first. I have all the PDFs that the annotators used; I just haven't made that repository public. Sorry for the extra work here (although it is certainly important for when we release the dataset, I don't know if we can release the PDFs with it). I have added you to that repo, and I hope it is what was needed.

--J


-- James Howison

Associate Professor and Director of Doctoral Studies School of Information University of Texas at Austin http://james.howison.name

kermitt2 commented 6 years ago

Thanks a lot! Having the original PDF will save me time for sure.

We can release the PDFs with the dataset if the publications are CC-0 or CC-BY, so in general the green Open Access versions.

There are different cases to distinguish, but if the goal is to release an open dataset that can be reused in a stable manner over time, the corresponding PDFs have to be well identified, accessible and legally reusable.

The main issue is that if we have copyrighted PDFs, we cannot release them with the dataset, but we also cannot use them for training, so the annotations are not exploitable, which is a bit of a pity.

That's why I am raising these issues. Probably the simplest solution would be to restrict the set of PDFs to green open access publications that have a stable, preserved version on a main preprint archive.
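One possible way to apply such a restriction to Unpaywall records is sketched below. It is an illustration under assumptions, not a legal check: it assumes each entry of `oa_locations` carries `host_type`, `license` and `url_for_pdf`, the exact license strings in `REUSABLE_LICENSES` are a guess at how CC-0 and CC-BY are spelled in the data, and the helper name is hypothetical.

```python
from typing import Optional

# Assumption: licenses treated as redistributable, spelled as in Unpaywall data.
REUSABLE_LICENSES = {"cc0", "cc-by"}


def reusable_repository_location(record: dict) -> Optional[dict]:
    """Return the first repository-hosted location with a reusable license."""
    for location in record.get("oa_locations") or []:
        license_name = (location.get("license") or "").lower()
        if (
            location.get("host_type") == "repository"  # green OA copy
            and license_name in REUSABLE_LICENSES
            and location.get("url_for_pdf")
        ):
            return location
    return None
```

Documents for which this returns `None` would be left out of the released set. The license and the hosting type are tested separately on purpose, since a repository copy does not always carry a CC-0 or CC-BY license.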

jameshowison commented 6 years ago

That makes sense to me. I thought we'd talked about that with Jason and Heather and that the articles from Unpaywall were all green open access? Is there a straightforward way to find out? I can definitely avoid coding any more that aren't green open access.

--J


kermitt2 commented 6 years ago

I ran a new check, and here is the current list of DOIs that are not Open Access according to Unpaywall (I am using their web service):

No Open Access PDF found via Unpaywall for DOI: 10.1080/17421772.2011.647058
No Open Access PDF found via Unpaywall for DOI: 10.1002/ijfe.1565
No Open Access PDF found via Unpaywall for DOI: 10.1080/00036846.2016.1218430
No Open Access PDF found via Unpaywall for DOI: 10.1111/jors.12246
No Open Access PDF found via Unpaywall for DOI: 10.1111/j.1467-9957.2008.01084.x
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.14.1.109
No Open Access PDF found via Unpaywall for DOI: 10.1111/j.1467-9701.2007.01019.x
No Open Access PDF found via Unpaywall for DOI: 10.1002/soej.12180
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.25.3.83
No Open Access PDF found via Unpaywall for DOI: 10.1111/cwe.12158
No Open Access PDF found via Unpaywall for DOI: 10.1111/j.1468-0297.2008.02177.x
No Open Access PDF found via Unpaywall for DOI: 10.3846/1611-1699.2009.10.279-289
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.8.2.117
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.13.1.181
No Open Access PDF found via Unpaywall for DOI: 10.1002/pam.21962
No Open Access PDF found via Unpaywall for DOI: 10.1111/1468-0106.12204
No Open Access PDF found via Unpaywall for DOI: 10.3846/jbem.2010.13
No Open Access PDF found via Unpaywall for DOI: 10.1108/jes-04-2014-0055
No Open Access PDF found via Unpaywall for DOI: 10.3846/jbem.2010.30
No Open Access PDF found via Unpaywall for DOI: 10.3846/jbem.2010.20
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.30.2.201
No Open Access PDF found via Unpaywall for DOI: 10.1111/j.1468-0351.2009.00342.x
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.31.2.211
No Open Access PDF found via Unpaywall for DOI: 10.1108/jes-01-2015-0013
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.30.1.77
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.2.1.153
No Open Access PDF found via Unpaywall for DOI: 10.1257/089533002320951064

jameshowison commented 5 years ago

@jasonpriem could you take a look here? Seems that some of the DOIs that came from the lists you pulled from unpaywall aren't actually Open Access? I am about to swap over to astro articles and it would be good to avoid similar issues there?

@kermitt2 Could you use the same approach to check the astro articles here: https://github.com/howisonlab/softcite-pdf-files/blob/master/docs/pdf-files/astronomy_pdf_files/journal_articles_astronomy_random_5000_dois_with_pdf_links.csv

kermitt2 commented 5 years ago

Sorry for taking so long to analyse this list of papers! It was a bit more complicated than I thought. Here are the results:

In this list, 100% of the DOIs are considered OA by the latest Unpaywall data dump (snapshot from last September).
However, I failed to download the Open Access resource for 986 out of 5000 entries (about 20%) with my dedicated harvester (https://github.com/kermitt2/biblio-glutton-harvester which supports redirections, multiple retries, etc. quite well). This is high compared to my usual failure rate for Unpaywall (around 4%).
Out of the 4014 successful DOIs, only 1778 are actual, correct PDFs; the rest are abstracts or full texts in HTML. Apparently, there is an issue with the PDF link via the ADS server: the "url_for_pdf" field actually points to the ADS landing page. So it's a problem specific to astronomy.
Among these 1778 DOIs, there are still quite a few documents that are just an abstract or a very short communication (less than one page), but I don't really have a reliable way to detect them...

You'll find attached here these 1778 successful DOIs with their Open Access links.
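For separating real PDFs from HTML landing pages after download, a simple content sniff is one option. This is a sketch, not the check used by the harvester above: the function name is hypothetical, and it relies on the assumption that a PDF file carries the `%PDF-` marker near its beginning (here, within the first 1024 bytes).

```python
def classify_download(content: bytes) -> str:
    """Roughly classify downloaded bytes as "pdf", "html" or "other"."""
    head = content[:1024]
    # HTML landing pages typically start with a doctype or an <html> tag.
    if head.lstrip().lower().startswith((b"<!doctype html", b"<html")):
        return "html"
    # PDF files carry a "%PDF-" marker at or near the beginning.
    if b"%PDF-" in head:
        return "pdf"
    return "other"
```

This only looks at the payload, so it still works when the server sends a misleading Content-Type header. It does not help with the last point above: spotting abstracts or one-page communications would need a page or word count from a PDF parser, which is not covered here.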

howisonlab / softcite-dataset

Getting PDF from identifiers #560
