howisonlab / softcite-dataset

A gold-standard dataset of software mentions in research publications.

Getting PDF from identifiers #560

Open kermitt2 opened 6 years ago

kermitt2 commented 6 years ago

To exploit the annotations, we need to be able to get the same version of the PDF that the annotators used.

If I am not wrong, the only information for doing this right now is the identifier provided with the article attributes. For PMC, there is no problem because we can unambiguously find the corresponding PDF URL and everything is well archived/preserved at NIH for a couple of millennia.

With a DOI, we have several issues:

For example, for DOI 10.1007/s00148-011-0355-y, Unpaywall will give the Springer Open Access version: https://link.springer.com/content/pdf/10.1007%2Fs00148-011-0355-y.pdf whereas the preprint on the OA repositories (for instance the one linked via https://econpapers.repec.org/paper/nbrnberwo/14900.htm) is a different version: https://www.nber.org/papers/w14900.pdf

Example: 10.1257/089533002320951064 https://api.unpaywall.org/v2/10.1257/089533002320951064?email=patrice.lopez@science-miner.com -> no OA PDF

This is significant: 21 DOIs are currently not open access according to Unpaywall.
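The lookup described above can be sketched as follows. This is a minimal illustration, not the script actually used in this thread: the function names are hypothetical, and it assumes the Unpaywall v2 response carries the `is_oa` and `best_oa_location.url_for_pdf` fields. The email address is a placeholder to replace with your own.

```python
import json
import urllib.parse
import urllib.request

UNPAYWALL_API = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi, email):
    """Build the Unpaywall v2 lookup URL for a DOI (hypothetical helper)."""
    query = urllib.parse.urlencode({"email": email})
    # quote() keeps "/" by default, so the DOI prefix/suffix separator survives.
    return UNPAYWALL_API + urllib.parse.quote(doi) + "?" + query


def fetch_record(doi, email, timeout=30):
    """Fetch the JSON record for a DOI. This performs a network request."""
    with urllib.request.urlopen(unpaywall_url(doi, email), timeout=timeout) as resp:
        return json.load(resp)


def oa_pdf_url(record):
    """Return the best OA PDF URL from an Unpaywall record, or None."""
    if not record.get("is_oa"):
        return None
    best = record.get("best_oa_location") or {}
    return best.get("url_for_pdf")


def report(doi, record):
    """Format one result line for a DOI."""
    url = oa_pdf_url(record)
    if url is None:
        return f"No Open Access PDF found via Unpaywall for DOI: {doi}"
    return f"{doi} -> {url}"
```

Calling `report(doi, fetch_record(doi, "you@example.org"))` for each DOI in the dataset would list the ones without an OA PDF. Keeping the parsing separate from the network call makes it possible to re-check saved responses later.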

There are also non-DOI identifiers: a2001-35-NAT_BIOTECHNOL, a2010-05-BMC_MOL_BIOL, a2010-05-BMC_MOL_BIOL, a2010-05-BMC_MOL_BIOL. For these, there is no standard, stable, automatic way to download the PDFs.

kermitt2 commented 6 years ago

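To solve the issue, what about:

- keeping track of the original URL of the PDF in the dataset
- preserving a version on an AWS S3 space?
- ensuring the Open Access status of the annotated documents based on Unpaywall as a minimal requirement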

jameshowison commented 6 years ago

Hi Patrice,

Sorry, I didn't understand this question at first. I have all the PDFs that the annotators used, I just haven't made that repository public. Sorry for extra work here (although it is certainly important for when we release the dataset, I don't know if we can release the PDFs with it). I have added you to that repo, I hope it is what was needed.

--J

On Fri, Oct 19, 2018 at 4:03 PM Patrice Lopez notifications@github.com wrote:

To solve the issue, what about:

- keeping track of the original URL of the PDF in the dataset
- preserving a version on an AWS S3 space?
- ensuring the Open Access status of the annotated documents based on Unpaywall as a minimal requirement
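One possible way to keep track of the original URL and pin the exact PDF version is to store a checksum next to each identifier. The sketch below is only an illustration under that assumption: `PdfRecord`, `make_record` and `same_version` are hypothetical names, and nothing in the dataset currently uses this layout.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class PdfRecord:
    identifier: str  # DOI, PMC ID, or other identifier from the dataset
    source_url: str  # URL the PDF was originally downloaded from
    sha256: str      # checksum of the exact bytes the annotators saw


def make_record(identifier, source_url, pdf_bytes):
    """Create a manifest entry for one annotated PDF."""
    return PdfRecord(identifier, source_url, hashlib.sha256(pdf_bytes).hexdigest())


def same_version(record, pdf_bytes):
    """Check whether a freshly downloaded PDF matches the annotated one."""
    return hashlib.sha256(pdf_bytes).hexdigest() == record.sha256
```

With such a manifest, anyone re-downloading a PDF from its URL (or from a preserved copy) can verify that it is byte-for-byte the version that was annotated.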


-- James Howison

Associate Professor and Director of Doctoral Studies School of Information University of Texas at Austin http://james.howison.name

kermitt2 commented 6 years ago

Thanks a lot! Having the original PDF will save me time for sure.

We can release the PDFs with the dataset if the publications are CC-0 or CC-BY, so in general the green Open Access versions.

There are different cases to distinguish, but if the goal is to release a dataset that is open and can be reused in a stable manner over time, the corresponding PDFs have to be well identified, accessible and legally reusable.

The main issue is that if we have copyrighted PDFs, we cannot release them with the dataset, but we also cannot use them for training, so the annotations are not exploitable, which is a bit of a pity.

That's why I raise these issues, and probably the simplest solution would be to restrict the set of PDFs to green open access publications having a stable preserved version on a main preprint archive.
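A filter along these lines could be sketched as below. It assumes an Unpaywall v2 record whose `oa_locations` entries carry `host_type`, `license` and `url_for_pdf` fields; the function name is hypothetical and the license spellings in `REUSABLE_LICENSES` are an assumption to verify against real responses.

```python
# Assumed spellings of the license values; check against actual Unpaywall output.
REUSABLE_LICENSES = {"cc0", "cc-by"}


def reusable_repository_locations(record):
    """Return repository-hosted OA locations with a reusable license and a PDF URL."""
    locations = record.get("oa_locations") or []
    return [
        loc
        for loc in locations
        if loc.get("host_type") == "repository"  # green OA: hosted on a repository
        and (loc.get("license") or "").lower() in REUSABLE_LICENSES
        and loc.get("url_for_pdf")
    ]
```

An article would then be kept only if this list is non-empty. Note that many repository copies carry no explicit license, so this filter is deliberately conservative.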

jameshowison commented 6 years ago

That makes sense to me. I thought we'd talked about that with Jason and Heather and that the articles from unpaywall were all green open access? Is there a straightforward way to find out? I can definitely avoid coding any more that aren't green open access.

--J


kermitt2 commented 6 years ago

I ran a new check, and here is the current list of DOIs which are not Open Access according to Unpaywall (I am using their web service):

No Open Access PDF found via Unpaywall for DOI: 10.1080/17421772.2011.647058
No Open Access PDF found via Unpaywall for DOI: 10.1002/ijfe.1565
No Open Access PDF found via Unpaywall for DOI: 10.1080/00036846.2016.1218430
No Open Access PDF found via Unpaywall for DOI: 10.1111/jors.12246
No Open Access PDF found via Unpaywall for DOI: 10.1111/j.1467-9957.2008.01084.x
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.14.1.109
No Open Access PDF found via Unpaywall for DOI: 10.1111/j.1467-9701.2007.01019.x
No Open Access PDF found via Unpaywall for DOI: 10.1002/soej.12180
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.25.3.83
No Open Access PDF found via Unpaywall for DOI: 10.1111/cwe.12158
No Open Access PDF found via Unpaywall for DOI: 10.1111/j.1468-0297.2008.02177.x
No Open Access PDF found via Unpaywall for DOI: 10.3846/1611-1699.2009.10.279-289
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.8.2.117
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.13.1.181
No Open Access PDF found via Unpaywall for DOI: 10.1002/pam.21962
No Open Access PDF found via Unpaywall for DOI: 10.1111/1468-0106.12204
No Open Access PDF found via Unpaywall for DOI: 10.3846/jbem.2010.13
No Open Access PDF found via Unpaywall for DOI: 10.1108/jes-04-2014-0055
No Open Access PDF found via Unpaywall for DOI: 10.3846/jbem.2010.30
No Open Access PDF found via Unpaywall for DOI: 10.3846/jbem.2010.20
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.30.2.201
No Open Access PDF found via Unpaywall for DOI: 10.1111/j.1468-0351.2009.00342.x
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.31.2.211
No Open Access PDF found via Unpaywall for DOI: 10.1108/jes-01-2015-0013
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.30.1.77
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.2.1.153
No Open Access PDF found via Unpaywall for DOI: 10.1257/089533002320951064

jameshowison commented 5 years ago

@jasonpriem could you take a look here? Seems that some of the DOIs that came from the lists you pulled from unpaywall aren't actually Open Access? I am about to swap over to astro articles and it would be good to avoid similar issues there?

@kermitt2 Could you use the same approach to check the astro articles here: https://github.com/howisonlab/softcite-pdf-files/blob/master/docs/pdf-files/astronomy_pdf_files/journal_articles_astronomy_random_5000_dois_with_pdf_links.csv

kermitt2 commented 5 years ago

Sorry for taking so long to analyse this list of papers! It was a bit more complicated than I thought, here are the results:

You'll find attached here these 1778 successful DOIs with their Open Access links.