Capture DOIs and other metadata from text of PDFs

robertknight commented 8 years ago


Fields	Your Response
Link to Zendesk ticket:	https://groups.google.com/a/list.hypothes.is/forum/?utm_medium=email&utm_source=footer#!msg/dev/fXPfcQLAwuA/ZW6S07tgDgAJ
User Name/Company:	Austrian Academy of Sciences
Is this from a client (paid user):	No
What problem is the user trying to solve?	"I'm currently searching for a solution to annotate a Digital Object Identifier (DOI) for the Austrian Academie of Science." The need, as I understand it, is for the user to annotate one version of a paper and have the annotations appear when viewing a different version of the paper, provided it has the same DOI. It also sounds as if the user may want to search for annotations based on the DOI of the paper.
What is the feature the user is requesting?	They want a way to associate annotations made on PDFs with the DOI associated with the PDF

When annotating a web page, the Hypothesis client captures metadata from the page and includes it with the annotation saved to the service. The service uses this information to establish whether two different URLs refer to the same content. We can also use it to let the user search for annotations based on the metadata of the annotated document (e.g. title, author, identifier).

The request here is to capture similar information from PDFs.
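
For reference, here is a minimal sketch of the HTML-side capture described above, written in Python for brevity rather than taken from the Hypothesis client. The meta tag names (`citation_doi`, `dc.identifier`, `prism.doi`) are assumptions based on common publisher conventions, and `extract_identifiers` is a hypothetical helper, not an existing API.

```python
from html.parser import HTMLParser


class IdentifierParser(HTMLParser):
    """Collect DOI-like identifiers from <meta> tags in an HTML page."""

    # Assumed tag names; publishers vary in which ones they emit.
    DOI_META_NAMES = {"citation_doi", "dc.identifier", "prism.doi"}

    def __init__(self):
        super().__init__()
        self.identifiers = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        content = attrs.get("content")
        if name in self.DOI_META_NAMES and content:
            self.identifiers.append(content.strip())


def extract_identifiers(html):
    """Return the identifier strings found in the page's meta tags."""
    parser = IdentifierParser()
    parser.feed(html)
    return parser.identifiers
```

A PDF has no `<meta>` tags, so the equivalent step there would have to read the identifier from the PDF's own metadata or from its text.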

klemay commented 6 years ago

Here's a use case from CUP to illustrate why this would be helpful. Zendesk ticket: https://hypothesis.zendesk.com/agent/tickets/2732

Summary

In the Cambridge Core/ATI restricted group, there were annotations on the PDF version of an article called "Making the Real: Rhetorical Adduction and the Bangladesh Liberation War" by Joseph O'Mahoney that were not showing up on the HTML version.

Troubleshooting

I made an annotation in the Public layer on the PDF version and another annotation on the HTML version. Neither annotation showed on the other version.

I found my annotations in Metabase and saw that the Document IDs did not match.

Solution

@robertknight explained that we can establish an equivalency by linking to the PDF version in the <head> of the HTML version, and then annotating the HTML version to create an entry in our database establishing the document equivalency. So in this case, Cambridge needed to add the following to the HTML version:

<link rel="alternate" href="https://www.cambridge.org/core/services/aop-cambridge-core/content/view/D7396F6DFDE0914CD3C1C8D7A7141BF9/S0020818317000054a.pdf/making_the_real_rhetorical_adduction_and_the_bangladesh_liberation_war.pdf">

They'll need to do this for every article annotated by the Cambridge Core/ATI group, which is onerous on their end. If we could grab the DOI, which is present in the metadata of the HTML version and in the text of the PDF version, the equivalence could be established automatically.
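
To illustrate the idea, here is a rough sketch of how a shared DOI could link the two versions. This is not how the Hypothesis service stores document equivalence; `normalize_doi` and `group_by_doi` are hypothetical helpers, and the URLs and DOIs in the usage example are made up.

```python
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(value):
    """Reduce common DOI spellings to a lowercase 'doi:10.x/y' key."""
    value = value.strip()
    lowered = value.lower()
    for prefix in DOI_PREFIXES:
        if lowered.startswith(prefix):
            value = value[len(prefix):]
            break
    # DOIs are case-insensitive, so lowercase them before comparing.
    return "doi:" + value.strip().lower()


def group_by_doi(documents):
    """Map each normalized DOI to the URLs that reported it.

    `documents` maps a URL to the list of DOI strings captured for it.
    """
    groups = {}
    for url, dois in documents.items():
        for doi in dois:
            groups.setdefault(normalize_doi(doi), []).append(url)
    return groups
```

```python
groups = group_by_doi({
    "https://example.org/article.html": ["doi:10.1234/Example.5678"],
    "https://example.org/article.pdf": ["https://doi.org/10.1234/example.5678"],
})
# Both URLs end up under the single key "doi:10.1234/example.5678".
```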

dwhly commented 6 years ago

Just a note here that there is a well-known library called CERMINE, here: https://github.com/CeON/CERMINE

This library can (among other things) read DOIs printed in PDFs in places that are likely to indicate that it is the DOI of the article itself (vs. the DOI of another, cited article). I think they look for DOIs printed vertically in the spine, in the upper half of the first page, and things like that. If the DOI is where it normally is, it apparently has a fairly high success rate. All hearsay, of course.

At some point we might want to experiment with something like this.
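
As a starting point for such an experiment, here is a very rough sketch of a position-based heuristic. It is not CERMINE's algorithm, which I have not verified. It assumes the first page's text has already been extracted as (text, y) pairs, and the scoring weights are arbitrary. `find_article_doi` is a hypothetical name, and DOIs printed vertically in the spine are not handled.

```python
import re

# A DOI-like string: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
# Text immediately before the DOI that marks it as an explicit label.
LABEL_PATTERN = re.compile(r'(doi\s*:?\s*|doi\.org/)$', re.IGNORECASE)


def find_article_doi(first_page_items):
    """Guess the article's own DOI from first-page text items.

    `first_page_items` is a list of (text, y) pairs, where y is the item's
    vertical position as a fraction of page height (0.0 = top, 1.0 = bottom).
    Returns the best-scoring DOI, or None if no DOI-like string is found.
    """
    best, best_score = None, -1
    for text, y in first_page_items:
        for match in DOI_PATTERN.finditer(text):
            doi = match.group(0).rstrip(".,;)")  # drop trailing punctuation
            score = 0
            if LABEL_PATTERN.search(text[:match.start()]):
                score += 2  # labelled "doi:" or given as a doi.org URL
            if y < 0.5:
                score += 1  # printed in the upper half of the page
            if score > best_score:
                best, best_score = doi, score
    return best
```

Ties go to the first candidate seen, so items should be passed in reading order.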

klemay commented 6 years ago

hypothesis / product-backlog