kermitt2 / grobid

Machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0
3.5k stars · 449 forks

PDF reference annotation: missing clickable references #567

Open DrJimFan opened 4 years ago

DrJimFan commented 4 years ago

I'm using the "PDF reference annotation" demo from grobid-services, v0.5.6. I notice that Grobid can sometimes miss many clickable references in arXiv papers. For example, in https://arxiv.org/pdf/2003.14210.pdf, Grobid identifies the citations in the second paragraph correctly but misses every occurrence in the first paragraph. In this PDF, all inline citations have hyperlinks to the corresponding reference entry.


Another example: https://arxiv.org/abs/2003.04664


Intuitively, these documents should be easier to analyze because all the inline citations are labeled with hyperlinks. However, Grobid doesn't seem to take advantage of this fact. Could you please kindly advise on how to fix?

kermitt2 commented 4 years ago

Hello @LinxiFan

With the current master version (0.6.0-SNAPSHOT), the results are already better:

(screenshot, 2020-04-12)

The quality of reference-marker identification (the "reference callouts") comes from the fulltext model, and unfortunately there is very little training data for this model: a bit more than 30 annotated documents (it actually works quite well with so little training data, I think). Still, I am surprised that the pattern "Chess [Silver et al., 2017a], Go [Silver et al., 2017b]" is not captured.

The way to fix that is to add some additional examples of relevant training data, for instance from this document. The documentation explains how to generate pre-annotated training data and correct it:
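From memory of the GROBID documentation, generating pre-annotated training data goes through the batch mode of the grobid-core one-jar. The jar name, version, and all paths below are placeholders, so check the documentation for the exact invocation:

```shell
# Hedged sketch: GROBID batch mode to create pre-annotated training data.
# Jar name/version and all paths are placeholders; verify against the docs.
java -Xmx4G -jar grobid-core-0.6.0-SNAPSHOT.onejar.jar \
     -gH /path/to/grobid-home \
     -dIn /path/to/input-pdfs \
     -dOut /path/to/output-training \
     -exe createTraining
```

The generated TEI files can then be corrected by hand and contributed back as additional training examples.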

Of course, I am happy to review any additional training data and add it to the public repo for the benefit of everybody!

DrJimFan commented 4 years ago

Hi Patrice, thank you so much for your timely response, and very helpful comments!

I'm also really surprised that the model works with only 30 documents of training data. I have a few thoughts:

  1. I'm focusing on arXiv, which is quickly becoming the single largest source of computer science papers. The good news is that most arXiv papers have LaTeX source files, which can provide perfect annotations automatically. This is already operationalized by arxiv-vanity, which renders arXiv papers as HTML pages with perfect inline citation pop-ups. We are talking about tens of thousands, if not hundreds of thousands, of documents with LaTeX source in the arXiv CS section alone. I'm happy to contribute in my spare time, but would love to hear what you think first.

  2. Instead of using NLP to process the text and extract the citations, can we take advantage of the embedded links in the PDF? Every citation in the two examples above is clickable. I believe even some hand-crafted rules that rely on the in-text hyperlinks would provide highly accurate extractions for arXiv. I'm not familiar with PDF processing, so please correct me if I'm wrong.

  3. Are there any forums (e.g. Slack, Discord) that I can join to discuss development?

Thank you again for creating this wonderful tool for the community.

kermitt2 commented 4 years ago

Thank you @LinxiFan for the feedback and these ideas.


I'm also really surprised that the model works with only 30 documents of training data.

For the recognition of the reference callouts, PMC articles have on average about 70 citation contexts each. So for 30 articles, even if they are not all from PMC, that is still in the range of 1,500 training citation contexts, I think; not bad for training a model, because we chose a variety of citation callout styles. It's more difficult right now with other, less frequent structures like figures, tables, or formulas.


  1. I'm focusing on arXiv, which is quickly becoming the single largest source of computer science papers. The good news is that most arXiv papers have LaTeX source files, which can provide perfect annotations automatically. This is already operationalized by arxiv-vanity, which renders arXiv papers as HTML pages with perfect inline citation pop-ups. We are talking about tens of thousands, if not hundreds of thousands, of documents with LaTeX source in the arXiv CS section alone. I'm happy to contribute in my spare time, but would love to hear what you think first.

Indeed, the two sources of CC0/CC-BY potential training data that are natural to consider for the Grobid tasks are PMC XML and LaTeX sources from arXiv. In both cases, I didn't find a way to take advantage of this data, as compared to manually annotating as few as 30 documents from scratch. One first issue is that inline citations are only one structure among several that need to be annotated, so even if we can automate the inline-citation annotation, the other structures are still missing: we need to complete the annotations manually for all the documents anyway, and we are stuck.

The second problem is that the text flow in the LaTeX source is very different from the actual rendered layout: there are many macros, applied styles, etc. Since we can't emulate all the rendering mechanisms, in practice the challenge is to align the actual PDF content with the LaTeX structures via their normalization by LaTeXML (the input of arxiv-vanity). However, the PDF content is so noisy, and the order of its content can differ so much from the source XML, that alignment is very difficult and the failure rate is very high. If we only keep the subset of documents that are easy to align, we bias the training data toward simple cases, and the model performs badly on new documents.
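To illustrate the alignment problem on a toy pair of strings (a deliberately simplified sketch, not our actual pipeline), a sequence aligner can project a character offset from the clean XML-side text onto the noisy PDF-side text:

```python
# Toy illustration of XML/PDF text alignment with difflib.
# Real PDF extraction adds hyphenation, reordering, and ligature damage
# that make this far less reliable than the example suggests.
from difflib import SequenceMatcher

xml_text = "Chess [Silver et al., 2017a], Go [Silver et al., 2017b]"
# The PDF side typically carries hyphenation and stray spaces:
pdf_text = "Chess [Sil- ver et al. , 2017a], Go [Silver et al., 2017b]"

def to_pdf_offset(xml_pos, xml_s, pdf_s):
    """Map a character offset in the XML text to the PDF text via the
    aligned blocks; return None when the character falls in an
    unaligned region (an alignment failure)."""
    sm = SequenceMatcher(None, xml_s, pdf_s, autojunk=False)
    for a, b, size in sm.get_matching_blocks():
        if a <= xml_pos < a + size:
            return b + (xml_pos - a)
    return None
```

A labeled citation span on the XML side can then be transferred offset by offset; every unmapped offset is a failure of the kind described above, and whole documents fail when the content order diverges too much.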

However, it might be interesting to re-investigate this approach; I didn't spend a lot of time exploring it, and it might be less complicated than I first estimated.

A simpler alternative we discussed at some point was to mark certain structures with PDF annotations in the PDF produced by LaTeX. We could then create training data directly from this PDF, because we would have at the same time the crazy noisy content and the expected labeling, by looking at the PDF annotation layers (it's a bit similar to your point 2). It would require writing modified LaTeX styles that generate these additional annotations, which might still be a lot of work.


  2. Instead of using NLP to process the text and extract the citations, can we take advantage of the embedded links in the PDF? Every citation in the two examples above is clickable. I believe even some hand-crafted rules that rely on the in-text hyperlinks would provide highly accurate extractions for arXiv. I'm not familiar with PDF processing, so please correct me if I'm wrong.

About 2., using the PDF annotations to help the fulltext model is a planned improvement. We already capture the PDF annotation boxes and map them to the annotated text tokens. The next step for me is to add this information as additional features for the fulltext model. I think this will normally be more portable, and more accurate on average, than introducing ad hoc rules that might not work for non-arXiv documents... OK, not sure about the accuracy, in particular for arXiv articles :)
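As a minimal sketch of what such a feature could look like (invented token boxes and function names for illustration, not Grobid's actual code):

```python
# Sketch: mark each token with whether its bounding box intersects a PDF
# link annotation box, so a sequence model gets "this token is clickable"
# as an extra feature. Boxes are (x0, y0, x1, y1); data is made up.
def overlaps(a, b):
    """True if two (x0, y0, x1, y1) rectangles intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def add_link_feature(tokens, link_boxes):
    """tokens: list of (text, box); returns list of (text, in_link)."""
    return [(text, any(overlaps(box, lb) for lb in link_boxes))
            for text, box in tokens]

tokens = [("Chess", (10, 10, 40, 20)),
          ("[Silver", (42, 10, 80, 20)),
          ("2017a]", (82, 10, 115, 20))]
link_boxes = [(41, 9, 116, 21)]   # hyperref link rectangle
features = add_link_feature(tokens, link_boxes)
# features -> [("Chess", False), ("[Silver", True), ("2017a]", True)]
```

The binary flag would simply be appended to the existing per-token feature vector before training.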


  3. Are there any forums (e.g. Slack, Discord) that I can join to discuss development?

Yes, there is a Grobid development channel on an Inria Mattermost. If you send me your email (mine is in the README), I can invite you to this Mattermost channel. You are very welcome!

DrJimFan commented 4 years ago

Just sent you an email. Thank you so much!