kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

RFC: generating training data automatically through BibTeX #147

Open michamos opened 7 years ago

michamos commented 7 years ago

The back-story

At INSPIRE, we recently looked at the latest version of grobid 0.4.1 for our needs of extracting references from PDF files. While some aspects are really impressive (reference segmentation, title and author recognition), it falls short in several other aspects that are crucial for us, as summarized by @kaplun in inspirehep/inspire-next#1704.

So we would need to train the citation model on a corpus that is more similar to what we encounter, with a particular eye on extra data that grobid cannot currently recognize (arXiv IDs, report numbers, DOIs).

The proposal

Instead of doing this tediously by hand on real papers, we were thinking of generating the training data through BibTeX (by the way, I could not find a way to generate training data for the "citations" model from PDFs in the docs; the closest I found is createTrainingReferenceSegmentation) in the following way:

  0. Edit a set of common BibTeX styles to emit the tags (author, journal title, etc.) alongside the actual references
  1. Generate a bibliography in BibTeX format by picking papers semi-randomly from our database
  2. Generate the references by running it through BibTeX with a style file enhanced as in 0.
  3. Convert the LaTeX output of 2. to TEI by simple substitution of LaTeX commands with XML tags, plus cleanup (e.g. ~, which means non-breaking space, becomes a plain space)
  4. Repeat for other style files
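As a rough illustration of step 3, the substitution pass could be a short script like the following. The `\grb…` macro names and the tag mapping are hypothetical, standing in for whatever markup the modified style files would actually emit (real GROBID TEI uses more specific tags such as `<title level="a">`):

```python
import re

# Hypothetical macros emitted by a modified BibTeX style file,
# mapped to illustrative XML tag names.
LATEX_TO_TEI = {
    "grbAuthor": "author",
    "grbTitle": "title",
}

def latex_to_tei(line):
    """Convert one tagged reference line to a TEI-like <bibl> element."""
    for macro, tag in LATEX_TO_TEI.items():
        # \grbAuthor{...} -> <author>...</author>, etc.
        line = re.sub(r"\\" + macro + r"\{([^}]*)\}",
                      rf"<{tag}>\1</{tag}>", line)
    line = line.replace("~", " ")  # non-breaking space -> plain space
    return "<bibl>" + line + "</bibl>"

print(latex_to_tei(r"\grbAuthor{G. Cowan}, \grbTitle{Asymptotic formulae}~(2011)"))
# -> <bibl><author>G. Cowan</author>, <title>Asymptotic formulae</title> (2011)</bibl>
```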

In this way, we could generate a large corpus with little work. By using the publisher styles plus the most popular user styles, I expect we could mimic quite well the real-world PDFs that we encounter.

Open questions

lfoppiano commented 7 years ago

Dear @michamos, looking at the examples you've provided in inspirehep/inspire-next#1704, it looks like more training data is required for the citation model rather than the reference segmentation model.

The reference segmentation model recognises references within the whole document.
See three different examples (arXiv identifier, report note, DOI):

        <bibl><label>[1]</label> ETM Collaboration, C. Helmes et al., JHEP 09, 109 (2015), arXiv:1506.00408 [hep-lat].<lb/> </bibl>
        <bibl><label>[25]</label> ATLAS and CMS Collaborations, &quot; Procedure for the LHC Higgs boson search<lb/> combination in Summer 2011 &quot; , Technical Report ATL-PHYS-PUB 2011-11, CMS NOTE<lb/> 2011/005, 2011.<lb/> </bibl>
        <bibl><label>[26]</label> G. Cowan, K. Cranmer, E. Gross, and O. Vitells, &quot; Asymptotic formulae for<lb/> likelihood-based tests of new physics &quot; , Eur. Phys. J. C 71 (2011) 1554,<lb/> doi:10.1140/epjc/s10052-011-1554-0, arXiv:1007.1727.<lb/> </bibl>

The DOIs and URLs are all within the references, therefore the problem lies in the citation model, which is used to parse the various pieces within a reference.

Looking at the training corpus, it seems there is no training data covering DOIs and very few examples with arXiv references (you can see this in the citation training corpus).

Example:

[screenshot of the citation training corpus, 2016-11-30]

Regarding your proposal to automatically generate training data: it could work, because this model is pretty simple and so is its training data. I'm not sure you could really use the BibTeX file, though. I didn't find an example of BibTeX where the additional information (DOI, arXiv ID, report number) is present:

@article{cowan2011asymptotic,
  title={Asymptotic formulae for likelihood-based tests of new physics},
  author={Cowan, Glen and Cranmer, Kyle and Gross, Eilam and Vitells, Ofer},
  journal={The European Physical Journal C},
  volume={71},
  number={2},
  pages={1--19},
  year={2011},
  publisher={Springer}
}

To answer some of your questions:

  1. For the citation model, the PDF is not required anymore, because it works with plain text.
  2. I personally don't think that many examples are required, because the elements you want to recognise are pretty standard. The best would be to add examples to the evaluation and training sets and see the results.
kermitt2 commented 7 years ago

Hello @michamos

So as noted also by you and @lfoppiano, there are indeed almost no examples of DOIs/arXiv IDs currently in the training data for the citation model (the model related to this information). It's rare to see DOIs in bibliographical references, and the training data for citations does not in general come from arXiv articles.

I agree with @lfoppiano that the simplest solution would be to add more examples with these identifiers to this model, and because they are very simple patterns, they should be learnt from relatively few examples. Simple, but boring data to annotate/correct ;)

If we have the DOI/arXiv ID in the BibTeX file, I think your proposal can also work for generating reliable training data for the citation model because, as mentioned by @lfoppiano, the citation model works with text input. Normally the complicated part would be step 3. If I understand your proposal correctly, the idea is to modify the BibTeX style files (step 0) in order to generate the references augmented with the tags. From this modified output, converting to TEI would be only simple substitution/cleaning :+1:
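(For illustration, a BibTeX entry carrying these identifiers might look like the following, reusing the DOI and arXiv number quoted in example [26] above; the `doi`/`eprint`/`archivePrefix` field names follow common arXiv/INSPIRE export conventions and are an assumption here:)

```bibtex
@article{cowan2011asymptotic,
  title         = {Asymptotic formulae for likelihood-based tests of new physics},
  author        = {Cowan, Glen and Cranmer, Kyle and Gross, Eilam and Vitells, Ofer},
  journal       = {The European Physical Journal C},
  volume        = {71},
  year          = {2011},
  doi           = {10.1140/epjc/s10052-011-1554-0},
  eprint        = {1007.1727},
  archivePrefix = {arXiv}
}
```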

The possible difficulty I see is covering enough different BibTeX style files to get a good variety of citations. But possibly with just a couple of citation styles, the CRF will be able to generalize enough to address these identifiers in all bibliographical references.

The next really interesting question, I think, is: could we use the same approach for a model that exploits the layout to structure pieces of documents (like the reference-segmentation model, the full-text model or the header model)? This is important, because we really lack training data for all these models.

If we augment the LaTeX output with extra tags to keep track of the fields, we will modify the layout of the PDF, and then we cannot use the PDF for reliable training. An alternative could be to keep track of the "fields" via annotations added to the PDF by the modified LaTeX style, i.e. annotations in the annotation layer of the PDF. In the last weeks I have added to GROBID the capture of all the annotations present in the PDF, and the possibility to align the annotated text content in the text layer with the annotation, by matching the coordinates of the two objects in the PDF.
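A minimal sketch of that coordinate matching, assuming both the text token and the annotation come with (x0, y0, x1, y1) bounding boxes in the same PDF coordinate space (the function name and tolerance are illustrative, not GROBID's actual API):

```python
def token_in_annotation(token_box, annot_box, tol=1.0):
    """Return True if the token's bounding box lies (within a small
    tolerance) inside the annotation's rectangle.  Boxes are
    (x0, y0, x1, y1) tuples in the same PDF coordinate space."""
    tx0, ty0, tx1, ty1 = token_box
    ax0, ay0, ax1, ay1 = annot_box
    return (tx0 >= ax0 - tol and ty0 >= ay0 - tol
            and tx1 <= ax1 + tol and ty1 <= ay1 + tol)

# Tokens whose boxes fall inside an annotation would inherit its label.
tokens = [("Cowan", (72.0, 700.0, 100.0, 712.0)),
          ("2011", (300.0, 700.0, 320.0, 712.0))]
author_annot = (70.0, 698.0, 150.0, 714.0)
labelled = [t for t, box in tokens if token_in_annotation(box, author_annot)]
print(labelled)  # ['Cowan']
```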

So at some point, if your approach for citation works, I would be very interested to then investigate the exploitation of TeX files for other models with the PDF annotation-layer approach.