michamos opened this issue 7 years ago
Dear @michamos, looking at the examples you provided in inspirehep/inspire-next#1704, it looks like more training data is required for the citation model rather than for the reference segmentation model.
The ReferenceSegmentationModel recognises references within the whole document.
See three different examples (arXiv ID, report number, DOI):
<bibl><label>[1]</label> ETM Collaboration, C. Helmes et al., JHEP 09, 109 (2015), arXiv:1506.00408 [hep-lat].<lb/> </bibl>
<bibl><label>[25]</label> ATLAS and CMS Collaborations, " Procedure for the LHC Higgs boson search<lb/> combination in Summer 2011 " , Technical Report ATL-PHYS-PUB 2011-11, CMS NOTE<lb/> 2011/005, 2011.<lb/> </bibl>
<bibl><label>[26]</label> G. Cowan, K. Cranmer, E. Gross, and O. Vitells, " Asymptotic formulae for<lb/> likelihood-based tests of new physics " , Eur. Phys. J. C 71 (2011) 1554,<lb/> doi:10.1140/epjc/s10052-011-1554-0, arXiv:1007.1727.<lb/> </bibl>
The DOIs and URLs are all within the recognised references, therefore the problem lies in the citation model, which is used to parse the various pieces within each reference.
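To illustrate that the identifiers are present verbatim in the segmented reference text (so the citation model is the right place to improve), here is a minimal sketch that pulls a DOI and an arXiv ID out of one of the references above. The regexes are simplified approximations for illustration, not GROBID's actual patterns.

```python
import re

# Simplified patterns; robust DOI/arXiv matching needs more care in practice.
DOI_RE = re.compile(r'\b10\.\d{4,9}/\S+?(?=[,.]?\s|$)')
ARXIV_RE = re.compile(r'\barXiv:\d{4}\.\d{4,5}')

ref = ('G. Cowan, K. Cranmer, E. Gross, and O. Vitells, "Asymptotic formulae for '
       'likelihood-based tests of new physics", Eur. Phys. J. C 71 (2011) 1554, '
       'doi:10.1140/epjc/s10052-011-1554-0, arXiv:1007.1727.')

print(DOI_RE.search(ref).group(0))    # -> 10.1140/epjc/s10052-011-1554-0
print(ARXIV_RE.search(ref).group(0))  # -> arXiv:1007.1727
```

The identifiers survive segmentation intact; what is missing is a citation model trained to label them.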
Looking at the training corpus, it seems there is no training data covering DOIs and very little covering arXiv references (you can see this in the citation training corpus).
Regarding your proposal to automatically generate training data: it could work, because this model is pretty simple and so is its training data. I'm not sure you could really use the BibTeX file, though. I didn't find an example of a BibTeX entry where the additional information (DOI, arXiv ID, report number) is present:
@article{cowan2011asymptotic,
title={Asymptotic formulae for likelihood-based tests of new physics},
author={Cowan, Glen and Cranmer, Kyle and Gross, Eilam and Vitells, Ofer},
journal={The European Physical Journal C},
volume={71},
number={2},
pages={1--19},
year={2011},
publisher={Springer}
}
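A quick way to check which fields a BibTeX entry actually carries is to parse it naively and look for the identifier fields. This is a stdlib-only sketch: the one-field-per-line regex ignores nested braces and other legal BibTeX syntax, so it only works for simple entries like the one above.

```python
import re

BIBTEX = """@article{cowan2011asymptotic,
title={Asymptotic formulae for likelihood-based tests of new physics},
author={Cowan, Glen and Cranmer, Kyle and Gross, Eilam and Vitells, Ofer},
journal={The European Physical Journal C},
volume={71},
number={2},
pages={1--19},
year={2011},
publisher={Springer}
}"""

# Naive parser: assumes one brace-delimited field per line, no nested braces.
fields = dict(re.findall(r'^\s*(\w+)\s*=\s*\{(.*)\}', BIBTEX, re.MULTILINE))

# The identifier fields we would need are simply absent from this entry.
for key in ('doi', 'eprint', 'archivePrefix'):
    print(key, 'present' if key in fields else 'missing')
```

All three identifier fields print as missing, which is exactly the problem with relying on typical BibTeX files.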
To answer one of your questions:
Hello @michamos
So, as also noted by you and @lfoppiano, there are indeed almost no examples of DOIs/arXiv IDs currently in the training data for the citation model (the model responsible for this information). It's rare to see DOIs in bibliographical references, and the training data for citations generally does not come from arXiv articles.
I agree with @lfoppiano that the simplest solution would be to add more examples with these identifiers to this model's training data, and because they are very simple patterns, they should be learnt from relatively few examples. Simple, but boring data to annotate/correct ;)
If the BibTeX file contains the DOI/arXiv ID, I think your proposal could also work for generating reliable training data for the citation model, because, as mentioned by @lfoppiano, the citation model works on text input. Normally the complicated part would be step 3. If I understand your proposal correctly, the idea is to modify the BibTeX style files (step 0) so that they generate references augmented with tags. From this modified output, converting to TEI would be simple substitution/cleaning :+1:
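The substitution step could look roughly like the sketch below. Everything here is an assumption for illustration: the `@@field{...}@@` markers stand in for whatever delimiters a modified BibTeX `.bst` style would emit, and the TEI element names are illustrative rather than GROBID's exact citation training schema.

```python
import re

# Hypothetical field markers emitted by a modified BibTeX style, mapped to
# illustrative TEI-style elements (names are assumptions, not GROBID's schema).
TAG_TO_TEI = {
    'doi': '<idno type="DOI">{}</idno>',
    'arxiv': '<idno type="arXiv">{}</idno>',
    'title': '<title level="a">{}</title>',
}

def to_tei(tagged_ref):
    """Replace @@field{value}@@ markers with TEI-style elements."""
    def sub(m):
        field, value = m.group(1), m.group(2)
        return TAG_TO_TEI.get(field, '{}').format(value)
    return re.sub(r'@@(\w+)\{(.*?)\}@@', sub, tagged_ref)

ref = ('G. Cowan et al., @@title{Asymptotic formulae for likelihood-based tests '
       'of new physics}@@, Eur. Phys. J. C 71 (2011) 1554, '
       'doi:@@doi{10.1140/epjc/s10052-011-1554-0}@@, @@arxiv{arXiv:1007.1727}@@.')

print(to_tei(ref))
```

The hard part remains step 0 (making the style files emit the markers); once that output exists, the conversion really is simple substitution.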
The possible difficulty I see is covering a sufficient variety of BibTeX style files to get a good variety of citations. But possibly, with just a couple of citation styles, the CRF will be able to generalize enough to address these identifiers in all bibliographical references.
The next really interesting question, I think, is: could we use the same approach for a model that exploits the layout to structure pieces of documents (like the reference segmentation model, the fulltext model, or the header model)? This is important, because we really lack training data for all these models.
If we augment the LaTeX output with extra tags to keep track of the fields, we modify the layout of the PDF, and then we cannot use the PDF for reliable training. An alternative would be to keep track of the "fields" via annotations added to the PDF by the modified LaTeX style, i.e. annotations in the annotation layer of the PDF. In the last weeks I have added to GROBID the capture of all annotations present in a PDF, together with the possibility to align the annotated text content in the text layer with the annotation, by matching the coordinates of the two objects in the PDF.
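The coordinate-matching idea can be sketched as plain rectangle intersection between token bounding boxes and annotation rectangles. The token and annotation structures below are hypothetical stand-ins; a real implementation would read both from the parsed PDF, as GROBID does internally.

```python
# Sketch of aligning text-layer tokens with annotation-layer rectangles.
# Data structures are hypothetical; rects are (x0, y0, x1, y1) in page coords.

def overlaps(a, b):
    """Axis-aligned rectangle intersection test."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def align(tokens, annotations):
    """Label each token with the annotation whose rectangle covers it."""
    labelled = []
    for text, box in tokens:
        label = next((name for name, rect in annotations if overlaps(box, rect)), None)
        labelled.append((text, label))
    return labelled

tokens = [("Cowan,", (100, 700, 130, 710)), ("10.1140/epjc", (300, 700, 360, 710))]
annotations = [("doi", (295, 698, 365, 712))]

print(align(tokens, annotations))
# -> [('Cowan,', None), ('10.1140/epjc', 'doi')]
```

Since the annotations live in a separate layer, the rendered text layout of the PDF is untouched, which is what makes the PDF usable for layout-model training.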
So, at some point, if your approach for citations works, I would be very interested in then investigating the exploitation of TeX files for the other models with this PDF annotation-layer approach.
The back-story
At INSPIRE, we recently looked at the latest version of grobid, 0.4.1, for our need to extract references from PDF files. While some aspects are really impressive (reference segmentation, title and author recognition), it falls short in several other areas that are crucial for us, as summarized by @kaplun in inspirehep/inspire-next#1704.
So we would need to train the citation model on a corpus that is more similar to what we encounter, with a particular eye on extra data that grobid cannot currently recognize (arXiv IDs, report numbers, DOIs).
The proposal
Instead of doing this tediously by hand on real papers, we were thinking of generating the training data through BibTeX (by the way, I could not find a way to generate training data for the "citations" model from PDFs in the docs; the closest I found is createTrainingReferenceSegmentation) in the following way:

In this way, we could generate a large corpus with little work. By using the publisher styles plus the most popular user styles, I expect we could mimic quite well the real-world PDFs that we encounter.
Open questions