Open yaojl2006 opened 7 years ago
Hello!
Yes, bibliographical references introduced as footnotes are currently not identified well because of the lack of training data covering these cases (normally rare in STEM but quite frequent in the Humanities). In the current framework, this would require more training data for the segmentation model, which is currently time-consuming to produce.
Hello @kermitt2 ,
Thanks for such an amazing tool. I have a question about bibliographical references in footnotes and need your help. I want to train GROBID on articles (Social Science research papers) that contain references in the footnotes. As mentioned above, such references in footnotes are not identified. So, to train the model, do you suggest editing the output of the segmentation model where it failed to identify such footnote sections and tagging them under
For example, references 1 and 2 in the footnote were not identified.
the corresponding segmentation output for the above image.
Hello @ShriharshAmbhore
Thanks for your question.
Indeed, the segmentation model performs the task of locating the bibliographical references, either as a section at the end of the document or as footnotes, as is typical in SSR papers. More training examples for bibliographical references in footnotes are clearly needed.
We could add more precision in the guidelines for annotating training data for the segmentation model if it's not clear. The label to use to identify the bibliographical areas is <listBibl>.
So, basically, after producing the training data files from a PDF, you can edit the *.segmentation.tei.xml file and, when you find a bibliographical citation in footnotes not identified as such, simply close the body section (</body>) and start a <listBibl> section around the bibliographical references in footnotes.
However it's important that the complete segmentation training file is well labelled. So the other areas have to be checked too.
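As a rough aid for this kind of checking, here is a minimal sketch that parses an edited segmentation file and lists the zones found directly under <text>, so a footnote block wrapped in <listBibl> between two <body> sections is easy to spot. The sample fragment, the function name `zone_sequence`, and the assumption that zones sit directly under a <text> element are illustrative only and simplified from real training files; this is not part of GROBID itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified fragment of an edited
# *.segmentation.tei.xml file (real files contain more zones and content).
SAMPLE = """<tei>
  <text xml:lang="en">
    <body>Main text of the first page<lb/></body>
    <listBibl>1 A. Author, Some Journal 1, 2 (1999).<lb/></listBibl>
    <body>Main text continues on the next page<lb/></body>
  </text>
</tei>"""


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def zone_sequence(xml_string):
    """Return the labels of the zones directly under <text>, in order.

    ET.fromstring raises xml.etree.ElementTree.ParseError when the file is
    not well-formed, e.g. when a </body> was added without reopening it.
    """
    root = ET.fromstring(xml_string)
    text = next((el for el in root.iter() if local_name(el.tag) == "text"), None)
    if text is None:
        raise ValueError("no <text> element found")
    return [local_name(child.tag) for child in text]


print(zone_sequence(SAMPLE))  # prints ['body', 'listBibl', 'body']
```

This only confirms that the file is well-formed and shows where the <listBibl> zones are; whether each zone is labelled correctly still has to be checked by reading the file.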
Then you can move the file(s) into the training data folder for the segmentation model (as documented) and retrain the segmentation model to capture your new examples.
If these are reusable OA articles (CC-0 or CC-BY), we could even add this training data to the public repository so that everybody can take advantage of the improvement :) I would then review the annotations so that they are examined by at least two different people.
Hello @kermitt2, thank you for the response. Your suggestion helped a lot. Regarding adding more training data to the public repository, the reference papers from the Humanities are available at the link mentioned below. Additional training data. More info on the project: EXCITE
In some articles the references are placed at the bottom of every page, which GROBID can't analyze well. One example is this article: Spin Density Waves in an Electron Gas
....than 2g. This minimum thermal gap may indeed be quite small because a SDW instability can be greater the smaller the magnitude of Q, provided that the ———————— '0 J.Bardeen, L. N. Cooper, and J.R. Schrieffer, Phys. Rev. 108, &&7' 5 (&957). J. C. Swihart, IBM J. Research Develop. 6, |4 (1962); Phys. Rev. 116, 346 (1959).
I'm trying to retrain the CRF model to overcome this problem, but I don't have enough training data, so I'm not sure what precision I will get in the end. :(