kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0
3.48k stars 448 forks

can't analyze some types of articles in which references are not at the last page. #257

Open yaojl2006 opened 6 years ago

yaojl2006 commented 6 years ago

In some articles the references appear at the bottom of every page rather than at the end of the document, and GROBID can't analyze these well. For example, this article: Spin Density Waves in an Electron Gas

....than 2g. This minimum thermal gap may indeed be quite small because a SDW instability can be greater the smaller the magnitude of Q, provided that the ———————— '0 J.Bardeen, L. N. Cooper, and J.R. Schrieffer, Phys. Rev. 108, &&7' 5 (&957). J. C. Swihart, IBM J. Research Develop. 6, |4 (1962); Phys. Rev. 116, 346 (1959).

I'm trying to retrain the CRF model to overcome this problem, but I don't have enough training data, so I'm not sure what the precision will be in the end. :(

kermitt2 commented 6 years ago

Hello!

Yes, bibliographical references introduced as footnotes are currently not identified well, because of the lack of training data covering these cases (normally rare in STEM but quite frequent in the Humanities). In the current framework, this requires more training data for the segmentation model, which is currently time-consuming to produce.

shriharsh-a commented 6 years ago

Hello @kermitt2, thank you for such an amazing tool. I have a question about bibliographical references in footnotes and need your help. I want to train GROBID on articles (Social Science research papers) that contain references in the footnotes. As mentioned above, references placed in footnotes are not identified. So, to train the model, do you suggest editing the output of the segmentation model where it failed to identify such footnote sections and tagging them under ? Awaiting your response.

For example, references 1 and 2 in the footnote were not identified. image

the corresponding segmentation output for the above image.

image

kermitt2 commented 6 years ago

Hello @ShriharshAmbhore

Thanks for your question.

Indeed, the segmentation model performs the task of locating the bibliographical references, whether they appear as a section at the end of the document or as footnotes, as is typical in SSR papers. More training examples for bibliographical references in footnotes are clearly needed.

We could make the guidelines for annotating training data for the segmentation model more precise if they are not clear. The label to use to identify the bibliographical areas is <listBibl>.

So, basically, after producing the training data files from a PDF, you can edit the *.segmentation.tei.xml file and when you find a bibliographical citation in footnotes not identified as such, simply close the body section (</body>) and start a <listBibl> section around the bibliographical references in footnotes.
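
As a rough illustration of the edit described above, here is a minimal Python sketch. It assumes the footnote reference lines sit inside a `<body>` block of the segmentation file and that you can supply them as an exact text snippet. The helper name `wrap_footnote_refs` is hypothetical and not part of GROBID; doing the same edit by hand in a text editor works just as well.

```python
import re


def wrap_footnote_refs(tei: str, snippet: str) -> str:
    """Close <body> before `snippet`, wrap it in <listBibl>, reopen <body> after.

    Hypothetical helper, not part of GROBID. `snippet` must match the file
    text exactly (including any <lb/> markers) and lie inside a <body> block.
    """
    start = tei.find(snippet)
    if start == -1:
        raise ValueError("snippet not found in the segmentation file")
    # The snippet is inside <body> only if the nearest preceding body tag opens one.
    if tei.rfind("<body>", 0, start) <= tei.rfind("</body>", 0, start):
        raise ValueError("snippet is not inside a <body> block")
    end = start + len(snippet)
    edited = (
        tei[:start]
        + "</body>\n<listBibl>" + snippet + "</listBibl>\n<body>"
        + tei[end:]
    )
    # Drop an empty <body></body> pair left over when the snippet was at a body edge.
    return re.sub(r"<body>\s*</body>\n?", "", edited)
```

The function only touches the first occurrence of the snippet, so a page with several footnote blocks needs one call per block, and the result should still be reviewed by eye.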

However it's important that the complete segmentation training file is well labelled. So the other areas have to be checked too.
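
A quick sanity check can catch some mistakes before retraining. The sketch below is only an assumption about what is useful here: it verifies that an edited file is still well-formed XML and counts the labelled areas directly under the `<text>` element, so a missing or unexpected `<listBibl>` stands out. The function name `label_counts` is hypothetical, and passing this check does not mean the labelling itself is correct; that still needs a manual review.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def label_counts(tei_xml: str) -> Counter:
    """Count the labelled areas directly under <text> in a segmentation file.

    Hypothetical helper, not part of GROBID. Raises ET.ParseError when the
    file is not well-formed, e.g. after a forgotten closing tag.
    """
    root = ET.fromstring(tei_xml)

    def local(tag: str) -> str:
        # Strip a namespace prefix such as "{http://...}body" if present.
        return tag.rsplit("}", 1)[-1]

    text = next((el for el in root.iter() if local(el.tag) == "text"), None)
    if text is None:
        raise ValueError("no <text> element found")
    return Counter(local(child.tag) for child in text)
```

For a file on disk, read it first, for example with `pathlib.Path(path).read_text(encoding="utf-8")`, and pass the resulting string in.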

Then you can move the file(s) into the training data folder for the segmentation model (as documented) and retrain the segmentation model to capture your new examples.
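
The copy step could be scripted along these lines. This is a sketch under assumptions, not the documented procedure: it assumes the corpus folder has `tei` and `raw` subfolders, and that each `*.segmentation.tei.xml` file has a feature file with the same name minus the `.tei.xml` suffix. Check both against the GROBID training documentation for your version. The function name and both directory arguments are hypothetical.

```python
import shutil
from pathlib import Path


def install_training_files(src_dir: str, corpus_dir: str) -> list:
    """Copy corrected segmentation files and their feature files into a corpus.

    Hypothetical helper. Assumed layout: <corpus_dir>/tei for the annotated
    XML and <corpus_dir>/raw for the matching feature files.
    """
    src, corpus = Path(src_dir), Path(corpus_dir)
    tei_dir, raw_dir = corpus / "tei", corpus / "raw"
    tei_dir.mkdir(parents=True, exist_ok=True)
    raw_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    for tei in sorted(src.glob("*.segmentation.tei.xml")):
        # Assumed pairing: same file name without the ".tei.xml" suffix.
        raw = tei.with_name(tei.name[: -len(".tei.xml")])
        if not raw.exists():
            continue  # skip files without a matching feature file
        shutil.copy2(tei, tei_dir / tei.name)
        shutil.copy2(raw, raw_dir / raw.name)
        copied.append(tei.name)
    return copied
```

Files without a matching feature file are skipped rather than copied alone, on the assumption that the trainer needs both halves of each pair. Retraining itself is left to the documented procedure.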

If these are reusable OA articles (CC-0 or CC-BY), we could even add the training data to the public repository so that everybody can take advantage of this improvement :) I would then review the annotations so that they are examined by at least two different people.

shriharsh-a commented 5 years ago

Hello @kermitt2, thank you for the response. Your suggestion helped a lot. Regarding adding more training data to the public repository: the reference papers from the Humanities are available at the link below. Additional training data More info on project: EXCITE