kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Missing reference segmenter training files #162

Open mjlassila opened 7 years ago

mjlassila commented 7 years ago

It seems that six reference segmenter training files are missing:

Raw file [grobid-install-dir]/grobid-home/../grobid-trainer/resources/dataset/reference-segmenter/corpus/raw/4.v90-043.training.referenceSegmenter does not exist. Please have a look!
Raw file [grobid-install-dir]/grobid-home/../grobid-trainer/resources/dataset/reference-segmenter/corpus/raw/6.v90-133.training.referenceSegmenter does not exist. Please have a look!
Raw file [grobid-install-dir]/grobid-home/../grobid-trainer/resources/dataset/reference-segmenter/corpus/raw/7.v90-084.training.referenceSegmenter does not exist. Please have a look!
Raw file [grobid-install-dir]/grobid-home/../grobid-trainer/resources/dataset/reference-segmenter/corpus/raw/8.v89-145.training.referenceSegmenter does not exist. Please have a look!
Raw file [grobid-install-dir]/grobid-home/../grobid-trainer/resources/dataset/reference-segmenter/corpus/raw/9.v89-169.training.referenceSegmenter does not exist. Please have a look!
Raw file [grobid-install-dir]/grobid-home/../grobid-trainer/resources/dataset/reference-segmenter/corpus/raw/submission_83.training.referenceSegmenter does not exist. Please have a look!
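
For anyone wanting to confirm which raw feature files are absent in their own checkout, a check along these lines may help. This is only a minimal sketch: the file names are copied from the messages above, the relative directory is assumed from the reported paths, and `missing_raw_files` is a hypothetical helper name, not part of Grobid.

```python
from pathlib import Path

# File names copied from the error messages in this report.
EXPECTED_RAW_FILES = [
    "4.v90-043.training.referenceSegmenter",
    "6.v90-133.training.referenceSegmenter",
    "7.v90-084.training.referenceSegmenter",
    "8.v89-145.training.referenceSegmenter",
    "9.v89-169.training.referenceSegmenter",
    "submission_83.training.referenceSegmenter",
]


def missing_raw_files(raw_dir, expected=EXPECTED_RAW_FILES):
    """Return the expected file names that are not present in raw_dir."""
    raw_dir = Path(raw_dir)
    return [name for name in expected if not (raw_dir / name).is_file()]


if __name__ == "__main__":
    # Assumed location, relative to the Grobid install directory.
    raw = Path("grobid-trainer/resources/dataset/reference-segmenter/corpus/raw")
    for name in missing_raw_files(raw):
        print("missing:", name)
```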

My intention is to improve reference segmentation, because with the material I'm processing (mainly dissertations), tables of contents and abbreviation lists are currently misrecognized as references. Adding new reference segmenter training data improves results, but should I also retrain the segmenter model?

kermitt2 commented 7 years ago

Hello,

Indeed, these files contain the features for training so they can be regenerated from the PDF files. I will have a look.

Actually, the reference-segmenter model only segments the reference section into individual references. It is the segmentation model that segments a document into sections such as header, body, footnotes, reference sections, etc.

Models are used in cascade; the first model is the segmentation model, which is, I think, what you are interested in. I am planning to add shortly some annotation guidelines for the segmentation model and the fulltext model (in charge of segmenting the body of the document). We really don't have enough training data for these two models currently, so I will try to improve that as a priority.
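
To make the cascade idea concrete, here is a toy illustration. It is not Grobid code and uses no Grobid API: `segment_document`, `segment_references` and `run_cascade` are hypothetical names, and simple rules stand in for the trained models. The only point is that the second stage sees just the zone that the first stage labelled as references.

```python
import re


def segment_document(lines):
    """Toy stand-in for the segmentation model: assign each line to a zone."""
    zones = {"header": [], "body": [], "references": []}
    current = "header"
    for line in lines:
        stripped = line.strip()
        if stripped.lower() == "references":
            # A "References" heading opens the reference zone.
            current = "references"
            continue
        if current == "header" and stripped == "":
            # First blank line ends the header (an arbitrary toy rule).
            current = "body"
            continue
        if stripped:
            zones[current].append(stripped)
    return zones


def segment_references(reference_lines):
    """Toy stand-in for the reference-segmenter: split a zone into references.

    Assumes each reference starts with a marker such as "[1]".
    """
    refs = []
    for line in reference_lines:
        if re.match(r"\[\d+\]", line) or not refs:
            refs.append(line)
        else:
            # Continuation line: attach it to the current reference.
            refs[-1] += " " + line
    return refs


def run_cascade(lines):
    """Run both stages: the second only receives the first stage's output."""
    zones = segment_document(lines)
    return segment_references(zones["references"])
```

In this sketch, anything the first stage wrongly puts in the references zone (for example a table of contents) is passed on and split as if it were references; the second stage has no way to reject it.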