Missing reference segmenter training files

It seems that six reference segmenter training files are missing:

Raw file [grobid-install-dir]/grobid-home/../grobid-trainer/resources/dataset/reference-segmenter/corpus/raw/4.v90-043.training.referenceSegmenter does not exist. Please have a look!
Raw file [grobid-install-dir]/grobid-home/../grobid-trainer/resources/dataset/reference-segmenter/corpus/raw/6.v90-133.training.referenceSegmenter does not exist. Please have a look!
Raw file [grobid-install-dir]/grobid-home/../grobid-trainer/resources/dataset/reference-segmenter/corpus/raw/7.v90-084.training.referenceSegmenter does not exist. Please have a look!
Raw file [grobid-install-dir]/grobid-home/../grobid-trainer/resources/dataset/reference-segmenter/corpus/raw/8.v89-145.training.referenceSegmenter does not exist. Please have a look!
Raw file [grobid-install-dir]/grobid-home/../grobid-trainer/resources/dataset/reference-segmenter/corpus/raw/9.v89-169.training.referenceSegmenter does not exist. Please have a look!
Raw file [grobid-install-dir]/grobid-home/../grobid-trainer/resources/dataset/reference-segmenter/corpus/raw/submission_83.training.referenceSegmenter does not exist. Please have a look!

My goal is to improve reference segmentation: in the material I'm processing (mainly dissertations), tables of contents and abbreviation lists are currently misrecognized as references. Adding new reference segmenter training data improves the results, but it occurred to me to ask: should I also retrain the segmenter model?

Hello,

Indeed, these files contain the features used for training, so they can be regenerated from the PDF files. I will have a look.
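In the meantime, a small script along these lines could list which raw feature files are absent. This is only a sketch under assumptions: it supposes that the corpus directory has a `tei` folder next to the `raw` folder named in the log, and that each annotated file is called `<name>.tei.xml` for a raw file called `<name>`. The function name `find_missing_raw` is made up for illustration, and the script only reports gaps; it does not regenerate anything.

```python
from pathlib import Path

# Assumption: annotated files are named "<raw file name>.tei.xml".
TEI_SUFFIX = ".tei.xml"


def find_missing_raw(corpus_dir):
    """Return the names of raw feature files that have a TEI file but no raw counterpart.

    corpus_dir is expected to contain a "tei" and a "raw" subdirectory
    (the "tei" directory name is an assumption, "raw" comes from the log).
    """
    corpus = Path(corpus_dir)
    tei_dir = corpus / "tei"
    raw_dir = corpus / "raw"

    missing = []
    for tei_file in sorted(tei_dir.glob("*" + TEI_SUFFIX)):
        # Strip the ".tei.xml" suffix to get the expected raw file name.
        raw_name = tei_file.name[: -len(TEI_SUFFIX)]
        if not (raw_dir / raw_name).is_file():
            missing.append(raw_name)
    return missing
```

Calling `find_missing_raw(".../reference-segmenter/corpus")` would then return a list of file names such as the six reported above, if the layout matches these assumptions.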

Actually, the reference-segmenter model only segments the reference section into individual references. It is the segmentation model that segments a document into sections such as header, body, footnotes, reference sections, etc.

Models are used in cascade: the first model is the segmentation model, which is, I think, the one you are interested in. I am planning to add annotation guidelines shortly for the segmentation model and the fulltext model (in charge of segmenting the body of the document). We really don't have enough training data for these two models currently, so I will make improving that a priority.
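To make the cascade idea concrete, here is a toy sketch. It is not GROBID code: the real models are trained sequence labellers, while the functions below (`segment_document`, `segment_references`, `run_cascade`) are hypothetical rule-based stand-ins, and the zone names are simplified. The point is only that the second stage sees whatever the first stage labelled as references, so a table of contents mislabelled in the first stage cannot be fixed in the second.

```python
def segment_document(lines):
    """Toy stand-in for the segmentation model: assign each line to a coarse zone."""
    zones = {"header": [], "body": [], "references": []}
    current = "header"
    for line in lines:
        heading = line.strip().lower()
        if heading == "introduction":
            current = "body"
        elif heading == "references":
            current = "references"
            continue  # the heading itself is not a reference
        zones[current].append(line)
    return zones


def segment_references(reference_lines):
    """Toy stand-in for the reference-segmenter: split a reference zone into entries."""
    refs = []
    for line in reference_lines:
        if line.startswith("[") or not refs:
            refs.append(line)  # a new entry starts here
        else:
            refs[-1] += " " + line  # continuation of the previous entry
    return refs


def run_cascade(lines):
    """Run the two stages in order: zones first, then individual references."""
    zones = segment_document(lines)
    return segment_references(zones["references"])
```

In this sketch, `segment_references` never looks at the header or body zones, which mirrors why new reference-segmenter training data alone would not stop a table of contents from being treated as references.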

kermitt2 / grobid

Missing reference segmenter training files #162