kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

createTRAINING batch command #1149

Open ap-mps opened 3 months ago

ap-mps commented 3 months ago

When running this command, I noticed that for a certain PDF in the input directory, the files for the header model are not generated.

Why is that, and more generally, are there criteria that determine, model by model, which output files are generated for a given input PDF?

kermitt2 commented 3 months ago

Hello!

Normally it means that the PDF is image-only (Grobid does not include OCR; it has to be applied as a pre-processing step). Other possible explanations are an encrypted or a corrupted PDF. Finally, it is also possible that no header is detected by the segmentation model, which is applied first. In that last case, the corrected segmentation training file first has to be added to the segmentation training data, and the segmentation model updated.
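
To narrow down which PDFs are affected, something like the following could be run after the batch. It is only a rough sketch using the Python standard library, not part of Grobid. It assumes that the header training file for `paper.pdf` is named `paper.training.header.tei.xml`; check this against your own output directory and adjust the suffix if it differs. The helper names `pdfs_missing_output` and `looks_encrypted` are made up for illustration, and the encryption check is a crude byte-level heuristic, not a real PDF parse.

```python
from pathlib import Path

# Assumed naming convention (verify against your own output directory):
# for "paper.pdf" the header training file is "paper.training.header.tei.xml".
HEADER_SUFFIX = ".training.header.tei.xml"


def pdfs_missing_output(input_dir, output_dir, suffix=HEADER_SUFFIX):
    """Return the sorted names of PDFs in input_dir that have no
    matching '<stem><suffix>' file in output_dir."""
    output_dir = Path(output_dir)
    missing = []
    for pdf in sorted(Path(input_dir).iterdir()):
        # Skip anything that is not a PDF file.
        if pdf.suffix.lower() != ".pdf":
            continue
        if not (output_dir / (pdf.stem + suffix)).exists():
            missing.append(pdf.name)
    return missing


def looks_encrypted(pdf_path):
    """Crude heuristic: an '/Encrypt' entry in the raw bytes usually
    indicates an encrypted PDF. False positives are possible."""
    return b"/Encrypt" in Path(pdf_path).read_bytes()
```

Calling `pdfs_missing_output("pdfs_in", "training_out")` (both directory names are placeholders) lists the PDFs without a header file; each of those can then be checked with `looks_encrypted`, opened in a viewer to see whether the text is selectable (image-only PDFs have no selectable text), or inspected through its segmentation training file to see whether a header zone was detected at all.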