kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Grobid batch in createTraining mode generates filenames incompatible with grobid-trainer #334

Open philgooch opened 6 years ago

philgooch commented 6 years ago

Using the latest 0.5.2 snapshot, I find that when running createTraining, files are created with the following pattern:

*.training.header.tei.xml, *.training.date.tei.xml, etc., as well as *.training.header

But grobid-trainer does not accept .training in the filenames and reports the error: no train data loaded. The .training part has to be removed from each filename before grobid-trainer recognises the files as valid training data.
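
For anyone hitting the same thing, here is a minimal sketch of the renaming workaround in Python. It only strips the .training part from filenames in one directory; the function names and the dry-run flag are hypothetical helpers of mine, not part of GROBID, and it assumes the training files all sit directly in that directory.

```python
from pathlib import Path


def strip_training_marker(name: str) -> str:
    """Remove the first '.training' segment from a filename."""
    return name.replace(".training", "", 1)


def rename_training_files(directory: str, dry_run: bool = True) -> list[tuple[str, str]]:
    """Rename files in `directory` whose names contain '.training'.

    Returns (old_name, new_name) pairs. With dry_run=True nothing is
    renamed, so the result can be reviewed first. Hypothetical helper,
    not a GROBID API.
    """
    renamed = []
    # Materialise the listing first so renames do not disturb iteration.
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file() or ".training" not in path.name:
            continue
        target = path.with_name(strip_training_marker(path.name))
        if target.exists():
            # Skip rather than overwrite an existing file of the same name.
            continue
        if not dry_run:
            path.rename(target)
        renamed.append((path.name, target.name))
    return renamed
```

Calling rename_training_files on the corpus directory with dry_run=True lists the planned renames without touching anything; passing dry_run=False applies them. Whether the renamed files then load depends on the model, so this is only a stopgap until the trainer itself is fixed.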

kermitt2 commented 6 years ago

Hello @philgooch !

Could you tell me when grobid-trainer reports this error - for which model? I think this is the thing to fix - there are plenty of training files with a .training in the file name currently for many models. Thanks !

philgooch commented 6 years ago

Hi Patrice, sure, I was training a header model with

java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-0.5.2-SNAPSHOT-onejar.jar 0 header -gH grobid-home

When I removed .training from the names of the training files, I no longer got the error and training worked fine. It seems that any file with .training in its name gets ignored: when I mixed in some files without .training in the filename, only those files were counted in the console under the nb train field.

dineshbs7 commented 5 years ago

Hi sir, I used the same command to train the citation model but got an error. It did not compile completely. What should I do?