kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Issue with training and evaluating the header module #1158

Closed by Alexr951 2 months ago

Alexr951 commented 2 months ago

Hello,

I have been trying to train the header model of Grobid using 81 test files based on PhD job market papers. The goal is to distinguish real authors from names listed in the acknowledgement section, so that I end up with a usable dataset. I am running Grobid on WSL (Ubuntu) with Java 17.0.11, using the current development version of Grobid (0.8.1-SNAPSHOT).

The problem occurs when I try to train and evaluate (segmentation ratio of 0.8): the training runs for approximately five to ten minutes, and then I get the error:

Exception in thread "main" org.grobid.core.exceptions.GrobidException: [GENERAL] An exception occurred while evaluating Grobid.

To be honest, I am unsure what is causing the issue. I have tried various things, including a clean install and changing Java versions, but I cannot fix it. I have also looked through my files to see whether any of them is formatted incorrectly, but I found nothing wrong. The output from Grobid is attached below.

Thank you in advance!

Grobid_output_error (1).txt

lfoppiano commented 2 months ago

Hi @Ranch951, I'll try to help you, and I have a few questions:

Alexr951 commented 2 months ago

Hi,

Thank you so much for the fast response. To answer the three questions:

lfoppiano commented 2 months ago

Could you share the output of the error when using -s 1?

This is guesswork on my part; I don't know what the problem might be. My speculation is that some of the training data are not in sync (raw<->tei), but I'm not sure.

Alexr951 commented 2 months ago

The output for -s 1 is attached below:

It is possible that the training data is not in sync, but I have tried multiple clean installs, taking great care to place my training files into the correct folders so as not to cause an error.

-s_1_Grobid_output.txt

lfoppiano commented 2 months ago

This error is OKish 😄 because it's complaining that the data is empty, which is expected since you selected -s 1.
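To illustrate why a ratio of 1 leaves nothing to evaluate, here is a minimal sketch of a ratio-based train/evaluation split. It is not Grobid's actual splitting code; the function `split_train_eval` and the file names are hypothetical, and the only figures taken from the thread are the 81 files and the ratios 0.8 and 1.

```python
import random


def split_train_eval(files, ratio, seed=0):
    """Split `files` into (train, evaluation), with `ratio` as the training share.

    Hypothetical helper for illustration only, not part of Grobid.
    """
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    cut = round(len(shuffled) * ratio)     # number of files assigned to training
    return shuffled[:cut], shuffled[cut:]


# Hypothetical names standing in for the 81 training files from the thread.
files = [f"paper_{i:02d}" for i in range(81)]

train, evaluation = split_train_eval(files, 0.8)
print(len(train), len(evaluation))      # 65 16

train_all, eval_none = split_train_eval(files, 1.0)
print(len(train_all), len(eval_none))   # 81 0
```

With a ratio of 1 every file lands in the training set, so an evaluator that receives the remaining files gets an empty list, which matches the "data is empty" complaint described above.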

At this point I need to see the data. If you could send me your training data, I will have a look. You can send it to luca AT sciencialab.com

lfoppiano commented 2 months ago

The issue was due to some misalignment between the tei.xml files and the feature files (the text files).
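A quick first check for this kind of problem is to confirm that every TEI file has a matching feature file and vice versa. The sketch below is an assumption-laden illustration, not a Grobid tool: the helper names, the directory layout (one folder of TEI files, one of feature files) and the suffix convention (`name.training.header.tei.xml` paired with `name.training.header`) are assumptions to adapt to your own corpus. It only detects missing pairs; the thread does not say what form the misalignment took, and differences inside paired files would need a closer comparison.

```python
import tempfile
from pathlib import Path


def stems(directory, suffix):
    """Return names of files in `directory` ending with `suffix`, suffix removed."""
    result = set()
    for path in Path(directory).iterdir():
        if path.is_file() and path.name.endswith(suffix):
            result.add(path.name[: len(path.name) - len(suffix)])
    return result


def find_unpaired(tei_dir, raw_dir, tei_suffix=".tei.xml", raw_suffix=""):
    """Return (stems lacking a feature file, stems lacking a TEI file).

    The suffix defaults are assumptions about the naming convention.
    """
    tei = stems(tei_dir, tei_suffix)
    raw = stems(raw_dir, raw_suffix)
    return sorted(tei - raw), sorted(raw - tei)


# Demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    tei_dir = Path(tmp, "tei")
    raw_dir = Path(tmp, "raw")
    tei_dir.mkdir()
    raw_dir.mkdir()
    (tei_dir / "a.training.header.tei.xml").write_text("<tei/>")
    (tei_dir / "b.training.header.tei.xml").write_text("<tei/>")
    (raw_dir / "a.training.header").write_text("features")

    missing_raw, missing_tei = find_unpaired(tei_dir, raw_dir)
    print(missing_raw)  # ['b.training.header']
    print(missing_tei)  # []
```

Pointing `find_unpaired` at the two folders of a real training corpus lists the files to inspect first before rerunning training and evaluation.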