kermitt2 / grobid

Machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Additional training data to extract more affiliations #337

Open nemobis opened 6 years ago

nemobis commented 6 years ago

I have a dataset on which the latest version seems to fail to extract any affiliation data in 40% of cases. (The dataset might be considered less palatable than average; it also returns many `org.grobid.core.exceptions.GrobidException: [NO_BLOCKS] PDF parsing resulted in empty content` and `Cannot detect language because of: com.cybozu.labs.langdetect.LangDetectException: no features in text` errors.)

Would it be useful for the project if I contributed more PDFs and affiliation training data (as described in https://grobid.readthedocs.io/en/latest/training/affiliation-address/), possibly with a focus on currently missing journals/publishers, to hopefully improve recall?
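As a side note, for triaging a large batch it can help to count how many of the failures are image-only PDFs versus language-detection problems. A minimal sketch (the helper function and its name are my own, not part of GROBID; the matched log lines are the ones quoted above):

```python
# Rough triage of GROBID log lines: distinguish image-only PDFs
# ([NO_BLOCKS]) from language-detection failures (LangDetectException).

def classify_grobid_failure(log_line):
    """Return a coarse failure category for a GROBID log line, or None."""
    if "[NO_BLOCKS]" in log_line:
        return "image_only_pdf"       # no text layer: needs OCR or a native-text version
    if "LangDetectException" in log_line:
        return "no_features_in_text"  # no usable text for language detection
    return None

logs = [
    "org.grobid.core.exceptions.GrobidException: [NO_BLOCKS] PDF parsing resulted in empty content",
    "Cannot detect language because of: com.cybozu.labs.langdetect.LangDetectException: no features in text",
    "INFO  processed header of document.pdf",
]
counts = {}
for line in logs:
    kind = classify_grobid_failure(line)
    if kind:
        counts[kind] = counts.get(kind, 0) + 1
print(counts)  # {'image_only_pdf': 1, 'no_features_in_text': 1}
```

The image-only bucket gives a quick upper bound on how much of the 40% can only be fixed by OCR rather than by more training data.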

kermitt2 commented 6 years ago

Hello @nemobis !

First comment: the `[NO_BLOCKS]` and `no features in text` problems visible in the logs normally mean that the text cannot be accessed in the PDF, so these PDFs are image-only. You would need to find a version of the PDF with "native" text inside, or run an OCR step before using GROBID on them.

Then indeed, the results for affiliation-address in unknown journal/conference layouts that are very different from those of the existing training data will be bad, as you report. The first solution is to add more training data and retrain the models to cover these new layouts.

To know which training data to add (which model to retrain), we need to see whether the whole affiliation-address block is missed by the header model (in which case we need to add more header examples), or whether the block is well recognized but the parsing of its fields fails (in which case we need to add training data for the affiliation-address parser).
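This triage can be done by inspecting the TEI that GROBID returns for the header: if no `<affiliation>` element appears at all, the header model missed the block; if it is present but carries no labelled sub-fields, the affiliation-address parser is the weak point. A rough sketch, assuming TEI output in the standard TEI namespace (the sample snippet is a minimal made-up example, not real GROBID output):

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def affiliation_status(tei_xml):
    """Classify a TEI result: 'missing' if no affiliation element at all,
    'unparsed' if one is present but has no labelled orgName sub-field,
    'parsed' otherwise."""
    root = ET.fromstring(tei_xml)
    affs = root.findall(".//tei:affiliation", TEI_NS)
    if not affs:
        return "missing"    # header model missed the block: add header training data
    for aff in affs:
        if aff.findall("tei:orgName", TEI_NS):
            return "parsed"
    return "unparsed"       # block found but not parsed: add affiliation-address data

sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <author><affiliation>
    <orgName type="institution">Example University</orgName>
  </affiliation></author>
</TEI>"""
print(affiliation_status(sample))  # parsed
```

Running this over a batch of outputs indicates which of the two models needs the new training examples.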

If it's OK for you, you could send me a sample of these PDFs, or links to the difficult ones, and I can have a look. For creating additional training data and retraining, the documentation offers annotation guidelines and detailed steps to follow.

nemobis commented 6 years ago

Patrice Lopez, 15/08/2018 15:01:

> First comment, the `[NO_BLOCKS]` and `no features in text` problems visible in the logs mean normally that the text cannot be accessed in the PDF. So these PDF are only image. You would need to find a version of the PDF with "native" text inside, or to preliminarly use an OCR before using GROBID on them.

Right. I suspected so, but I couldn't determine whether GROBID is supposed to do OCR or not. Does the OCR text need to be mapped to the image word by word (as e.g. OCRmyPDF attempts to do)?

> [...] If it's OK for you, you could send me some sample of these PDF or the links to these difficult PDF and I can have a look. [...]

Sure, I will. I'll try to select some relevant examples to avoid overwhelming you. It will probably take me a few days.

kermitt2 commented 6 years ago

> Right. I suspected so, but I couldn't determine whether grobid is supposed to do OCR or not.

No OCR in GROBID for the moment. Some OCR support is planned, but it will be limited to problematic glyphs in the input PDF (those whose UTF-8 code identification fails, which is unfortunately quite common). We have not considered adding full OCR to GROBID, because that would add a lot of complexity to the tool; I think this is better achieved by users integrating their own OCR tools into a workflow.

> Does the OCR text need to be mapped to the image word by word (as e.g. OCRmyPDF attempts to do)?

Yes, the OCR needs to add a text layer to the PDF that maps the text onto the page image, since the coordinates are exploited by GROBID's machine learning models. So OCRmyPDF is a good fit (it uses Tesseract, which produces really decent accuracy).
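A simple pre-processing step, assuming OCRmyPDF is installed and on `PATH`, could build one command per image-only PDF (file names are placeholders; `--skip-text` tells OCRmyPDF to leave pages that already carry a text layer untouched):

```python
import subprocess

def ocrmypdf_command(src, dst, language="eng"):
    """Build an OCRmyPDF invocation that adds a Tesseract text layer to src.
    --skip-text skips pages that already have text, so mixed PDFs are safe."""
    return ["ocrmypdf", "--skip-text", "--language", language, src, dst]

cmd = ocrmypdf_command("scan.pdf", "scan_with_text.pdf")
print(" ".join(cmd))
# To actually run it (requires ocrmypdf installed):
# subprocess.run(cmd, check=True)
```

The OCR'd output can then be fed to GROBID like any native-text PDF.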

> It will probably take me a few days,

No problem!