nemobis opened this issue 6 years ago
Hello @nemobis !
First comment: the [NO_BLOCKS] and "no features in text" problems visible in the logs normally mean that the text cannot be accessed in the PDF, so these PDFs are image-only. You would need to find a version of the PDF with "native" text inside, or to run an OCR step on them before using GROBID.
Then indeed, the results for affiliation-address in unknown journal/conference layouts that are very different from those in the existing training data will be bad, as you report. The first solution is to add more training data and retrain the models to cover these new layouts.
To know which training data to add (that is, which model to retrain), we need to see whether the whole affiliation-address block is missed by the header model, in which case we need more header examples, or whether the block is well recognized but the parsing of its fields fails, in which case we need more training data for the affiliation-address parser.
If it's OK for you, you could send me some samples of these PDFs, or links to these difficult PDFs, and I can have a look. For creating additional training data and retraining, the documentation offers annotation guidelines and detailed steps to follow.
Patrice Lopez, 15/08/2018 15:01:
First comment: the [NO_BLOCKS] and "no features in text" problems visible in the logs normally mean that the text cannot be accessed in the PDF, so these PDFs are image-only. You would need to find a version of the PDF with "native" text inside, or to run an OCR step on them before using GROBID.
Right. I suspected so, but I couldn't determine whether grobid is supposed to do OCR or not. Does the OCR text need to be mapped to the image word by word (as e.g. OCRmyPDF attempts to do)?
[...] If it's OK for you, you could send me some sample of these PDF or the links to these difficult PDF and I can have a look. [...]

Sure, I will. I'll try to select some relevant examples to avoid overwhelming you. It will probably take me a few days.
Right. I suspected so, but I couldn't determine whether grobid is supposed to do OCR or not.
No OCR in GROBID for the moment... Some OCR should normally come soon, but it will be limited to problematic glyphs in the input PDF (those whose UTF-8 code identification fails, which is quite common, unfortunately). We have not considered adding full OCR to GROBID, because that would add a lot of complexity to the tool, and I think this is better achieved by users integrating their own OCR tools into a workflow.
Does the OCR text need to be mapped to the image word by word (as e.g. OCRmyPDF attempts to do)?
Yes, the OCR needs to add a text layer to the PDF, mapping text to the PDF image, because the coordinates are exploited by GROBID's machine learning models. So OCRmyPDF is a good fit (and it uses Tesseract, which produces really decent accuracy).
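Such a preprocessing step could look like the sketch below, which shells out to OCRmyPDF; the wrapper functions are hypothetical, but `--skip-text` is OCRmyPDF's option for leaving pages that already have a text layer untouched, so mixed batches are safe to process:

```python
import subprocess

def ocr_command(src: str, dst: str, language: str = "eng") -> list:
    """Build an OCRmyPDF invocation that only OCRs pages lacking a text layer."""
    return ["ocrmypdf", "--skip-text", "--language", language, src, dst]

def add_text_layer(src: str, dst: str) -> None:
    """Run OCRmyPDF; the output PDF can then be fed to GROBID."""
    subprocess.run(ocr_command(src, dst), check=True)
```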
It will probably take me a few days.
No problem!
I have a dataset on which the latest version seems to fail to extract any affiliation data in 40 % of the cases. (The dataset might be considered less palatable than average; it also returns many

org.grobid.core.exceptions.GrobidException: [NO_BLOCKS] PDF parsing resulted in empty content

and

Cannot detect language because of: com.cybozu.labs.langdetect.LangDetectException: no features in text

errors.)

Is it useful for the project to contribute more PDFs and affiliation training data (as described in https://grobid.readthedocs.io/en/latest/training/affiliation-address/ ), possibly with a focus on currently missing journals/publishers, to hopefully improve recall?