kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Errors: [BAD_INPUT_DATA] PDF to XML conversion failed with error code: 134 and 139 #1120

Open RANN9 opened 4 months ago

RANN9 commented 4 months ago

Hi mighty developers

I am using GROBID in a research project where I need to extract text (processFulltextDocument) from company annual reports in PDF form. I know GROBID is designed for academic documents, but it processes most of my documents very well. The problem is that about 30% of my whole document set (around 1000 PDFs) fails with one of these errors: [BAD_INPUT_DATA] 134, [BAD_INPUT_DATA] 139, or [GENERAL] An exception occurred while running Grobid. Other documents that look very similar to the failing ones are processed without problems. I have uploaded a few examples for each error code. Are there any workarounds or solutions for these errors? Thanks!

Examples with error code:

Environment:

The same error codes appear with both a local GROBID service and HuggingFace.
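For readers triaging similar failures, here is a minimal sketch of how the two numeric codes could be interpreted. It assumes (this is not confirmed in the thread) that the number in the message is the exit status of the external PDF-to-XML process and that it follows the common POSIX shell convention of 128 plus the signal number. Under that assumption, 134 would correspond to SIGABRT and 139 to SIGSEGV on Linux. The helper names `decode_exit_code` and `explain` are hypothetical and not part of GROBID.

```python
import re
from typing import Optional

# Linux signal numbers for signals commonly seen when a native helper dies.
# The table is hardcoded because signal numbering differs between platforms.
LINUX_SIGNALS = {6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV"}

# Matches the numeric code in a message such as
# "PDF to XML conversion failed with error code: 134".
ERROR_CODE_RE = re.compile(r"error code:\s*(\d+)")


def decode_exit_code(code: int) -> Optional[str]:
    """Return the signal implied by a 128+N exit status, or None.

    Assumption: the code is a process exit status following the
    128 + signal-number convention used by POSIX shells.
    """
    if code > 128:
        number = code - 128
        return LINUX_SIGNALS.get(number, f"signal {number}")
    return None


def explain(message: str) -> Optional[str]:
    """Extract the error code from a message and decode it, if present."""
    match = ERROR_CODE_RE.search(message)
    if match is None:
        return None
    return decode_exit_code(int(match.group(1)))


if __name__ == "__main__":
    for code in (134, 139):
        msg = f"[BAD_INPUT_DATA] PDF to XML conversion failed with error code: {code}"
        print(code, explain(msg))
```

If the assumption holds, both codes point to a crash inside the native conversion step rather than a Java exception, which is one reason to check the resources available to that process.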

lfoppiano commented 4 months ago

@RANN9 thanks a lot for the report. I will look into it in the coming weeks.

lfoppiano commented 4 months ago

@RANN9 How much memory are you allocating to the Docker container and to the JVM?

RANN9 commented 4 months ago

Hi @lfoppiano thanks for getting back to me.