kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

PDF to XML conversion error #1033

Open SebastianFeltl opened 1 year ago

SebastianFeltl commented 1 year ago

We use GROBID in our Service to convert PDFs to XML and to then process them further.

All of the tested PDFs worked except one (Klug Hahn 2021 - Conversational Interfaces and Digital Empathy.pdf).

We then tried an online demo of GROBID that we found in an old issue, and we got this error message: Error Message

We don't know why this specific PDF doesn't work and would like to request your help in figuring out the problem.

We are using Ubuntu 20.04.6 LTS for our Service, but since the online demo throws the same error, I don't know if you need any system information from us.

kermitt2 commented 1 year ago

Hi @SebastianFeltl !

Nice & surprising error case, thank you.

pdfalto (our library for parsing PDFs) crashes because of the annotations in this PDF. Even more surprising, I don't see any annotations in the PDF.

I'll open an issue in the pdfalto repo.

The following crashes:

./grobid-home/pdfalto/lin-64/pdfalto_server -fullFontName -noLineNumbers -noImage -annotation -filesLimit 2000 ~/Downloads/Klug.Hahn.2021.-.Conversational.Interfaces.and.Digital.Empathy.pdf --timeout 120
Segmentation fault (core dumped)

This works fine:

./grobid-home/pdfalto/lin-64/pdfalto_server -fullFontName -noLineNumbers -noImage -filesLimit 2000 ~/Downloads/Klug.Hahn.2021.-.Conversational.Interfaces.and.Digital.Empathy.pdf --timeout 120
ll ~/Downloads/Klug.Hahn.2021.-.Conversational.Interfaces.and.Digital.Empathy.xml
-rw-rw-r-- 1 lopez lopez 869K Jun 23 19:30 /home/lopez/Downloads/Klug.Hahn.2021.-.Conversational.Interfaces.and.Digital.Empathy.xml
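
To check other PDFs for the same problem, the two runs above can be scripted. This is only a rough sketch, not part of GROBID or pdfalto: the helper names (`build_command`, `describe_exit`, `compare`) are made up for illustration, and the flags and their order are copied from the two commands in this thread. It assumes a POSIX system, where a segfault shows up either as a negative return code or, when a shell wrapper is involved, as 128 plus the signal number.

```python
import signal
import subprocess


def build_command(pdfalto, pdf, with_annotation):
    """Build the pdfalto_server call used above, with or without -annotation."""
    flags = ["-fullFontName", "-noLineNumbers", "-noImage"]
    if with_annotation:
        # The only difference between the crashing and the working command.
        flags.append("-annotation")
    flags += ["-filesLimit", "2000"]
    return [str(pdfalto), *flags, str(pdf), "--timeout", "120"]


def describe_exit(returncode):
    """Turn a process return code into a short human-readable label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # subprocess reports "killed by signal N" as -N on POSIX.
        signum = -returncode
    elif returncode > 128:
        # Shell convention: 128 + N when a child died from signal N.
        signum = returncode - 128
    else:
        return f"exit code {returncode}"
    try:
        return f"killed by {signal.Signals(signum).name}"
    except ValueError:
        # Not a known signal number, so report the raw code instead.
        return f"exit code {returncode}"


def compare(pdfalto, pdf):
    """Run pdfalto with and without -annotation and report both outcomes."""
    results = {}
    for with_annotation in (True, False):
        proc = subprocess.run(
            build_command(pdfalto, pdf, with_annotation),
            capture_output=True,
            text=True,
        )
        results[with_annotation] = describe_exit(proc.returncode)
    return results
```

Calling `compare("./grobid-home/pdfalto/lin-64/pdfalto_server", "some.pdf")` returns a dict keyed by whether `-annotation` was passed. If only the `True` entry reports a signal, the PDF is likely hitting the same annotation-related crash.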