kermitt2 / pdfalto

PDF to XML ALTO file converter
GNU General Public License v2.0

PDF-to-XML conversion times out for some files in server mode, but running the same pdfalto_server command in a shell is fast and succeeds. #150

Closed: elonzh closed this issue 2 years ago

elonzh commented 2 years ago

Test file:

0cd4f6ed-0006-4b1a-a71d-0db0670050e0.pdf

Service output

org.grobid.core.process.ProcessPdfToXml: pdfalto process finished with error code: 143. 
[
  \\wsl$\Ubuntu\home\elonzh\workspace\grobid\grobid-home\pdfalto\lin-64\pdfalto_server, 
  -fullFontName, 
  -noLineNumbers, 
  -noImage, 
  -annotation, 
  -filesLimit, 2000, 
  /home/elonzh/workspace/grobid/grobid-home/tmp/origin11089783328378169294.pdf, 
  /home/elonzh/workspace/grobid/grobid-home/tmp/QCdS7b3245.lxml
]

Syntax Error (308820): Unknown operator 'd<1d><bd><fc><b2><13><0b><8c>'
Syntax Error (308880): Unknown operator 'D<05><9b>n#<ab><a9><bc><ec><97><af><18><e5><f6><f9>|<b8><bd><df><a5><eb><9a>r@<d7><c7><e7>'
Syntax Error (308881): Illegal character '>'
Syntax Error (308888): Illegal character '>'
Syntax Error (308896): Illegal character <dd> in hex string
...
Syntax Error (319522): Illegal character <fe> in hex string
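
A note on the exit code in the service output: by the common POSIX convention, a status above 128 means the process was ended by a signal, and 143 = 128 + 15 corresponds to SIGTERM. That would be consistent with the service stopping pdfalto when a timeout expired, rather than pdfalto exiting on its own because of the syntax errors. A minimal sketch for decoding such a status, assuming that convention applies here (the helper name `describe_exit_code` is hypothetical, not part of GROBID or pdfalto):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret an exit status using the common convention 128 + signal number."""
    if code == 0:
        return "success"
    if code > 128:
        signum = code - 128
        try:
            # signal.Signals maps a signal number to its symbolic name.
            return f"terminated by {signal.Signals(signum).name}"
        except ValueError:
            return f"terminated by unknown signal {signum}"
    return f"exited with error status {code}"


# 143 is the code reported by ProcessPdfToXml in the service output.
print(describe_exit_code(143))
```

On Linux this reports SIGTERM for 143, which points at the caller ending the process rather than at a crash inside the converter.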

Shell output

$ ./grobid-home/pdfalto/lin-64/pdfalto_server \
    -fullFontName \
    -noLineNumbers \
    -noImage \
    -annotation \
    -filesLimit 2000 \
    0cd4f6ed-0006-4b1a-a71d-0db0670050e0.pdf

Syntax Error (308820): Unknown operator 'd<1d><bd><fc><b2><13><0b><8c>'
Syntax Error (308880): Unknown operator 'D<05><9b>n#<ab><a9><bc><ec><97><af><18><e5><f6><f9>|<b8><bd><df><a5><eb><9a>r@<d7><c7><e7>'
Syntax Error (308881): Illegal character '>'
Syntax Error (308888): Illegal character '>'
Syntax Error (308896): Illegal character <dd> in hex string
...
Syntax Error (319522): Illegal character <fe> in hex string
...
Syntax Error (320317): Illegal character <9a> in hex string
Syntax Error: Unterminated string
Syntax Error: End of file inside array
Syntax Error: Leftover args in content stream
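
To compare the shell run and the service run on equal terms, it can help to time the same command under an explicit timeout. The following is only a sketch: the 30-second limit is a placeholder and not GROBID's configured value, `run_with_timeout` is a hypothetical helper, and the paths in `PDFALTO_CMD` are copied from the shell run above and will differ per checkout.

```python
import subprocess
import sys
import time

# Command copied from the shell run in this issue; adjust paths to your checkout.
PDFALTO_CMD = [
    "./grobid-home/pdfalto/lin-64/pdfalto_server",
    "-fullFontName", "-noLineNumbers", "-noImage", "-annotation",
    "-filesLimit", "2000",
    "0cd4f6ed-0006-4b1a-a71d-0db0670050e0.pdf",
]


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (exit_code, seconds); exit_code is None on timeout."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,  # discard the long stream of syntax errors
            timeout=timeout_s,
        )
        code = proc.returncode
    except subprocess.TimeoutExpired:
        code = None  # subprocess.run has already killed the child at this point
    return code, time.monotonic() - start


# Stand-in command so the sketch runs anywhere; swap in PDFALTO_CMD locally.
code, elapsed = run_with_timeout([sys.executable, "-c", "pass"], timeout_s=30)
print(f"exit code {code} after {elapsed:.2f}s")
```

If the real command finishes well inside the limit here but is still terminated when launched by the service, the difference is more likely in how the service runs the process than in pdfalto's handling of the file.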

kermitt2 commented 2 years ago

Hi @elonzh!

I think this is a GROBID-related issue that I fixed a few weeks ago; see https://github.com/kermitt2/grobid/issues/923#issuecomment-1161971179