Levalife closed this issue 1 year ago
Hello @Levalife
This error means that parsing the PDF with pdfalto failed. Often it means that the PDF is unparsable, ill-formed, or corrupted. It could also mean that the pdfalto process used too much memory (the maximum is defined in the grobid config file) and was killed to protect the grobid server.
Apparently these PDFs are corrupted, judging from the warning message that precedes the error:
Syntax Warning: PDF file is damaged
From experience, we always see some PDFs failing like this when they come from the wild internet.
@kermitt2 I changed the focus of the issue a little bit. The problem is with old PDFs that were parsed successfully before and are now randomly failing with "The write operation timed out"
Hmm, what is producing this timeout message, "The write operation timed out"? Which client are you using to query the Grobid server? If the timeout comes from your client, could you increase its value?
In the Grobid service, if the PDF parsing reaches a timeout, the error message would be something like "PDF to XML conversion timed out", returned with an HTTP 500 error.
If you can share one of these failing PDFs, I could try to reproduce the problem.
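To make the distinction above concrete, here is a minimal Python sketch that labels a failed call as a client-side timeout or a Grobid-side PDF conversion timeout. The function name `classify_failure` and its labels are hypothetical and not part of Grobid or any client library; the only strings taken from this thread are the two error messages. The exact wording of the server message may differ ("something like"), so treat the marker as an assumption.

```python
from typing import Optional

# Server-side message quoted in this thread (returned with HTTP 500).
# Assumption: the real message contains this text; adjust if it differs.
SERVER_TIMEOUT_MARKER = "PDF to XML conversion timed out"


def classify_failure(status_code: Optional[int] = None,
                     body: str = "",
                     client_error: Optional[BaseException] = None) -> str:
    """Return a coarse label for where a failed request most likely broke."""
    if client_error is not None:
        # An exception was raised in the client, so no usable HTTP response
        # exists. A "timed out" message points at the client's own timeout.
        if "timed out" in str(client_error).lower():
            return "client-timeout"
        return "client-error"
    if status_code == 500 and SERVER_TIMEOUT_MARKER in body:
        # Grobid answered, but its PDF parsing step hit the server timeout.
        return "server-pdf-conversion-timeout"
    if status_code is not None and status_code >= 400:
        return "server-error"
    return "ok"


if __name__ == "__main__":
    err = TimeoutError("The write operation timed out")
    print(classify_failure(client_error=err))
    print(classify_failure(500, "PDF to XML conversion timed out"))
```

If the label is a client timeout, the value to raise is the client's timeout setting; if it is the server-side conversion timeout, the PDF itself or the server limits are the place to look.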
It was a strange glitch in our Python client app that was resolved after another branch was merged. Thank you for your time!
I'm using Ubuntu 22.04.2 LTS and lfoppiano/grobid:0.7.3
java --version: openjdk 15.0.10 2023-01-17
Everything worked well until yesterday. Nothing was changed. It started to fail on some test PDFs with errors:
From the application's point of view it looks like
reason=ConnectionError(ProtocolError('Connection aborted.', timeout('The write operation timed out')
Sometimes it fails on several PDFs; after a service restart it always fails on at least one PDF. It doesn't look like the server is overloaded. What could be wrong here?