kermitt2 / grobid

Machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Timeout/Error when processing text books #642

Open KnowledgeGarden opened 4 years ago

KnowledgeGarden commented 4 years ago

Latest master 0.6.2 running on a Xeon 32 GB 1U Ubuntu 18.04 server. Fed it a 374 MB biology textbook from the Java client on a MacBook Pro. It finished once and produced a 1.7 MB XML file that looks clean. Fed it 3 more (smaller) textbooks. Crashed. Restarted and fed it all 4 books, one at a time. Crashed on each, including the biology textbook it had completed before. Set the client to 1 thread. No change. Did not experiment with settings on the server. A gist of the failure is here

Brief summary:

```
ERROR [2020-09-26 17:40:37,587] org.grobid.core.process.ProcessPdfToXml: pdftoxml process finished with error code: 143. [/home/chief/projects/grobid-installation/grobid-home/pdf2xml/lin-64/pdfalto_server, -noImageInline, -fullFontName, -noLineNumbers, -noImage, -annotation, -filesLimit, 2000, /home/chief/projects/grobid-installation/grobid-home/tmp/origin4205259084261240168.pdf, /home/chief/projects/grobid-installation/grobid-home/tmp/e4XfWM2BZM.lxml]
ERROR [2020-09-26 17:40:37,587] org.grobid.core.process.ProcessPdfToXml: pdftoxml return message:
```

```
ERROR [2020-09-26 17:40:37,590] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs.
! org.grobid.core.exceptions.GrobidException: [TIMEOUT] PDF to XML conversion timed out
! at org.grobid.core.document.DocumentSource.processPdfToXmlServerMode(DocumentSource.java:237)
```

KnowledgeGarden commented 4 years ago

I have some evidence that this is a memory issue. I just ran 6 smaller PDF reports in batch mode and it ran fine. The confusing aspect of this issue is that it did, in fact, process a 374 MB PDF just fine, just once, and never again.

kermitt2 commented 4 years ago

Hello @KnowledgeGarden !

There is no crash; the PDF parsing part (done by an external process with pdfalto) is just timing out. It's actually a protection to avoid crashes and keep the system up.

Grobid is currently designed for independent articles, chapters, short reports, and similarly short documents, not books, full proceedings, or PhD theses. There was an effort to support books with an additional model (to segment the book/proceedings), but it is not currently progressing.

You should be able to process the fat document by increasing the timeout (see grobid/grobid-home/config/grobid.properties; it's probably not necessary, but you can increase the memory limit for the PDF parsing too):

```properties
grobid.3rdparty.pdf2xml.memory.limit.mb=6096
grobid.3rdparty.pdf2xml.timeout.sec=60
```

However, it won't look good because it's a textbook and the areas will be messed up.

KnowledgeGarden commented 4 years ago

Thanks very much! I raised the mb to 16384 and the sec to 300. It failed again, but with no [TIMEOUT] anywhere to be found; the test was on the very same 374 MB PDF that it had successfully read once before.

The gist is here

Summary:

```
INFO  [2020-09-26 18:40:33,278] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in pool active/max: 1/10
ERROR [2020-09-26 18:40:59,174] org.grobid.core.process.ProcessPdfToXml: pdftoxml process finished with error code: 143. [/home/chief/projects/grobid-installation/grobid-home/pdf2xml/lin-64/pdfalto_server, -noImageInline, -fullFontName, -noLineNumbers, -noImage, -annotation, -filesLimit, 2000, /home/chief/projects/grobid-installation/grobid-home/tmp/origin229373444390319906.pdf, /home/chief/projects/grobid-installation/grobid-home/tmp/ZzevrbmVzB.lxml]
ERROR [2020-09-26 18:40:59,174] org.grobid.core.process.ProcessPdfToXml: pdftoxml return message:
```

```
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
```

KnowledgeGarden commented 4 years ago

I'd be happy to compress and upload the completed textbook if you're interested.

kermitt2 commented 4 years ago

Thank you! Error 143 from the external process can mean a bit of anything (it means the OS has killed the process to avoid something bad :).

So yes, at this stage, having the textbook PDF will help to understand the problem at the level of pdfalto.
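As a side note on that exit code: on POSIX systems, an exit status above 128 conventionally means the process was killed by signal (status - 128), so 143 = 128 + 15 is SIGTERM, which fits a watchdog terminating pdfalto rather than a crash inside it. A minimal sketch of the convention (not Grobid code, just an illustration):

```python
import signal

def decode_exit_code(code: int) -> str:
    """Decode a shell-style exit status.

    Statuses above 128 conventionally encode "killed by signal (code - 128)";
    143 therefore means SIGTERM, i.e. a polite kill from the OS or a watchdog.
    """
    if code > 128:
        sig = signal.Signals(code - 128)
        return f"terminated by {sig.name}"
    return f"exited normally with status {code}"

print(decode_exit_code(143))  # pdfalto's status in the logs above
print(decode_exit_code(0))
```

Running it shows 143 decoding to SIGTERM, consistent with Grobid's protection mechanism (or the OS) stopping the pdfalto process.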

KnowledgeGarden commented 4 years ago

xbiologyConcepts-lr.tei.xml.zip

KnowledgeGarden commented 4 years ago

The original textbook is the PDF on this page

KnowledgeGarden commented 4 years ago

Side note: it read an 800 MB textbook, but not without this error message:

```
ERROR [2020-09-26 19:31:53,151] org.grobid.core.document.Document: Cannot parse file: /home/chief/projects/grobid-installation/grobid-home/tmp/WQ356GGngN.lxml_annot.xml
```

kermitt2 commented 4 years ago

Thank you @KnowledgeGarden

With 1447 pages and a 374 MB PDF, this is clearly not the kind of document Grobid can structure at the moment; Grobid is basically tuned to fail safely on this kind of PDF rather than to run it entirely.

You will have a better chance of getting it processed in batch mode, which is single-threaded, than with the server, because the server has additional safety mechanisms to stop the process (to keep the server safe and able to continue processing more documents).
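For reference, a batch invocation looks roughly like this. The jar version and all paths here are placeholders, not taken from this installation; check the Grobid batch documentation for the exact flags:

```shell
# Sketch only: adjust the jar version and paths to your installation.
# Batch mode runs in-process (single-threaded) and avoids the server-side
# watchdog that kills long-running pdfalto conversions.
java -Xmx12G -jar grobid-core-0.6.2-onejar.jar \
     -gH /path/to/grobid-home \
     -dIn /path/to/input/pdfs \
     -dOut /path/to/output \
     -exe processFullText
```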

Having said that, when running pdfalto directly, there's no error and it takes only 14.5 s:

```
$ time ~/grobid/grobid-home/pdf2xml/lin-64/pdfalto -noImageInline -fullFontName -noLineNumbers -noImage -annotation -filesLimit 2000 /home/lopez/Downloads/Biology2e-WEB_ICOFkGu.pdf ~/tmp/ZzevrbmVzB.lxml

real    0m14.568s
```

So it's really the Grobid calling process that stops/kills the PDF parsing out of safety/paranoia. I think it makes sense to revisit this part only when a book-level model is available and we start supporting full "monographs".

(By the way, what a really great text book!)

```
ERROR [2020-09-26 19:31:53,151] org.grobid.core.document.Document: Cannot parse file: /home/chief/projects/grobid-installation/grobid-home/tmp/WQ356GGngN.lxml_annot.xml
```

There's some not-well-formed XML generated by pdfalto for the files capturing the outline and annotations in the PDF (there's all sorts of PDF dirt in there), but it won't impact the rest of the main content. It will be fixed in pdfalto at some point in the future :/

KnowledgeGarden commented 4 years ago

Thank you @kermitt2 !!! I ran it in batch (command-line) mode, raised the memory in the config to 12 GB, and ended up adding a 0 to max tokens (it ran out). It gave a marvelously clean XML file, so clean that there was no content in the block. This event occurred in the console:

```
SEVERE: Cannot parse file: /Users/jackpark/Documents/gitprojects5/grobid/grobid-home/tmp/j8q2dxn7WB.lxml_annot.xml
```

This run is on my 16 GB MacBook Pro. I shall next try building pdfalto and see how that works.

My goal is to populate my OpenSherlock machine reading platform with textbooks before diving into publications. I hope that Grobid will rise to the occasion.

KnowledgeGarden commented 4 years ago

Sigh. pdfalto works, but it really feels like I'd be better off just exporting those PDFs to plain text and reading that. There are plenty of libraries to split out paragraphs and sentences; I don't need fonts, positions, and all that other data. pdfalto really shows off the power of those models in Grobid. I'll just have to wait for it to handle files > 100 MB; it seems to work on files smaller than that.
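For what it's worth, the plain-text route can be sketched with nothing but the standard library. This is naive regex splitting, not one of the dedicated sentence-splitting libraries mentioned above, which handle abbreviations, initials, and other edge cases far better:

```python
import re

def split_paragraphs(text: str) -> list[str]:
    # Paragraphs are runs of text separated by one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph: str) -> list[str]:
    # Very naive: split after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

sample = "Cells divide by mitosis. Meiosis is different.\n\nA new paragraph begins."
paras = split_paragraphs(sample)
print(paras)                      # two paragraphs
print(split_sentences(paras[0]))  # two sentences
```

This trades away everything Grobid's models recover (section structure, references, headers vs. body text) for robustness on arbitrarily large inputs.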