kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

error 500 #862

Open shainaraza opened 2 years ago

shainaraza commented 2 years ago

I even installed version 0.7.0, but I still get this error:

Error encountered while requesting the server. Response 500: - The PDF document cannot be annotated. Please check the server logs.

Any help?

kermitt2 commented 2 years ago

Hello @shainaraza !

The server logs are under grobid/logs/grobid-service.log.

You should find the issue in the logs, or you could attach it here so that we can have a look.
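For anyone scanning a long log by hand, here is a minimal Python sketch that counts the bracketed GrobidException codes in a log. It assumes the log lines look like the `GrobidException: [NO_BLOCKS] ...` excerpts quoted later in this thread; the function names are hypothetical and not part of Grobid.

```python
import re
from collections import Counter

# Matches lines such as:
# ! org.grobid.core.exceptions.GrobidException: [NO_BLOCKS] PDF parsing resulted in empty content
# The pattern is an assumption based on the log excerpts quoted in this thread.
GROBID_EXCEPTION = re.compile(r"GrobidException: \[(?P<code>[A-Z_]+)\]")


def summarize_grobid_errors(log_text):
    """Count the GrobidException codes found in the text of a service log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = GROBID_EXCEPTION.search(line)
        if match:
            counts[match.group("code")] += 1
    return counts


def summarize_log_file(path):
    """Read a log file and summarize it; undecodable bytes are replaced."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return summarize_grobid_errors(handle.read())
```

Calling `summarize_log_file("grobid/logs/grobid-service.log")` (adjust the path to your installation) returns a `Counter` keyed by code, which gives a quick hint about which kind of failure dominates before reading the full stack traces.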

shainaraza commented 2 years ago

Is it a version conflict? I am just using OpenJDK 8. Here is the log: grobid-service.log

shainaraza commented 2 years ago

Even when I use the example file that comes with it, I get the same error.

kermitt2 commented 2 years ago

Thanks for the log file.

Windows is no longer supported as of Grobid 0.7.0: pdfalto has not been recompiled for 64-bit Windows (lack of time/skills), and according to the log this is the problem.

Could you use the Docker image to run Grobid on Windows?

shainaraza commented 2 years ago

No, I don't have a Docker image. Is there any alternative?

shainaraza commented 2 years ago

I tried using a Colab environment, but it does not allow localhost.

kermitt2 commented 2 years ago

If you can't install Docker on Windows, I think some people have managed to get Grobid working in the Windows Subsystem for Linux, but I know very little about it.

shainaraza commented 2 years ago

Hi @kermitt2, I am now on a Linux system with everything set up, but I still get the same error 500. Please see the log: grobid-service.log

kermitt2 commented 2 years ago

According to the logs, the PDF is either empty or an image-only PDF:

! org.grobid.core.exceptions.GrobidException: [NO_BLOCKS] PDF parsing resulted in empty content

Does this PDF have a selectable text layer?

shainaraza commented 2 years ago

@kermitt2 thank you for replying. I don't know what a selectable text layer is. In my case, on both Windows and Linux, localhost:8070 works and lets me upload a file, but I get a 500 error after that. Please advise.

kermitt2 commented 2 years ago

If you open the PDF in a PDF viewer, can you select the text? You can also run the pdftotext command-line tool on Linux to check whether it returns any text. If there is no text, a preliminary OCR step is required.
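The pdftotext check can also be scripted. Below is a rough Python sketch, assuming the `pdftotext` tool is installed and on the PATH; the helper names and the `min_chars` threshold are arbitrary choices made for illustration, and this is not how Grobid or pdfalto decides that a PDF is empty.

```python
import subprocess


def extract_text(pdf_path):
    """Run `pdftotext <pdf> -` and return what it writes to standard output."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the extracted text to stdout
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")


def has_text_layer(extracted_text, min_chars=20):
    """Heuristic: enough non-whitespace characters suggests a text layer.

    The default threshold of 20 is an assumption, not a documented value.
    """
    visible = "".join(extracted_text.split())  # drops spaces, newlines, form feeds
    return len(visible) >= min_chars
```

For example, `has_text_layer(extract_text("paper.pdf"))` returning `False` would suggest a scanned, image-only document that needs OCR before Grobid can process it.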

shainaraza commented 2 years ago

Yes, I can definitely select the text. I also use Tika and pdftotext; the reason I want to use Grobid is the structure and metadata that it returns. So, in short, is the issue with the PDF file?

kermitt2 commented 2 years ago

The error message about an empty PDF is normally reliable because it comes from pdfalto.

Could you share the PDF here, or by email if that's easier (see my email in the README)?

AaronNGray commented 1 year ago

The error is caused by the -noLineNumbers and the --timeout XXX at the end of the command.

The fix is here: https://github.com/AaronNGray/grobid/commit/60d46b4948e221e24d749a758ee89c34b80cd81d

Note: there is at least one other bug!

AaronNGray commented 1 year ago

@kermitt2 Can you please have a look at the next bug:

An instance variable, PDFALTOSaxHandler.block, is being used without being assigned in the qName.equals("String") case of startElement().

https://github.com/kermitt2/grobid/blob/cb39daccf1db20863f79dc6ab383262556d531d2/grobid-core/src/main/java/org/grobid/core/sax/PDFALTOSaxHandler.java#L446

ERROR [2023-04-30 17:55:59,536] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs.
! java.lang.NullPointerException: null
! at org.grobid.core.sax.PDFALTOSaxHandler.startElement(PDFALTOSaxHandler.java:446)
! at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source)
! at org.apache.xerces.parsers.AbstractXMLDocumentParser.emptyElement(Unknown Source)
! at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown Source)
! at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
! at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
! at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
! at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
! at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
! at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
! at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
! at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
! at org.grobid.core.document.Document.parseInputStream(Document.java:308)
! at org.grobid.core.document.Document.parseInputStream(Document.java:313)
! at org.grobid.core.document.Document.addTokenizedDocument(Document.java:353)
! ... 73 common frames omitted
! Causing: org.grobid.core.exceptions.GrobidException: [PARSING_ERROR] Cannot parse file: C:\Users\aaron\GitHub\grobid\grobid-home\tmp\bb2K5exRr9.lxml
! at org.grobid.core.document.Document.addTokenizedDocument(Document.java:365)
! at org.grobid.core.engines.Segmentation.processing(Segmentation.java:95)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:132)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:111)
! at org.grobid.core.engines.Engine.fullTextToTEIDoc(Engine.java:507)
! at org.grobid.core.engines.Engine.fullTextToTEI(Engine.java:497)
! at org.grobid.service.process.GrobidRestProcessFiles.processFulltextDocument(GrobidRestProcessFiles.java:208)
! at org.grobid.service.GrobidRestService.processFulltext(GrobidRestService.java:268)
! at org.grobid.service.GrobidRestService.processFulltextDocument_post(GrobidRestService.java:220)

It looks like the handling of block needs some revising: https://github.com/kermitt2/grobid/blob/cb39daccf1db20863f79dc6ab383262556d531d2/grobid-core/src/main/java/org/grobid/core/sax/PDFALTOSaxHandler.java#L46
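To illustrate the failure pattern only (this is not Grobid's code, and not necessarily the right fix for PDFALTOSaxHandler), here is a small Python SAX handler in which a `String` element can arrive before any `TextBlock` has assigned the current block. The class name, the `parse_blocks` helper, and the choice to open an implicit block are all assumptions made for the sketch; `TextBlock`, `String`, and `CONTENT` follow ALTO naming.

```python
import xml.sax


class AltoSketchHandler(xml.sax.ContentHandler):
    """Hypothetical handler that collects String tokens per TextBlock."""

    def __init__(self):
        super().__init__()
        self.block = None  # stays None until a TextBlock element starts
        self.blocks = []

    def startElement(self, name, attrs):
        if name == "TextBlock":
            self.block = []
            self.blocks.append(self.block)
        elif name == "String":
            if self.block is None:
                # Guard: without this check, using self.block here would fail,
                # the Python analogue of the NullPointerException in the trace.
                self.block = []
                self.blocks.append(self.block)
            self.block.append(attrs.get("CONTENT", ""))

    def endElement(self, name):
        if name == "TextBlock":
            self.block = None  # leave the block; the next String needs a new one


def parse_blocks(xml_text):
    """Parse an ALTO-like XML string and return the tokens grouped by block."""
    handler = AltoSketchHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.blocks
```

Opening an implicit block is one possible policy; skipping the orphan token or raising a descriptive error are alternatives, and which one suits Grobid depends on how downstream code consumes blocks.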

Siedlerchr commented 1 year ago

Still getting the bug in version 0.7.3, in Docker:

ERROR [2023-10-13 10:39:20,247] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs.
! org.grobid.core.exceptions.GrobidException: [NO_BLOCKS] PDF parsing resulted in empty content
! at org.grobid.core.document.Document.addTokenizedDocument(Document.java:417)
! at org.grobid.core.engines.Segmentation.processing(Segmentation.java:95)
! at org.grobid.core.engines.HeaderParser.processing(HeaderParser.java:81)
! at org.grobid.core.engines.Engine.processHeader(Engine.java:417)
! at org.grobid.core.engines.Engine.processHeader(Engine.java:385)
! at org.grobid.service.process.GrobidRestProcessFiles.processStatelessHeaderDocument(GrobidRestProcessFiles.java:99)
! at org.grobid.service.GrobidRestService.processHeaderDocumentReturnBibTeX_post(GrobidRestService.java:187)
! at jdk.internal.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
! at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
! at java.base/java.lang.reflect.Method.invoke(Method.java:568)
! at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161)
! at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:160)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
! at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:326)

kermitt2 commented 1 year ago

Hello @Siedlerchr

Can you share the PDF to reproduce the problem?