kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Grobid started to fail on some test PDFs with "Connection aborted" #1047

Closed (Levalife closed 1 year ago)

Levalife commented 1 year ago

I'm using Ubuntu 22.04.2 LTS and lfoppiano/grobid:0.7.3

java --version openjdk 15.0.10 2023-01-17

Everything worked well until yesterday, and nothing was changed. It then started to fail on some test PDFs with these errors:

10.0.0.2 - - [29/Aug/2023:10:25:15 +0000] "POST /api/processAffiliations HTTP/1.1" 200 358 "-" "python-requests/2.23.0" 5
10.0.0.2 - - [29/Aug/2023:10:25:15 +0000] "POST /api/processAffiliations HTTP/1.1" 200 95 "-" "python-requests/2.23.0" 3
10.0.0.2 - - [29/Aug/2023:10:25:15 +0000] "POST /api/processAffiliations HTTP/1.1" 200 211 "-" "python-requests/2.23.0" 4
10.0.0.2 - - [29/Aug/2023:10:25:15 +0000] "POST /api/processAffiliations HTTP/1.1" 200 363 "-" "python-requests/2.23.0" 4
10.0.0.2 - - [29/Aug/2023:10:25:15 +0000] "POST /api/processCitation HTTP/1.1" 200 473 "-" "python-requests/2.23.0" 5
10.0.0.2 - - [29/Aug/2023:10:25:15 +0000] "POST /api/processCitation HTTP/1.1" 200 518 "-" "python-requests/2.23.0" 8
WARN  [2023-08-29 10:25:16,171] org.grobid.core.process.ProcessPdfToXml: pdfalto stderr: Syntax Warning: May not be a PDF file (continuing anyway)
WARN  [2023-08-29 10:25:16,171] org.grobid.core.process.ProcessPdfToXml: pdfalto stderr: Syntax Error: Couldn't read xref table
WARN  [2023-08-29 10:25:16,171] org.grobid.core.process.ProcessPdfToXml: pdfalto stderr: Syntax Warning: PDF file is damaged - attempting to reconstruct xref table...
WARN  [2023-08-29 10:25:16,171] org.grobid.core.process.ProcessPdfToXml: pdfalto stderr: Syntax Error: Couldn't find trailer dictionary
WARN  [2023-08-29 10:25:16,171] org.grobid.core.process.ProcessPdfToXml: pdfalto stderr: Syntax Error: Couldn't read xref table
ERROR [2023-08-29 10:25:17,164] org.grobid.core.process.ProcessPdfToXml: pdfalto process finished with error code: 1. [/opt/grobid/grobid-home/pdfalto/lin-64/pdfalto_server, -fullFontName, -noLineNumbers, -noImage, -annotation, -filesLimit, 2000, /opt/grobid/grobid-home/tmp/origin9296275133178436352.pdf, /opt/grobid/grobid-home/tmp/Rb9fzaY5kT.lxml, --timeout, 120]
ERROR [2023-08-29 10:25:17,164] org.grobid.core.process.ProcessPdfToXml: pdfalto return message: 

ERROR [2023-08-29 10:25:17,164] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs. 
! org.grobid.core.exceptions.GrobidException: [BAD_INPUT_DATA] PDF to XML conversion failed with error code: 1
! at org.grobid.core.document.DocumentSource.processPdfaltoServerMode(DocumentSource.java:246)
! at org.grobid.core.document.DocumentSource.pdfalto(DocumentSource.java:149)
! at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:64)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:108)
! at org.grobid.core.engines.Engine.fullTextToTEIDoc(Engine.java:507)
! at org.grobid.core.engines.Engine.fullTextToTEI(Engine.java:497)
! at org.grobid.service.process.GrobidRestProcessFiles.processFulltextDocument(GrobidRestProcessFiles.java:208)
! at org.grobid.service.GrobidRestService.processFulltext(GrobidRestService.java:268)
! at org.grobid.service.GrobidRestService.processFulltextDocument_post(GrobidRestService.java:220)
! at jdk.internal.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
! at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
! at java.base/java.lang.reflect.Method.invoke(Method.java:568)
! at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161)
! at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:160)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
! at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:326)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
! at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)
! at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:305)
! at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1154)
! at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:473)
! at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
! at io.dropwizard.jetty.NonblockingServletHolder.handle(NonblockingServletHolder.java:49)
! at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623)
! at io.dropwizard.servlets.ThreadNameFilter.doFilter(ThreadNameFilter.java:35)
! at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.handle(AllowedMethodsFilter.java:45)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.doFilter(AllowedMethodsFilter.java:39)
! at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
! at org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:311)
! at org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:265)
! at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
! at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
! at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:120)
! at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:135)
! at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
! at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
! at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
! at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
! at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
! at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
! at com.codahale.metrics.jetty9.InstrumentedHandler.handle(InstrumentedHandler.java:239)
! at io.dropwizard.jetty.RoutingHandler.handle(RoutingHandler.java:52)
! at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:703)
! at io.dropwizard.jetty.BiDiGzipHandler.handle(BiDiGzipHandler.java:67)
! at org.eclipse.jetty.server.handler.RequestLogHandler.handle(RequestLogHandler.java:56)
! at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:174)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
! at org.eclipse.jetty.server.Server.handle(Server.java:505)
! at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
! at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
! at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
! at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
! at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
! at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
! at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
! at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
! at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
! at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
! at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
! at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
! at java.base/java.lang.Thread.run(Thread.java:833)
10.0.0.2 - - [29/Aug/2023:10:25:17 +0000] "POST /api/processFulltextDocument HTTP/1.1" 500 64 "-" "python-requests/2.23.0" 1008
10.0.0.2 - - [29/Aug/2023:10:25:18 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 19139 "-" "python-requests/2.23.0" 4708
WARN  [2023-08-29 10:25:18,282] org.grobid.core.process.ProcessPdfToXml: pdfalto stderr: Syntax Warning: May not be a PDF file (continuing anyway)
WARN  [2023-08-29 10:25:18,282] org.grobid.core.process.ProcessPdfToXml: pdfalto stderr: Syntax Error: Couldn't read xref table
WARN  [2023-08-29 10:25:18,282] org.grobid.core.process.ProcessPdfToXml: pdfalto stderr: Syntax Warning: PDF file is damaged - attempting to reconstruct xref table...
WARN  [2023-08-29 10:25:18,282] org.grobid.core.process.ProcessPdfToXml: pdfalto stderr: Syntax Error: Couldn't find trailer dictionary
WARN  [2023-08-29 10:25:18,282] org.grobid.core.process.ProcessPdfToXml: pdfalto stderr: Syntax Error: Couldn't read xref table
ERROR [2023-08-29 10:25:19,280] org.grobid.core.process.ProcessPdfToXml: pdfalto process finished with error code: 1. [/opt/grobid/grobid-home/pdfalto/lin-64/pdfalto_server, -fullFontName, -noLineNumbers, -noImage, -annotation, -filesLimit, 2000, /opt/grobid/grobid-home/tmp/origin10236448733109071638.pdf, /opt/grobid/grobid-home/tmp/SKfekQs7VA.lxml, --timeout, 120]
ERROR [2023-08-29 10:25:19,280] org.grobid.core.process.ProcessPdfToXml: pdfalto return message: 

ERROR [2023-08-29 10:25:19,281] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs. 
! org.grobid.core.exceptions.GrobidException: [BAD_INPUT_DATA] PDF to XML conversion failed with error code: 1
! ... (78 stack frames identical to the first trace above, omitted)
10.0.0.2 - - [29/Aug/2023:10:25:19 +0000] "POST /api/processFulltextDocument HTTP/1.1" 500 64 "-" "python-requests/2.23.0" 1007
WARN  [2023-08-29 10:25:20,402] org.grobid.core.process.ProcessPdfToXml: pdfalto stderr: Syntax Warning: May not be a PDF file (continuing anyway)
WARN  [2023-08-29 10:25:20,403] org.grobid.core.process.ProcessPdfToXml: pdfalto stderr: Syntax Error: Couldn't read xref table
WARN  [2023-08-29 10:25:20,403] org.grobid.core.process.ProcessPdfToXml: pdfalto stderr: Syntax Warning: PDF file is damaged - attempting to reconstruct xref table...
WARN  [2023-08-29 10:25:20,403] org.grobid.core.process.ProcessPdfToXml: pdfalto stderr: Syntax Error: Couldn't find trailer dictionary
WARN  [2023-08-29 10:25:20,403] org.grobid.core.process.ProcessPdfToXml: pdfalto stderr: Syntax Error: Couldn't read xref table
ERROR [2023-08-29 10:25:21,400] org.grobid.core.process.ProcessPdfToXml: pdfalto process finished with error code: 1. [/opt/grobid/grobid-home/pdfalto/lin-64/pdfalto_server, -fullFontName, -noLineNumbers, -noImage, -annotation, -filesLimit, 2000, /opt/grobid/grobid-home/tmp/origin6206698693538522582.pdf, /opt/grobid/grobid-home/tmp/jRRw4d4J6U.lxml, --timeout, 120]
ERROR [2023-08-29 10:25:21,400] org.grobid.core.process.ProcessPdfToXml: pdfalto return message: 

ERROR [2023-08-29 10:25:21,400] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs. 
! org.grobid.core.exceptions.GrobidException: [BAD_INPUT_DATA] PDF to XML conversion failed with error code: 1
! ... (78 stack frames identical to the first trace above, omitted)
10.0.0.2 - - [29/Aug/2023:10:25:21 +0000] "POST /api/processFulltextDocument HTTP/1.1" 500 64 "-" "python-requests/2.23.0" 1008
10.0.0.2 - - [29/Aug/2023:10:25:23 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 19139 "-" "python-requests/2.23.0" 5566
10.0.0.2 - - [29/Aug/2023:10:25:30 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 19139 "-" "python-requests/2.23.0" 6366
10.0.0.2 - - [29/Aug/2023:10:25:41 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 19139 "-" "python-requests/2.23.0" 9817
10.0.0.2 - - [29/Aug/2023:10:25:53 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 19139 "-" "python-requests/2.23.0" 7781
10.0.0.2 - - [29/Aug/2023:10:25:53 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 46793 "-" "python-requests/2.23.0" 32116
10.0.0.2 - - [29/Aug/2023:10:26:19 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 278246 "-" "python-requests/2.23.0" 61791
10.0.0.2 - - [29/Aug/2023:10:26:31 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 90964 "-" "python-requests/2.23.0" 10813
10.0.0.2 - - [29/Aug/2023:10:27:12 +0000] "POST /api/processCitation HTTP/1.1" 200 756 "-" "python-requests/2.23.0" 9
10.0.0.2 - - [29/Aug/2023:10:27:12 +0000] "POST /api/processCitation HTTP/1.1" 200 739 "-" "python-requests/2.23.0" 6
10.0.0.2 - - [29/Aug/2023:10:27:12 +0000] "POST /api/processCitation HTTP/1.1" 200 180 "-" "python-requests/2.23.0" 5
10.0.0.2 - - [29/Aug/2023:10:27:13 +0000] "POST /api/processCitation HTTP/1.1" 200 129 "-" "python-requests/2.23.0" 2
10.0.0.2 - - [29/Aug/2023:10:27:13 +0000] "POST /api/processCitation HTTP/1.1" 200 267 "-" "python-requests/2.23.0" 474
10.0.0.2 - - [29/Aug/2023:10:27:13 +0000] "POST /api/processCitation HTTP/1.1" 200 73 "-" "python-requests/2.23.0" 3
10.0.0.2 - - [29/Aug/2023:10:27:13 +0000] "POST /api/processCitation HTTP/1.1" 200 73 "-" "python-requests/2.23.0" 3
10.0.0.2 - - [29/Aug/2023:10:27:17 +0000] "POST /api/processFulltextDocument HTTP/1.1" 500 271 "-" "python-requests/2.23.0" 67426
10.0.0.2 - - [29/Aug/2023:10:27:25 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 99962 "-" "python-requests/2.23.0" 53615
10.0.0.2 - - [29/Aug/2023:10:27:33 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 28899 "-" "python-requests/2.23.0" 7144
10.0.0.2 - - [29/Aug/2023:10:27:44 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 49469 "-" "python-requests/2.23.0" 11372
10.0.0.2 - - [29/Aug/2023:10:28:02 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 54785 "-" "python-requests/2.23.0" 1929
10.0.0.2 - - [29/Aug/2023:10:28:04 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 41348 "-" "python-requests/2.23.0" 1618
10.0.0.2 - - [29/Aug/2023:10:28:06 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 19744 "-" "python-requests/2.23.0" 1566
10.0.0.2 - - [29/Aug/2023:10:28:07 +0000] "POST /api/processFulltextDocument HTTP/1.1" 500 271 "-" "python-requests/2.23.0" 68300
10.0.0.2 - - [29/Aug/2023:10:28:08 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 38789 "-" "python-requests/2.23.0" 1629
10.0.0.2 - - [29/Aug/2023:10:28:10 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 19139 "-" "python-requests/2.23.0" 1651
10.0.0.2 - - [29/Aug/2023:10:28:15 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 210924 "-" "python-requests/2.23.0" 4434
10.0.0.2 - - [29/Aug/2023:10:28:17 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 27714 "-" "python-requests/2.23.0" 1722
10.0.0.2 - - [29/Aug/2023:10:28:19 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 55729 "-" "python-requests/2.23.0" 1829
10.0.0.2 - - [29/Aug/2023:10:28:21 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 34468 "-" "python-requests/2.23.0" 1614
10.0.0.2 - - [29/Aug/2023:10:28:23 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 40840 "-" "python-requests/2.23.0" 1712
10.0.0.2 - - [29/Aug/2023:10:28:26 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 65030 "-" "python-requests/2.23.0" 2060
10.0.0.2 - - [29/Aug/2023:10:28:28 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 19139 "-" "python-requests/2.23.0" 1647
10.0.0.2 - - [29/Aug/2023:10:28:49 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 110774 "-" "python-requests/2.23.0" 2623
10.0.0.2 - - [29/Aug/2023:10:28:50 +0000] "POST /api/processFulltextDocument HTTP/1.1" 500 271 "-" "python-requests/2.23.0" 65417
10.0.0.2 - - [29/Aug/2023:10:28:51 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 57684 "-" "python-requests/2.23.0" 1975
10.0.0.2 - - [29/Aug/2023:10:28:56 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 187029 "-" "python-requests/2.23.0" 4067
10.0.0.2 - - [29/Aug/2023:10:28:58 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 6970 "-" "python-requests/2.23.0" 1511
10.0.0.2 - - [29/Aug/2023:10:29:01 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 90022 "-" "python-requests/2.23.0" 2346
10.0.0.2 - - [29/Aug/2023:10:29:03 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 35833 "-" "python-requests/2.23.0" 1768
10.0.0.2 - - [29/Aug/2023:10:29:05 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 39587 "-" "python-requests/2.23.0" 2059
10.0.0.2 - - [29/Aug/2023:10:29:09 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 144038 "-" "python-requests/2.23.0" 2858
10.0.0.2 - - [29/Aug/2023:10:29:11 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 37341 "-" "python-requests/2.23.0" 1946
10.0.0.2 - - [29/Aug/2023:10:29:13 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 91027 "-" "python-requests/2.23.0" 1987
10.0.0.2 - - [29/Aug/2023:10:29:14 +0000] "POST /api/processAffiliations HTTP/1.1" 200 80 "-" "python-requests/2.23.0" 3
10.0.0.2 - - [29/Aug/2023:10:29:14 +0000] "POST /api/processAffiliations HTTP/1.1" 200 181 "-" "python-requests/2.23.0" 2
10.0.0.2 - - [29/Aug/2023:10:29:14 +0000] "POST /api/processAffiliations HTTP/1.1" 200 171 "-" "python-requests/2.23.0" 3
10.0.0.2 - - [29/Aug/2023:10:29:14 +0000] "POST /api/processAffiliations HTTP/1.1" 200 186 "-" "python-requests/2.23.0" 2
10.0.0.2 - - [29/Aug/2023:10:29:14 +0000] "POST /api/processAffiliations HTTP/1.1" 200 109 "-" "python-requests/2.23.0" 3
10.0.0.2 - - [29/Aug/2023:10:29:14 +0000] "POST /api/processAffiliations HTTP/1.1" 200 186 "-" "python-requests/2.23.0" 2
10.0.0.2 - - [29/Aug/2023:10:29:14 +0000] "POST /api/processAffiliations HTTP/1.1" 200 345 "-" "python-requests/2.23.0" 4
10.0.0.2 - - [29/Aug/2023:10:29:17 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 49750 "-" "python-requests/2.23.0" 2405
10.0.0.2 - - [29/Aug/2023:10:29:19 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 58867 "-" "python-requests/2.23.0" 2324

From the application's point of view, the failure looks like: reason=ConnectionError(ProtocolError('Connection aborted.', timeout('The write operation timed out')

Sometimes it fails on several PDFs; after a service restart it always fails on at least one PDF. The server does not appear to be overloaded. What could be wrong here?

kermitt2 commented 1 year ago

Hello @Levalife

This error means that the parsing of the PDF with pdfalto fails. Often it means that the PDF is not parsable/ill-formed/corrupted. It could also mean that the pdfalto process uses too much memory (max memory is defined in the grobid config file) and is killed to protect the grobid server.

Judging from the warning message logged before the error, these PDFs appear to be corrupted:

Syntax Warning: PDF file is damaged

From experience, we always see some number of PDFs failing like this when the PDFs come from the wild internet.
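A client can screen out obviously broken files before sending them. The following is a minimal sketch of such a pre-flight check, not what pdfalto itself does: it only looks for the `%PDF-` marker near the start of the file and the `%%EOF` marker near the end. The function name `looks_like_pdf` and the 1024-byte window are assumptions made for illustration.

```python
def looks_like_pdf(data: bytes, window: int = 1024) -> bool:
    """Heuristic sanity check for a PDF payload.

    Returns True when '%PDF-' appears in the first `window` bytes and
    '%%EOF' appears in the last `window` bytes. A file can pass this
    check and still be damaged (for example, a broken xref table), so
    treat a True result as "plausible", not "valid".
    """
    head = data[:window]   # header marker is expected near the start
    tail = data[-window:]  # end-of-file marker is expected near the end
    return b"%PDF-" in head and b"%%EOF" in tail
```

An HTML error page saved with a `.pdf` extension or a truncated download would fail this check, which is one common way to end up with "May not be a PDF file" warnings.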

Levalife commented 1 year ago

@kermitt2 I have changed the focus of the issue a little. The problem is with old PDFs that parsed fine before and now randomly fail with "The write operation timed out"

kermitt2 commented 1 year ago

Hmm, what is producing this timeout message "The write operation timed out"? Which client are you using to query the Grobid server? If the timeout comes from your client, could you increase its value?
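Since the access log shows python-requests as the client, setting the client timeout explicitly could look like the sketch below. The URL (host and port), the timeout values, and the helper name `post_pdf` are assumptions; the multipart field name `input` is the one the Grobid REST API expects for the uploaded file. Which phase of the request each timeout value governs depends on the HTTP library version, so the numbers need to be checked against the actual client.

```python
try:
    import requests  # third-party HTTP client seen in the access log
except ImportError:  # lets the sketch be imported without the dependency
    requests = None

# Assumed location of the Grobid service; adjust to the real deployment.
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"


def post_pdf(pdf_bytes, url=GROBID_URL, connect_timeout=10.0,
             read_timeout=300.0, session=None):
    """Send one PDF to Grobid with explicit client-side timeouts.

    `session` is anything with a requests-style `post` method; when it is
    omitted, a new requests.Session is created.
    """
    if session is None:
        session = requests.Session()
    # Grobid reads the uploaded file from the multipart field named "input".
    files = {"input": ("document.pdf", pdf_bytes, "application/pdf")}
    # requests accepts a (connect, read) tuple for the timeout argument.
    return session.post(url, files=files,
                        timeout=(connect_timeout, read_timeout))
```

Passing the session in makes it possible to reuse connections across many documents and to swap in a stub when testing the client without a running server.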

In the Grobid service, if the PDF parsing reaches a timeout, the error message would be something like "PDF to XML conversion timed out", associated with a 500 error.
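To tell these cases apart on the client side, the response status and text can be inspected. The sketch below is hypothetical: it assumes the error text from the server log (for example `[BAD_INPUT_DATA]` or "timed out") is also present in the HTTP response body, which should be verified against a real 500 response. The function name and the returned labels are invented for illustration.

```python
def classify_grobid_response(status_code: int, body: str) -> str:
    """Map a Grobid HTTP response to a coarse failure label.

    Assumes (unverified) that the server-side error message is echoed in
    the response body. A client-side timeout never reaches this function,
    because no HTTP response exists in that case; the client raises an
    exception instead.
    """
    if status_code == 200:
        return "ok"
    text = body.lower()
    if status_code == 500 and "timed out" in text:
        return "server_timeout"  # PDF conversion hit the server-side limit
    if status_code == 500 and "bad_input_data" in text:
        return "bad_input"       # pdfalto could not parse the PDF
    if status_code == 500:
        return "other_server_error"
    return "unexpected_status"
```

An error such as "The write operation timed out" would surface as an exception in the client before any of these labels applies, which points to the client rather than the Grobid service.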

If you can share one of these failing PDFs, I could try to reproduce the problem.

Levalife commented 1 year ago

It was a strange glitch in our Python client app that was resolved after another branch was merged. Thank you for your time!