kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0
3.58k stars 459 forks source link

facing BAD_INPUT_DATA error while extracting TEI XML #1110

Open bhargav-ss opened 6 months ago

bhargav-ss commented 6 months ago

Error Stack Trace

Traceback (most recent call last):
  File "/home/admin/miniconda3/envs/env3/lib/python3.9/site-packages/celery/app/trace.py", line 450, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/admin/miniconda3/envs/env3/lib/python3.9/site-packages/sentry_sdk/integrations/celery.py", line 204, in _inner
    reraise(*exc_info)
  File "/home/admin/miniconda3/envs/env3/lib/python3.9/site-packages/sentry_sdk/_compat.py", line 56, in reraise
    raise value
  File "/home/admin/miniconda3/envs/env3/lib/python3.9/site-packages/sentry_sdk/integrations/celery.py", line 199, in _inner
    return f(*args, **kwargs)
  File "/home/admin/miniconda3/envs/env3/lib/python3.9/site-packages/celery/app/trace.py", line 731, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/admin/agg-backend/src/apps/grobid/tasks/command_pdf_to_xml_task.py", line 65, in run
    xml_extraction_result = super().run(*args, **kwargs)
  File "/home/admin/agg-backend/src/apps/grobid/tasks/pdf_to_xml_task.py", line 278, in run
    extraction_response = self.extract_xml_from_pdf(**kwargs_for_extraction)
  File "/home/admin/agg-backend/src/apps/grobid/tasks/pdf_to_xml_task.py", line 93, in extract_xml_from_pdf
    response = self.pdf_to_tei_xml_interface.get(**kwargs_for_pdf_to_tei)
  File "/home/admin/agg-backend/src/interfaces/v2_grobid/pdf_to_tei_xml_interface.py", line 106, in get
    raise GrobidServiceGenericException(
interfaces.v2_grobid.exceptions.GrobidServiceGenericException: Failed to extract xml from pdf. Status code: 500. Response: [BAD_INPUT_DATA] An error occurred while converting pdf /home/admin/grobid-installation/grobid-home/tmp/origin4301062083993218417.pdf

We are running a pipeline which processes large number of PDF files and extracts TEI XML via grobid service. I have observed this behaviour where after a certain point, the service starts giving BAD_INPUT_DATA error on a lot of requests. Once I destroy the server and spawn a fresh grobid instance, the service give successful extraction on same PDF file where it gave BAD_INPUT_DATA earlier.

bhargav-ss commented 6 months ago

Attaching a sample file where this happened: untitled-collection-3dmno1cz-l-nwn-lwl-5dy1w3k2pb.pdf

lfoppiano commented 6 months ago

@bhargav-ss could you share more information on the deployment of the server? Docker or native? How much resources are allocated (CPU/GPU, Ram). Also could you share the log from the service?

bhargav-ss commented 6 months ago

Thanks for the prompt response. It is a native deployment. We were using c6a.32xlarge for the pipeline which had 64 cores and 256 GB memory.

Unfortunately I haven't retained logs of grobid service itself when this failed. I can try running this again and reproduce the error.

bhargav-ss commented 6 months ago

For real-time grobid conversion in application, we are using c6a.2xlarge which has 4 CPU cores and 16 GB memory.

bhargav-ss commented 6 months ago

Got the error in one of our real-time services:

ERROR [2024-05-04 23:27:19,419] org.grobid.core.process.ProcessPdfToXml: pdfalto process finished with error code: 137. [/home/admin/grobid-installation/grobid-home/pdfalto/lin-64/pdfalto_server, -fullFontName, -noLineNumbers, -noImage, -annotation, -filesLimit, 2000, /home/admin/grobid-installation/grobid-home/tmp/origin12655433907538568485.pdf, /home/admin/grobid-installation/grobid-home/tmp/7Jt7WRnCp6.lxml, --timeout, 120, --ulimit, 6242304]
ERROR [2024-05-04 23:27:19,419] org.grobid.core.process.ProcessPdfToXml: pdfalto return message:

ERROR [2024-05-04 23:27:19,419] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs.
! org.grobid.core.exceptions.GrobidException: [BAD_INPUT_DATA] PDF to XML conversion failed with error code: 137
! at org.grobid.core.document.DocumentSource.processPdfaltoServerMode(DocumentSource.java:248)
! at org.grobid.core.document.DocumentSource.pdfalto(DocumentSource.java:151)
! at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:64)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:116)
! at org.grobid.core.engines.Engine.fullTextToTEIDoc(Engine.java:587)
! at org.grobid.core.engines.Engine.fullTextToTEI(Engine.java:577)
! at org.grobid.service.process.GrobidRestProcessFiles.processFulltextDocument(GrobidRestProcessFiles.java:290)
! at org.grobid.service.GrobidRestService.processFulltext(GrobidRestService.java:291)
! at org.grobid.service.GrobidRestService.processFulltextDocument_post(GrobidRestService.java:240)
! at jdk.internal.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
! at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
! at java.base/java.lang.reflect.Method.invoke(Method.java:568)
! at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:134)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:177)
! at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:176)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:81)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81)
! at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:256)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
! at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
! at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:235)
! at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:684)
! at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:394)
! at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:358)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:311)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
! at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:764)
! at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1665)
! at io.dropwizard.servlets.ThreadNameFilter.doFilter(ThreadNameFilter.java:36)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.handle(AllowedMethodsFilter.java:46)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.doFilter(AllowedMethodsFilter.java:40)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:313)
! at org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:267)
bhargav-ss commented 6 months ago

Got one more log with BAD_INPUT_DATA but with different error code 1, earlier error code was 137

ERROR [2024-05-04 23:02:43,244] org.grobid.core.process.ProcessPdfToXml: pdfalto process finished with error code: 1. [/home/admin/grobid-installation/grobid-home/pdfalto/lin-64/pdfalto_server, -fullFontName, -noLineNumbers, -noImage, -annotation, -filesLimit, 2000, /home/admin/grobid-installation/grobid-home/tmp/origin1334766109947129628.pdf, /home/admin/grobid-installation/grobid-home/tmp/4HnBtDNyDC.lxml, --timeout, 120, --ulimit, 6242304]
ERROR [2024-05-04 23:02:43,244] org.grobid.core.process.ProcessPdfToXml: pdfalto return message:

ERROR [2024-05-04 23:02:43,245] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs.
! org.grobid.core.exceptions.GrobidException: [BAD_INPUT_DATA] PDF to XML conversion failed with error code: 1
! at org.grobid.core.document.DocumentSource.processPdfaltoServerMode(DocumentSource.java:248)
! at org.grobid.core.document.DocumentSource.pdfalto(DocumentSource.java:151)
! at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:64)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:116)
! at org.grobid.core.engines.Engine.fullTextToTEIDoc(Engine.java:587)
! at org.grobid.core.engines.Engine.fullTextToTEI(Engine.java:577)
! at org.grobid.service.process.GrobidRestProcessFiles.processFulltextDocument(GrobidRestProcessFiles.java:290)
! at org.grobid.service.GrobidRestService.processFulltext(GrobidRestService.java:291)
! at org.grobid.service.GrobidRestService.processFulltextDocument_post(GrobidRestService.java:240)
! at jdk.internal.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
! at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
! at java.base/java.lang.reflect.Method.invoke(Method.java:568)
! at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:134)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:177)
! at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:176)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:81)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81)
! at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:256)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
! at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
! at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:235)
! at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:684)
! at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:394)
! at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:358)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:311)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
! at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:764)
! at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1665)
! at io.dropwizard.servlets.ThreadNameFilter.doFilter(ThreadNameFilter.java:36)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.handle(AllowedMethodsFilter.java:46)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.doFilter(AllowedMethodsFilter.java:40)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:313)
! at org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:267)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
! at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:121)
! at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:133)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:527)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221)
! at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1382)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176)
! at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:484)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174)
! at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1304)
! at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at io.dropwizard.metrics.jetty11.InstrumentedHandler.handle(InstrumentedHandler.java:307)
! at io.dropwizard.jetty.RoutingHandler.handle(RoutingHandler.java:52)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:822)
! at io.dropwizard.jetty.ZipExceptionHandlingGzipHandler.handle(ZipExceptionHandlingGzipHandler.java:26)
! at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:173)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at org.eclipse.jetty.server.Server.handle(Server.java:563)
! at org.eclipse.jetty.server.HttpChannel.lambda$handle$0(HttpChannel.java:505)
! at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:762)
! at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:497)
! at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:282)
! at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:314)
! at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100)
! at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
! at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:416)
! at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:385)
! at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:272)
! at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.lambda$new$0(AdaptiveExecutionStrategy.java:140)
! at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:411)
! at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:936)
! at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1080)
! at java.base/java.lang.Thread.run(Thread.java:842)
kermitt2 commented 6 months ago

Hello @bhargav-ss The sampling pdf file is working on the demo https://kermitt2-grobid.hf.space/ and with the current master natively. Could you give more details on the version and how you are running the grobid server?

bhargav-ss commented 6 months ago

Our grobid version: 0.8.0

We are running grobid server as a systemd service.