kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

PDF results in unexpected exception: IllegalArgumentException #1024

Open gjreda opened 1 year ago

gjreda commented 1 year ago

I'm running Grobid via the lfoppiano/grobid:0.7.3-arm docker container on an M1 MacBook Air with macOS 13.3.1.

When trying to run Grobid against a particular PDF, I receive an unexpected exception due to an IllegalArgumentException.

Is there something particular about this PDF that causes Grobid to fail? I'd love to understand the exception more so I can watch out for similar ones within the project I'm working on.

Steps to reproduce

$ docker run -t --rm -p 8070:8070 lfoppiano/grobid:0.7.3-arm

Download this PDF and, using the python client, call the Grobid server:

In [1]: from grobid_client.grobid_client import GrobidClient

In [2]: client = GrobidClient(grobid_server='http://localhost:8070')
GROBID server is up and running

In [3]: client.process('processFulltextDocument', "/Users/greg/Library/Application Support/com.tauri.dev/project-x/uploads", output="./output", force=True)
Processing of /Users/greg/Library/Application Support/com.tauri.dev/project-x/uploads/FActScore-.Fine-grained.Atomic.Evaluation.of.Factual.Precision.in.Long.Form.Text.Generation.pdf failed with error 500 , [GENERAL] An exception occurred while running Grobid.

pdfalto_server seems to run correctly against the PDF and creates the appropriate lxml files:

$ docker exec -it 3e5f2d283ef5 /bin/bash

root@3e5f2d283ef5:/opt/grobid# apt-get update
root@3e5f2d283ef5:/opt/grobid# apt-get install curl
root@3e5f2d283ef5:/opt/grobid# curl -o /tmp/test.pdf https://arxiv.org/pdf/2305.14251.pdf
root@3e5f2d283ef5:/opt/grobid# /opt/grobid/grobid-home/pdfalto/lin-64/pdfalto_server -fullFontName -noLineNumbers -noImage -annotation -filesLimit 2000 -l 2 /tmp/test.pdf /tmp/test.lxml --timeout 120
root@3e5f2d283ef5:/opt/grobid# ls -la /tmp/
total 2760
drwxrwxrwt 1 root root    4096 Jun  6 21:11 .
drwxr-xr-x 1 root root    4096 Jun  6 21:00 ..
drwxr-xr-x 1 root root    4096 Jun  6 21:00 hsperfdata_root
-rw-r--r-- 1 root root  277223 Jun  6 21:11 test.lxml
-rw-r--r-- 1 root root   24878 Jun  6 21:11 test.lxml_annot.xml
-rw-r--r-- 1 root root     269 Jun  6 21:11 test.lxml_metadata.xml
-rw-r--r-- 1 root root 2497316 Jun  6 21:09 test.pdf

Below is the stacktrace that gets returned to the python client.

```
ERROR [2023-06-06 20:46:38,345] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs.
! java.lang.IllegalArgumentException: fromIndex(85) > toIndex(84)
! at java.base/java.util.AbstractList.subListRangeCheck(AbstractList.java:509)
! at java.base/java.util.ArrayList.subList(ArrayList.java:1108)
! at org.grobid.core.document.TEIFormatter.toTEITextPiece(TEIFormatter.java:1447)
! at org.grobid.core.document.TEIFormatter.toTEIBody(TEIFormatter.java:917)
! at org.grobid.core.engines.FullTextParser.toTEI(FullTextParser.java:2468)
! ... 76 common frames omitted
! Causing: org.grobid.core.exceptions.GrobidException: [GENERAL] An exception occurred while running Grobid.
! at org.grobid.core.engines.FullTextParser.toTEI(FullTextParser.java:2552)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:302)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:111)
! at org.grobid.core.engines.Engine.fullTextToTEIDoc(Engine.java:507)
! at org.grobid.core.engines.Engine.fullTextToTEI(Engine.java:497)
! at org.grobid.service.process.GrobidRestProcessFiles.processFulltextDocument(GrobidRestProcessFiles.java:208)
! at org.grobid.service.GrobidRestService.processFulltext(GrobidRestService.java:268)
! at org.grobid.service.GrobidRestService.processFulltextDocument_post(GrobidRestService.java:220)
! at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
! at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
! at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
! at java.base/java.lang.reflect.Method.invoke(Method.java:568)
! at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161)
! at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:160)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
! at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:326)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
! at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)
! at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:305)
! at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1154)
! at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:473)
! at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
! at io.dropwizard.jetty.NonblockingServletHolder.handle(NonblockingServletHolder.java:49)
! at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623)
! at io.dropwizard.servlets.ThreadNameFilter.doFilter(ThreadNameFilter.java:35)
! at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.handle(AllowedMethodsFilter.java:45)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.doFilter(AllowedMethodsFilter.java:39)
! at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
! at org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:311)
! at org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:265)
! at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
! at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
! at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:120)
! at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:135)
! at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
! at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
! at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
! at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
! at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
! at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
! at com.codahale.metrics.jetty9.InstrumentedHandler.handle(InstrumentedHandler.java:239)
! at io.dropwizard.jetty.RoutingHandler.handle(RoutingHandler.java:52)
! at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:703)
! at io.dropwizard.jetty.BiDiGzipHandler.handle(BiDiGzipHandler.java:67)
! at org.eclipse.jetty.server.handler.RequestLogHandler.handle(RequestLogHandler.java:56)
! at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:174)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
! at org.eclipse.jetty.server.Server.handle(Server.java:505)
! at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
! at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
! at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
! at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
! at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
! at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
! at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
! at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
! at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:132)
! at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
! at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
! at java.base/java.lang.Thread.run(Thread.java:833)
```

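
For readers unfamiliar with the message at the top of the trace: `fromIndex(85) > toIndex(84)` is what Java's `ArrayList.subList` reports when it is asked for a range whose start lies after its end. The block below is only a rough Python analogue of that bounds check, not Grobid code; `sub_list` and `tokens` are hypothetical names, and `ValueError` stands in for Java's `IllegalArgumentException`.

```python
def sub_list(items, from_index, to_index):
    """Rough Python analogue of the range check behind Java's List.subList."""
    if from_index < 0:
        raise IndexError(f"fromIndex = {from_index}")
    if to_index > len(items):
        raise IndexError(f"toIndex = {to_index}")
    if from_index > to_index:
        # Java throws IllegalArgumentException here; ValueError stands in for it.
        raise ValueError(f"fromIndex({from_index}) > toIndex({to_index})")
    return items[from_index:to_index]


# Placeholder for a list of layout tokens (hypothetical data).
tokens = list(range(100))

try:
    # An inverted range like the one in the stack trace.
    sub_list(tokens, 85, 84)
except ValueError as err:
    message = str(err)
    print(message)
```

So the exception says less about the PDF bytes themselves than about two offsets computed downstream that ended up in the wrong order.
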
lfoppiano commented 1 year ago

I debugged this and am reporting what I found, although I'm not really sure what the proper way to deal with it is. An unlucky combination of factors makes this PDF fail.

1) Part of the text on page 2 is tagged incorrectly ("3. We suggest ....." at the beginning of the second column). It's really bad luck, because from the text alone it looks as if two notes come in sequence at the bottom of the page:

2.  We  2.  2   2.  2.  2.  BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  CONTAINSDIGITS  0   0   0   0   0   0   0   0   0   4   .   1   8   0   1   0   0   1   <body>
FACTSCORE   with    factscore   F   FA  FAC FACT    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  NODIGIT 0   0   0   0   0   0   0   0   0   4   ,-  2   8   0   1   0   0   1   <body>
lowing  evaluation  lowing  l   lo  low lowi    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   0   4   no  0   8   0   1   0   0   1   <body>
without manual  without w   wi  wit with    BLOCKEND    PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   0   4   .   1   6   0   1   0   0   1   <body>
2   perplexity.ai   2   2   2   2   2   BLOCKSTART  PAGEIN  SAMEFONT    LOWERFONT   0   0   NOCAPS  ALLDIGIT    1   0   0   0   0   0   0   0   0   5   .   1   10  0   1   0   0   1   I-<footnote>
3.  We  3.  3   3.  3.  3.  BLOCKSTART  PAGEIN  SAMEFONT    HIGHERFONT  0   0   ALLCAP  CONTAINSDIGITS  0   0   0   0   0   0   0   0   0   5   .   1   10  0   0   0   0   1   <footnote>
for a   for f   fo  for for BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   0   5   (..,-   5   9   0   0   0   0   1   <footnote>
ended   generation) ended   e   en  end ende    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   0   5   )   1   9   0   0   0   0   1   <footnote>
estimator.  estimator.  estimator.  e   es  est esti    BLOCKEND    PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   0   0   0   0   0   0   0   5   .   1   2   0   0   0   0   1   <footnote>
2   Related 2   2   2   2   2   BLOCKSTART  PAGEIN  NEWFONT HIGHERFONT  0   0   NOCAPS  ALLDIGIT    1   0   0   0   0   0   0   0   0   6   no  0   10  0   0   0   0   1   I-<body>

The other footnote on page 2 is correctly recognised, but has the same label, 3:

that    we  that    t   th  tha that    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   1   9   --  2   8   0   1   0   0   1   <body>
rather  than    rather  r   ra  rat rath    BLOCKEND    PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   1   9   --. 3   9   0   1   0   0   1   <body>
3   Consisting  3   3   3   3   3   BLOCKSTART  PAGEIN  NEWFONT LOWERFONT   0   0   NOCAPS  ALLDIGIT    1   0   0   0   0   0   0   0   1   11  -(),    4   9   0   1   0   0   1   I-<footnote>
18-29   in  18-29   1   18  18- 18-2    BLOCKEND    PAGEEND SAMEFONT    SAMEFONTSIZE    0   0   ALLCAP  CONTAINSDIGITS  0   0   0   0   0   0   0   0   1   11  -.().().    8   10  0   1   0   0   1   <footnote>
Model-based Evaluation. model-based M   Mo  Mod Mode    BLOCKSTART  PAGESTART   NEWFONT HIGHERFONT  0   0   INITCAP NODIGIT 0   0   1   0   0   0   0   0   1   0   -.  2   8   0   0   0   0   1   I-<body>
learned models  learned l   le  lea lear    BLOCKIN PAGEIN  SAMEFONT    SAMEFONTSIZE    0   0   NOCAPS  NODIGIT 0   0   1   0   0   0   0   0   1   0   no  0   8   0   0   0   0   1   <body>
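
A small sketch for reading feature rows like the ones above. It assumes only that the first column is the first token of the line and that the last column is the predicted label, with an `I-` prefix marking the start of a new segment; those assumptions come from the rows quoted here, not from a format specification. The helper names are hypothetical, and the rows are copied from the dumps above with runs of whitespace collapsed to single spaces.

```python
# Three rows copied from the feature dumps above (whitespace collapsed).
ROWS = [
    "2 perplexity.ai 2 2 2 2 2 BLOCKSTART PAGEIN SAMEFONT LOWERFONT 0 0 NOCAPS "
    "ALLDIGIT 1 0 0 0 0 0 0 0 0 5 . 1 10 0 1 0 0 1 I-<footnote>",
    "3. We 3. 3 3. 3. 3. BLOCKSTART PAGEIN SAMEFONT HIGHERFONT 0 0 ALLCAP "
    "CONTAINSDIGITS 0 0 0 0 0 0 0 0 0 5 . 1 10 0 0 0 0 1 <footnote>",
    "3 Consisting 3 3 3 3 3 BLOCKSTART PAGEIN NEWFONT LOWERFONT 0 0 NOCAPS "
    "ALLDIGIT 1 0 0 0 0 0 0 0 1 11 -(), 4 9 0 1 0 0 1 I-<footnote>",
]


def token_and_label(row):
    """Return (first column, last column) of one whitespace-separated row."""
    cols = row.split()
    return cols[0], cols[-1]


def footnote_starts(rows):
    """First tokens of the rows whose label opens a new footnote segment."""
    return [tok for tok, label in map(token_and_label, rows)
            if label == "I-<footnote>"]


print(footnote_starts(ROWS))
```

Read this way, the "3. We" row carries a plain `<footnote>` label, so it continues the footnote opened on the previous row instead of starting its own.
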
2) On this page, Grobid recognises two notes with the same label, 3 (it seems we lose the note with label 2, though I'm not sure how). There is a mechanism that searches the text for the anchor of each note:
for (Note note : notesSamePage) {
    // Take the first superscript token whose text equals the note label.
    Optional<LayoutToken> matching = clusterTokens
        .stream()
        .filter(t -> t.getText().equals(note.getLabel()) && t.isSuperscript())
        .findFirst();
    // ... (excerpt ends here)
}

This uses the label of the first note, which has the value "3", to match the layout token text, and it matches the right anchor in the text:

[image]

The same then happens with the second label (also 3), so the same anchor matches two different notes, and we end up with two intervals referring to the same note.
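
The duplicate match can be reproduced in miniature. The sketch below is Python, not Grobid code: `Token`, `Note`, `first_match` and the sample data are hypothetical stand-ins for the `LayoutToken` and `Note` objects in the Java excerpt, and `first_unclaimed_match` is just one possible guard, shown for illustration, not the fix adopted by the project.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Token:
    text: str
    superscript: bool


@dataclass
class Note:
    label: str
    text: str


def first_match(tokens: List[Token], note: Note) -> Optional[int]:
    """Index of the first superscript token equal to the note label
    (the same idea as stream().filter(...).findFirst() in the excerpt)."""
    for i, tok in enumerate(tokens):
        if tok.superscript and tok.text == note.label:
            return i
    return None


def first_unclaimed_match(tokens: List[Token], note: Note,
                          claimed: Set[int]) -> Optional[int]:
    """Hypothetical guard: skip anchors already assigned to another note."""
    for i, tok in enumerate(tokens):
        if i not in claimed and tok.superscript and tok.text == note.label:
            claimed.add(i)
            return i
    return None


# Made-up page content with a single superscript "3" acting as the anchor.
tokens = [Token("estimator", False), Token("3", True),
          Token("rather", False), Token("than", False)]
# Two notes that ended up with the same label.
notes = [Note("3", "first note"), Note("3", "second note")]

# Both notes resolve to the same token index.
matches = [first_match(tokens, n) for n in notes]
print(matches)
```

With the guard, the second note would simply find no anchor instead of sharing one; whether that is the right behaviour for Grobid is a separate question.
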

gjreda commented 1 year ago

Thank you for looking into this @lfoppiano! Sounds like it's just a poorly structured PDF and there's not much that Grobid can do.

kermitt2 commented 1 year ago

(re-opening to try to fix the exception properly in this case)