kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

processFulltextDocument fails on 0.23% arXiv PDFs #1113

Open MarksonChen opened 5 months ago

MarksonChen commented 5 months ago

I ran processFulltextDocument on 22103 arXiv PDFs. 22053 PDFs succeeded and 50 failed.

Running on macOS (M2 chip)
Java version: 17.0.10
Server started with Gradle (./gradlew run)
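
For anyone who wants to reproduce this kind of batch run, here is a minimal sketch of a client that posts one PDF to the service and returns the HTTP status code. It assumes the service listens on the default local address `http://localhost:8070` and that the PDF is sent as a multipart form field named `input`; the class and method names (`FulltextClient`, `multipartBody`, `process`) are made up for illustration and are not part of GROBID.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class FulltextClient {
    // Assumption: a GROBID service started locally on its default port.
    static final String ENDPOINT = "http://localhost:8070/api/processFulltextDocument";

    // Builds a multipart/form-data body with a single file part named "input".
    static byte[] multipartBody(String boundary, String fileName, byte[] pdf) {
        byte[] head = ("--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"input\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n").getBytes(StandardCharsets.UTF_8);
        byte[] tail = ("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(head, 0, head.length);
        out.write(pdf, 0, pdf.length);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    // Sends one PDF and returns the HTTP status code of the response.
    static int process(HttpClient client, Path pdf) throws IOException, InterruptedException {
        String boundary = "grobid-" + System.nanoTime();
        byte[] body = multipartBody(boundary, pdf.getFileName().toString(), Files.readAllBytes(pdf));
        HttpRequest request = HttpRequest.newBuilder(URI.create(ENDPOINT))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
        return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
    }
}
```

Calling `process` for every file in a directory and counting the responses that are not 200 gives a success/failure tally like the one above.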

An example error log:

ERROR [2024-05-09 13:13:55,538] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs. 
! java.lang.IndexOutOfBoundsException: Index 0 out of bounds for length 0
! at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
! at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
! at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)
! at java.base/java.util.Objects.checkIndex(Objects.java:359)
! at java.base/java.util.ArrayList.get(ArrayList.java:427)
! at org.grobid.core.data.Note.getPageNumber(Note.java:77)
! at org.grobid.core.document.TEIFormatter.lambda$toTEITextPiece$0(TEIFormatter.java:1460)
! at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:178)
! at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
! at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
! at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
! at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
! at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
! at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
! at org.grobid.core.document.TEIFormatter.toTEITextPiece(TEIFormatter.java:1461)
! at org.grobid.core.document.TEIFormatter.toTEIBody(TEIFormatter.java:1015)
! at org.grobid.core.engines.FullTextParser.toTEI(FullTextParser.java:2648)
! ... 78 common frames omitted
! Causing: org.grobid.core.exceptions.GrobidException: [GENERAL] An exception occurred while running Grobid.
! at org.grobid.core.engines.FullTextParser.toTEI(FullTextParser.java:2708)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:320)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:119)
! at org.grobid.core.engines.Engine.fullTextToTEIDoc(Engine.java:587)
! at org.grobid.core.engines.Engine.fullTextToTEI(Engine.java:577)
! at org.grobid.service.process.GrobidRestProcessFiles.processFulltextDocument(GrobidRestProcessFiles.java:290)
! at org.grobid.service.GrobidRestService.processFulltext(GrobidRestService.java:291)
! at org.grobid.service.GrobidRestService.processFulltextDocument_post(GrobidRestService.java:240)
! at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
! at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
! at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
! at java.base/java.lang.reflect.Method.invoke(Method.java:568)
! at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:134)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:177)
! at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:176)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:81)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81)
! at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:256)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
! at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
! at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:235)
! at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:684)
! at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:394)
! at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:358)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:311)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
! at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:764)
! at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1665)
! at io.dropwizard.servlets.ThreadNameFilter.doFilter(ThreadNameFilter.java:36)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.handle(AllowedMethodsFilter.java:46)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.doFilter(AllowedMethodsFilter.java:40)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:313)
! at org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:267)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
! at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:121)
! at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:133)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:527)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221)
! at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1382)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176)
! at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:484)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174)
! at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1304)
! at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at io.dropwizard.metrics.jetty11.InstrumentedHandler.handle(InstrumentedHandler.java:307)
! at io.dropwizard.jetty.RoutingHandler.handle(RoutingHandler.java:52)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:822)
! at io.dropwizard.jetty.ZipExceptionHandlingGzipHandler.handle(ZipExceptionHandlingGzipHandler.java:26)
! at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:173)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at org.eclipse.jetty.server.Server.handle(Server.java:563)
! at org.eclipse.jetty.server.HttpChannel.lambda$handle$0(HttpChannel.java:505)
! at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:762)
! at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:497)
! at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:282)
! at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:314)
! at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100)
! at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
! at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:936)
! at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1080)
! at java.base/java.lang.Thread.run(Thread.java:842)
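
The top frames of the trace show `ArrayList.get` being called with index 0 on an empty list inside `Note.getPageNumber`. The sketch below reproduces that failure mode with a simplified stand-in class and shows one defensive alternative. `SimpleNote` and its `tokenPages` field are assumptions for illustration only; this is not GROBID's actual `Note` class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;

// Hypothetical stand-in for a note that stores the page number of each of its tokens.
final class SimpleNote {
    private final List<Integer> tokenPages;

    SimpleNote(List<Integer> tokenPages) {
        this.tokenPages = new ArrayList<>(tokenPages);
    }

    // Unguarded: throws IndexOutOfBoundsException when the note has no tokens.
    int pageNumberUnsafe() {
        return tokenPages.get(0);
    }

    // Guarded: reports "no page" instead of throwing.
    OptionalInt pageNumberSafe() {
        return tokenPages.isEmpty() ? OptionalInt.empty() : OptionalInt.of(tokenPages.get(0));
    }
}
```

With an empty token list, `pageNumberUnsafe()` fails with the same exception type as in the log, while `pageNumberSafe()` returns an empty `OptionalInt` that the caller can handle.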

The 50 PDFs that failed:

https://arxiv.org/pdf/2202.03169
https://arxiv.org/pdf/2007.10408
https://arxiv.org/pdf/2008.08076
https://arxiv.org/pdf/2203.00397
https://arxiv.org/pdf/2202.00145
https://arxiv.org/pdf/2110.13423
https://arxiv.org/pdf/2006.16218
https://arxiv.org/pdf/2305.01868
https://arxiv.org/pdf/2206.11939
https://arxiv.org/pdf/1711.05715
https://arxiv.org/pdf/2110.11222
https://arxiv.org/pdf/2006.13025
https://arxiv.org/pdf/1902.00450
https://arxiv.org/pdf/2109.04212
https://arxiv.org/pdf/2105.14849
https://arxiv.org/pdf/cs/9906002
https://arxiv.org/pdf/2101.09398
https://arxiv.org/pdf/1911.00536
https://arxiv.org/pdf/1912.02762
https://arxiv.org/pdf/2104.07857
https://arxiv.org/pdf/2106.15093
https://arxiv.org/pdf/1901.09401
https://arxiv.org/pdf/2201.10129
https://arxiv.org/pdf/2010.04879
https://arxiv.org/pdf/1206.5241
https://arxiv.org/pdf/2203.14101
https://arxiv.org/pdf/1905.06214
https://arxiv.org/pdf/2205.05789
https://arxiv.org/pdf/1810.00953
https://arxiv.org/pdf/1910.11856
https://arxiv.org/pdf/1501.02876
https://arxiv.org/pdf/2202.01987
https://arxiv.org/pdf/2303.02186
https://arxiv.org/pdf/2010.05761
https://arxiv.org/pdf/2204.11918
https://arxiv.org/pdf/2002.12361
https://arxiv.org/pdf/1810.07311
https://arxiv.org/pdf/1905.03817
https://arxiv.org/pdf/1901.07846
https://arxiv.org/pdf/2202.03798
https://arxiv.org/pdf/1711.01244
https://arxiv.org/pdf/2006.03040
https://arxiv.org/pdf/2004.10964
https://arxiv.org/pdf/1803.00590
https://arxiv.org/pdf/1612.06109
https://arxiv.org/pdf/1704.03651
https://arxiv.org/pdf/1610.09534
https://arxiv.org/pdf/2202.03555
https://arxiv.org/pdf/2008.04990

kermitt2 commented 5 months ago

Hi @MarksonChen

This should normally be fixed by https://github.com/kermitt2/grobid/pull/1075. Are you using the latest master version?

MarksonChen commented 5 months ago

Hi kermitt2,

Thank you for your reply. I was using 0.8.0.

However, after switching to the latest master version (using git clone https://github.com/kermitt2/grobid.git), 49 out of 50 papers listed above still cannot be parsed with processFulltextDocument.

kermitt2 commented 5 months ago

Thank you @MarksonChen for checking and reporting these arXiv error cases.

Indeed, the problem is not related to the issue addressed by https://github.com/kermitt2/grobid/pull/1075, sorry. I just pushed a quick fix, and these files should now work too.

MarksonChen commented 5 months ago

Hi, kermitt2, thank you so much for your speedy fix! The amount of continual work put into this open-source project has been remarkable. All 22085 fetchable arXiv PDFs can be parsed successfully with processFulltextDocument.

lfoppiano commented 5 months ago

@kermitt2 I have a sense of déjà vu about this issue from working on PRs #1097 and #1099.

This happens, as far as I remember, when a note with the same "label" is identified more than once in the text. When the notes list is collected from the text, using int idx = clusterTokens.indexOf(matching.get()); without updating the search position results in the same note being collected again with the same positions.
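
To illustrate the behaviour described above, here is a small sketch of what a plain `indexOf` does when the same label occurs more than once in a token list. The class name and the token data are made up for the example; only the use of `List.indexOf` mirrors the line quoted from the code.

```java
import java.util.ArrayList;
import java.util.List;

final class RepeatedLabelDemo {
    // Looks up each label occurrence with indexOf over the full token list,
    // without moving the search position between lookups.
    static List<Integer> positionsOf(List<String> tokens, List<String> labels) {
        List<Integer> positions = new ArrayList<>();
        for (String label : labels) {
            positions.add(tokens.indexOf(label)); // always the first match
        }
        return positions;
    }
}
```

For the tokens `This note 1 , and this note 2 , but back to note 1` and the label sequence `1, 2, 1`, this returns the positions 2, 7 and 2: the third lookup lands on the first `1` again instead of the last token.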

https://github.com/kermitt2/grobid/blob/83f2c81a3580c052697ffb46949dfec3deb67f32/grobid-core/src/main/java/org/grobid/core/document/TEIFormatter.java#L1590

For the first article in the list, 2202.03169, this happens because there are three notes with the same intervals. Maybe we could just filter them out as an additional precaution.

I'm also writing down some additional information here, as I will forget it within the hour. I've checked just one example, which is quite messed up:
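
The filtering precaution could look roughly like the sketch below, which keeps only the first note reference for each interval. `NoteSpans` and `Span` are hypothetical types invented for this example, not classes from the GROBID code base.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class NoteSpans {
    // Hypothetical minimal note reference: a label plus its token interval.
    static final class Span {
        final String label;
        final int start;
        final int end;

        Span(String label, int start, int end) {
            this.label = label;
            this.start = start;
            this.end = end;
        }
    }

    // Keeps the first span seen for each (start, end) interval, preserving order.
    static List<Span> dropDuplicateIntervals(List<Span> spans) {
        Set<List<Integer>> seen = new HashSet<>();
        List<Span> unique = new ArrayList<>();
        for (Span span : spans) {
            if (seen.add(List.of(span.start, span.end))) {
                unique.add(span);
            }
        }
        return unique;
    }
}
```

Three spans with identical intervals collapse to one, while spans with different intervals are left untouched.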

image
TakeasanexamplethesetupinFigure2,whereaballcaxn,xsetup2oRfSercetpiorens3en.1t,thceauosbaslerfvaacttiornswatithimouetsateupntiqanudesetoTfakeasanexamplethes1e0tu2pin3F.2ig.uLrea2r,nwinhgerweitahbIanlltecravnentionsoverTime
t
t+
t t+1 t+1 106 Note that when two variables Ci and Cj can only be inter-
We consider a dataset D of tuples {x , x , I } where

lfoppiano commented 4 months ago

I'm reopening this to follow up on my last comment.

Avoiding the duplicated interval is done by narrowing the search space of indexOf, that is, by reducing the list of tokens it searches. However, I noticed that at https://github.com/kermitt2/grobid/blob/83f2c81a3580c052697ffb46949dfec3deb67f32/grobid-core/src/main/java/org/grobid/core/document/TEIFormatter.java#L1596, if the label is repeated in the same sentence, we override the note. I think it would work fine if we used the note identifier, which should be unique from the notes' point of view.

I'm submitting the PR with two fixes:

  1. Avoid collecting the same position in the text when the note label is repeated. For example, given "This note1, and this note2, but back to the first note1", we would otherwise collect the offset of the first 1 label twice.
  2. Update labels2notes so that it uses the note identifier instead of the label.
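
The two fixes can be sketched as follows. This is a simplified illustration under assumed data structures, not the code from the PR: `NoteFixSketch`, `NoteEntry` and the method names are hypothetical, and notes are reduced to an identifier, a label and a text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NoteFixSketch {
    // Hypothetical minimal note: unique identifier, printed label, note text.
    static final class NoteEntry {
        final String id;
        final String label;
        final String text;

        NoteEntry(String id, String label, String text) {
            this.id = id;
            this.label = label;
            this.text = text;
        }
    }

    // Fix 1 (sketch): search only the tokens after the previous match,
    // so a repeated label resolves to its next occurrence.
    static List<Integer> advancingPositions(List<String> tokens, List<String> labels) {
        List<Integer> positions = new ArrayList<>();
        int from = 0;
        for (String label : labels) {
            int relative = tokens.subList(from, tokens.size()).indexOf(label);
            if (relative < 0) {
                continue; // no further occurrence of this label
            }
            positions.add(from + relative);
            from += relative + 1;
        }
        return positions;
    }

    // Keyed by label: a later note with the same label replaces the earlier one.
    static Map<String, NoteEntry> indexByLabel(List<NoteEntry> notes) {
        Map<String, NoteEntry> index = new LinkedHashMap<>();
        for (NoteEntry note : notes) {
            index.put(note.label, note);
        }
        return index;
    }

    // Fix 2 (sketch): keyed by identifier, so notes sharing a label both survive.
    static Map<String, NoteEntry> indexByIdentifier(List<NoteEntry> notes) {
        Map<String, NoteEntry> index = new LinkedHashMap<>();
        for (NoteEntry note : notes) {
            index.put(note.id, note);
        }
        return index;
    }
}
```

On the tokens `This note 1 , and this note 2 , but back to note 1` with labels `1, 2, 1`, `advancingPositions` yields three distinct offsets. Two notes that share the label `1` leave one entry in `indexByLabel` but two in `indexByIdentifier`.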