nlmatics / nlm-ingestor

This repo provides the server side code for llmsherpa API to connect. It includes parsers for various file formats.
https://www.nlmatics.com
Apache License 2.0
1.1k stars, 158 forks

&applyOcr=yes - no OCR taking place (skipping image pages) #50

Open thelazydogsback opened 7 months ago

thelazydogsback commented 7 months ago

I'm using &applyOcr=yes, but there's no indication that any OCR is taking place. The HTML for the PDF's text layer comes back fine, but pages in my PDFs that are images of (clear) text are skipped completely. I'm using the latest Docker image from the notebook. Thanks.
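For anyone trying to reproduce this, here is a minimal sketch of two helpers: one appends applyOcr=yes to the parse URL without clobbering other query parameters, and one reports which pages produced no output at all. The function names are hypothetical. The base URL in the demo is an assumed local Docker default, and the page_idx field on each returned block is an assumption about the response shape, so adjust both to your setup.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_ocr(api_url: str) -> str:
    """Return api_url with applyOcr=yes set, keeping all other query parameters."""
    parts = urlsplit(api_url)
    # Drop any existing applyOcr value so the parameter appears exactly once.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "applyOcr"
    ]
    params.append(("applyOcr", "yes"))
    return urlunsplit(parts._replace(query=urlencode(params)))


def skipped_pages(blocks, page_count):
    """Return zero-based indices of pages with no blocks in the parser output.

    Assumes each block is a dict carrying a zero-based "page_idx" key
    (an assumption about the response, not confirmed in this thread).
    """
    seen = {block["page_idx"] for block in blocks if "page_idx" in block}
    return [index for index in range(page_count) if index not in seen]


if __name__ == "__main__":
    # Assumed local default; replace with the URL of your own container.
    base = "http://localhost:5010/api/parseDocument?renderFormat=all"
    print(with_ocr(base))
    # Illustrative data only: a 4-page PDF where pages 1 and 3 came back empty.
    print(skipped_pages([{"page_idx": 0}, {"page_idx": 2}], page_count=4))
```

If skipped_pages lists exactly the scanned pages, that would match the behaviour described here: the text layer is parsed, but the image-only pages never reach the output.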

yagobski commented 6 months ago

Any progress on this issue? We have the same problem.

jamesvillarrubia commented 5 months ago

I ran this locally and stepped through the code. The OCR step seems to return an empty HTML body, and the Tika logs show the Tika server throwing an error when it attempts the OCR. It may be related:

WARN  [qtp487764004-32] 21:51:48,577 org.eclipse.jetty.server.handler.ContextHandler Unimplemented getRequestCharacterEncoding() - use org.eclipse.jetty.servlet.ServletContextHandler
INFO  [qtp487764004-32] 21:51:48,583 org.apache.tika.server.core.resource.RecursiveMetadataResource /rmeta (autodetecting type)
ERROR [qtp487764004-32] 21:51:48,637 org.apache.pdfbox.pdmodel.font.PDType1Font Can't read the embedded Type1 font AAAAAB+Helvetica
java.io.IOException: Start marker missing
    at org.apache.fontbox.pfb.PfbParser.parsePfb(PfbParser.java:147) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.fontbox.pfb.PfbParser.<init>(PfbParser.java:125) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.fontbox.type1.Type1Font.createWithPFB(Type1Font.java:69) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:247) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:76) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:66) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:966) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:541) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:516) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:155) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:363) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:155) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1217) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:126) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:196) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:167) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:163) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.tika.server.core.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:78) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.tika.server.core.resource.RecursiveMetadataResource.parseMetadataToMetadataList(RecursiveMetadataResource.java:190) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.tika.server.core.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:179) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
    at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
    at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:201) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:104) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:223) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1384) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:178) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1306) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:149) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.eclipse.jetty.server.Server.handle(Server.java:563) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.eclipse.jetty.server.HttpChannel$RequestDispatchable.dispatch(HttpChannel.java:1598) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:753) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:501) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:287) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:314) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:421) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:390) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:277) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.run(AdaptiveExecutionStrategy.java:199) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:411) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:969) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1194) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1149) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
    at java.lang.Thread.run(Thread.java:829) ~[?:?]
should draw image...
should draw image...
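A trace this long is hard to scan, so here is a small sketch that reduces a log like the one above to one entry per ERROR line: the logger, the message, the first exception line, and a count of stack frames. The line pattern is inferred only from the excerpt above (level, thread in brackets, time, logger, message) and is an assumption about this server's log layout. The function name is hypothetical.

```python
import re

# Pattern inferred from the excerpt above; other logging configs may differ.
LOG_LINE = re.compile(
    r"^(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2},\d{3})\s+(?P<logger>\S+)\s+(?P<message>.*)$"
)


def summarize_errors(log_text):
    """Return one dict per ERROR entry found in log_text."""
    errors = []
    current = None  # the ERROR entry that continuation lines belong to
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match:
            # A new log record ends any trace that was being collected.
            current = None
            if match.group("level") == "ERROR":
                current = {
                    "time": match.group("time"),
                    "logger": match.group("logger"),
                    "message": match.group("message"),
                    "exception": None,
                    "frames": 0,
                }
                errors.append(current)
        elif current is not None:
            stripped = line.strip()
            if stripped.startswith("at "):
                current["frames"] += 1
            elif stripped and current["exception"] is None:
                # First non-frame line after the ERROR record.
                current["exception"] = stripped
    return errors


if __name__ == "__main__":
    import sys

    # Usage: python summarize_errors.py < tika.log
    for entry in summarize_errors(sys.stdin.read()):
        print(entry["time"], entry["logger"])
        print("   ", entry["message"])
        print("   ", entry["exception"], f"({entry['frames']} frames)")
```

Run against the excerpt above, this should report a single ERROR from org.apache.pdfbox.pdmodel.font.PDType1Font with the exception java.io.IOException: Start marker missing. Unstructured lines such as "should draw image..." are ignored because they match neither the record pattern nor a stack frame.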

jamesvillarrubia commented 5 months ago

I attempted to resolve this with an updated Tika .jar file. See the build here:

https://github.com/nlmatics/nlm-tika/pull/5