Open thelazydogsback opened 7 months ago
Any progress on this issue? We have the same problem.
I've run it locally and stepped through the code. The OCR step seems to be returning an empty HTML body, and when I look at the Tika logs, the Tika server is throwing an error while attempting the OCR. It may be related:
WARN [qtp487764004-32] 21:51:48,577 org.eclipse.jetty.server.handler.ContextHandler Unimplemented getRequestCharacterEncoding() - use org.eclipse.jetty.servlet.ServletContextHandler
INFO [qtp487764004-32] 21:51:48,583 org.apache.tika.server.core.resource.RecursiveMetadataResource /rmeta (autodetecting type)
ERROR [qtp487764004-32] 21:51:48,637 org.apache.pdfbox.pdmodel.font.PDType1Font Can't read the embedded Type1 font AAAAAB+Helvetica
java.io.IOException: Start marker missing
at org.apache.fontbox.pfb.PfbParser.parsePfb(PfbParser.java:147) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.fontbox.pfb.PfbParser.<init>(PfbParser.java:125) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.fontbox.type1.Type1Font.createWithPFB(Type1Font.java:69) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:247) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:76) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:66) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:966) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:541) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:516) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:155) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:363) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:155) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1217) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:126) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:196) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:167) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:163) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.tika.server.core.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:78) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.tika.server.core.resource.RecursiveMetadataResource.parseMetadataToMetadataList(RecursiveMetadataResource.java:190) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.tika.server.core.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:179) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:201) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:104) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:223) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1384) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:178) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1306) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:149) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.Server.handle(Server.java:563) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.HttpChannel$RequestDispatchable.dispatch(HttpChannel.java:1598) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:753) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:501) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:287) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:314) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:421) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:390) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:277) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.run(AdaptiveExecutionStrategy.java:199) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:411) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:969) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1194) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1149) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
should draw image...
should draw image...
Attempted to resolve with updated Tika .jar file. See build here:
I'm using &applyOcr=yes, but there's no indication that any OCR is taking place. I'm getting the HTML for the PDF text back fine, but pages in my PDFs that are images of (clear) text are skipped completely. I'm using the latest Docker image from the notebook. Thanks.
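One way to narrow this down might be to send the PDF straight to the Tika server and see whether OCR text comes back at all, bypassing the rest of the pipeline. The sketch below is only an illustration under a few assumptions: that the server is reachable at http://localhost:9998 (Tika's default port), that it exposes the same /rmeta endpoint seen in the log above, and that the modified jar still honors the stock X-Tika-PDFOcrStrategy request header. The helper names build_ocr_request, content_lengths and check_ocr are made up for this example.

```python
import json
import urllib.request


def build_ocr_request(pdf_bytes, base_url="http://localhost:9998"):
    """Build a PUT request to /rmeta that asks Tika to OCR the PDF pages.

    Assumption: the server honors the X-Tika-PDFOcrStrategy header.
    """
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/rmeta",
        data=pdf_bytes,
        method="PUT",
        headers={
            "Content-Type": "application/pdf",
            "Accept": "application/json",
            "X-Tika-PDFOcrStrategy": "ocr_only",
        },
    )


def content_lengths(rmeta_json):
    """Return the extracted-content length for each entry of an /rmeta response."""
    return [len(entry.get("X-TIKA:content") or "") for entry in rmeta_json]


def check_ocr(pdf_path, base_url="http://localhost:9998"):
    """Send one PDF to the Tika server and report how much content came back."""
    with open(pdf_path, "rb") as f:
        req = build_ocr_request(f.read(), base_url)
    with urllib.request.urlopen(req) as resp:
        return content_lengths(json.load(resp))
```

Calling check_ocr("scanned.pdf") (a placeholder file name) on a PDF that only contains page images should give a non-zero length if Tika's OCR is working. If it returns zeros, the problem is likely on the Tika/Tesseract side rather than in how applyOcr is passed along.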