Open naarkhoo opened 6 days ago
Hi @naarkhoo, the default parameters of the langchain parser assume that you're running Grobid locally at localhost:8070. See: https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.parsers.grobid.GrobidParser.html
If this is the case, then to investigate further we would need to see the Grobid logs. If it's not the case, you should follow the instructions at https://python.langchain.com/v0.2/docs/integrations/document_loaders/grobid/
The best approach is to install Grobid via Docker, see https://grobid.readthedocs.io/en/latest/Grobid-docker/.
(Note: additional instructions can be found [here](https://python.langchain.com/v0.2/docs/integrations/providers/grobid/).)
Once Grobid is up and running, you can interact with it as described below.
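Before involving langchain at all, it can help to confirm that the server is reachable at the address the parser expects. The following is a minimal sketch, assuming Grobid exposes its usual `/api/isalive` health check at the default `localhost:8070`; the function name `grobid_is_alive` is made up for this illustration and is not part of Grobid or langchain.

```python
import urllib.error
import urllib.request


def grobid_is_alive(base_url: str = "http://localhost:8070", timeout: float = 5.0) -> bool:
    """Return True if the Grobid service answers its health-check endpoint.

    Assumption: the service is mounted under <base_url>/api and the
    isalive endpoint answers with the plain-text body "true".
    """
    url = base_url.rstrip("/") + "/api/isalive"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace").strip().lower()
            return response.status == 200 and body == "true"
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, HTTP error, or timeout:
        # treat the server as not available.
        return False


if __name__ == "__main__":
    print("Grobid reachable:", grobid_is_alive())
```

If this prints `False` while the Docker container is running, the port mapping or the host name is the first thing to check.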
Thanks for your response. I do have the Grobid server running through Docker in the background and can parse other PDF files, but not these two specific ones. I can share the log if needed.
(Sent by email in reply to Luca Foppiano's comment above, Tue, Jun 25, 2024, 3:31 PM.)
Hi @naarkhoo, I cannot access the document, so for the moment, could you please share the log here?
Sure, thanks for asking. (Attached: 18628819.pdf)
Here is the log for 8440333:
ERROR [2024-06-26 07:18:39,587] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs.
! org.grobid.core.exceptions.GrobidException: [NO_BLOCKS] PDF parsing resulted in empty content
! at org.grobid.core.document.Document.addTokenizedDocument(Document.java:417)
! at org.grobid.core.engines.Segmentation.processing(Segmentation.java:95)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:150)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:119)
! at org.grobid.core.engines.Engine.fullTextToTEIDoc(Engine.java:587)
! at org.grobid.core.engines.Engine.fullTextToTEI(Engine.java:577)
! at org.grobid.service.process.GrobidRestProcessFiles.processFulltextDocument(GrobidRestProcessFiles.java:290)
! at org.grobid.service.GrobidRestService.processFulltext(GrobidRestService.java:291)
! at org.grobid.service.GrobidRestService.processFulltextDocument_post(GrobidRestService.java:240)
! at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
! at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
! at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
! at java.base/java.lang.reflect.Method.invoke(Method.java:568)
! at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:134)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:177)
! at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:176)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:81)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81)
! at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:256)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
! at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
! at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:235)
! at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:684)
! at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:394)
! at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:358)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:311)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
! at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:764)
! at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1665)
! at io.dropwizard.servlets.ThreadNameFilter.doFilter(ThreadNameFilter.java:36)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.handle(AllowedMethodsFilter.java:46)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.doFilter(AllowedMethodsFilter.java:40)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:313)
! at org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:267)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
! at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:121)
! at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:133)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:527)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221)
! at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1382)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176)
! at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:484)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174)
! at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1304)
! at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at io.dropwizard.metrics.jetty11.InstrumentedHandler.handle(InstrumentedHandler.java:307)
! at io.dropwizard.jetty.RoutingHandler.handle(RoutingHandler.java:52)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:822)
! at io.dropwizard.jetty.ZipExceptionHandlingGzipHandler.handle(ZipExceptionHandlingGzipHandler.java:26)
! at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:173)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at org.eclipse.jetty.server.Server.handle(Server.java:563)
! at org.eclipse.jetty.server.HttpChannel.lambda$handle$0(HttpChannel.java:505)
! at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:762)
! at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:497)
! at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:282)
! at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:314)
! at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100)
! at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
! at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:416)
! at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:385)
! at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:272)
! at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.lambda$new$0(AdaptiveExecutionStrategy.java:140)
! at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:411)
! at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:936)
! at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1080)
! at java.base/java.lang.Thread.run(Thread.java:833)
I also attached the PDF file
Thanks. I checked them and:
18629919.pdf works fine.
8440333.pdf is an image, so it's normal that there is no output, and the error message says exactly that: [NO_BLOCKS] PDF parsing resulted in empty content 😄
Maybe the langchain parser needs to handle these cases. Thank you for looking into them.
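As an illustration of what "handling these cases" could look like on the client side, here is a small sketch that turns a Grobid response into a readable diagnosis. It assumes that the `[NO_BLOCKS]` message seen in the server log also reaches the client in the HTTP response body, which this thread does not confirm; the function name and the returned labels are invented for the example.

```python
NO_BLOCKS_MARKER = "[NO_BLOCKS]"


def explain_grobid_response(status_code: int, body: str) -> str:
    """Map an HTTP status and body from Grobid to a short diagnosis label.

    Labels (hypothetical, chosen for this sketch):
      "ok"      - a non-empty result came back with HTTP 200
      "no_text" - Grobid found no text blocks (e.g. a scanned, image-only PDF)
      "empty"   - nothing came back and no known marker explains why
      "error"   - any other failure
    """
    text = body.strip()
    if status_code == 200 and text:
        return "ok"
    if NO_BLOCKS_MARKER in text:
        # The PDF has no extractable text layer; OCR would be needed first.
        return "no_text"
    if not text:
        return "empty"
    return "error"


if __name__ == "__main__":
    message = "[NO_BLOCKS] PDF parsing resulted in empty content"
    print(explain_grobid_response(500, message))
```

With a check like this, an image-only file such as 8440333.pdf would be reported as lacking a text layer rather than silently producing no documents.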
So you mean Grobid doesn't have an OCR engine and is only a layout parser?!
Interesting that you say 18629919 works. I am running it through langchain and it does not return any output, so there must be some issue within langchain.
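from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import GrobidParser

# Load every non-hidden PDF under the directory and parse it with Grobid.
loader = GenericLoader.from_filesystem(
    path="/data/all/18628819/",
    suffixes=[".pdf"],
    glob="**/[!.]*",
    parser=GrobidParser(segment_sentences=True),
    show_progress=True,
)
documents = loader.load()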
I can make an issue on their repo and refer to this conversation.
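To narrow down which files come back empty before filing that issue, a per-file check can help. This is only a sketch: `find_empty_results` and `parse_one` are hypothetical names, and `parse_one` stands for whatever callable parses a single PDF and returns a list of documents (for instance, a wrapper around the loader call above restricted to one file).

```python
from pathlib import Path
from typing import Callable, Iterable, List, Sequence


def find_empty_results(
    pdf_paths: Iterable[Path],
    parse_one: Callable[[Path], Sequence],
) -> List[Path]:
    """Return the paths for which the parser raised or produced no documents."""
    problematic: List[Path] = []
    for path in pdf_paths:
        try:
            docs = parse_one(path)
        except Exception as exc:  # report any parser failure, then keep going
            print(f"{path}: parser raised {exc!r}")
            problematic.append(path)
            continue
        if not docs:
            print(f"{path}: parser returned no documents")
            problematic.append(path)
    return problematic


if __name__ == "__main__":
    # Stand-in parser for demonstration: pretends every file is empty.
    pdfs = sorted(Path("/data/all/18628819/").glob("*.pdf"))
    print(find_empty_results(pdfs, lambda p: []))
```

Running the same list of files through two parsers (for example Grobid and a PyPDF-based one) and comparing the returned paths would show whether the empty output is specific to one parser.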
@naarkhoo One option may be that you hit the timeout. Could you please confirm that you are not getting any error message from langchain? Something like: "GROBID server timed out. Return None."?
I am using grobid through langchain and have observed a weird behavior. I hope you have the privilege to access the following papers:
pubmed.ncbi.nlm.nih.gov/8440333
pubmed.ncbi.nlm.nih.gov/18628819
For some reason, if I use it, it does not return anything, but it works through pypdfparser.
I wonder what could be the underlying reason?