kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

grobid does not return anything #1134

Open naarkhoo opened 6 days ago

naarkhoo commented 6 days ago

I am using grobid through langchain and have observed some odd behavior. I hope you have access to the following papers: pubmed.ncbi.nlm.nih.gov/8440333 and pubmed.ncbi.nlm.nih.gov/18628819. For some reason, if I use

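from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import GrobidParser

# Load every non-hidden PDF under the folder and parse it with Grobid
loader = GenericLoader.from_filesystem(
    path='/Users/alka/Devel/LiteGrave/data/all/8440333/',
    suffixes=[".pdf"],
    glob="**/[!.]*",
    parser=GrobidParser(segment_sentences=True),
    show_progress=True,
)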

documents = loader.load()

it does not return anything, but it works with PyPDFParser:

from langchain.document_loaders.parsers.pdf import PyPDFParser

loader = GenericLoader.from_filesystem(
    path='/Users/alka/Devel/LiteGrave/data/all/8440333/',
    glob="**/*.pdf",
    parser=PyPDFParser(),
)

I wonder what the underlying reason could be?

lfoppiano commented 6 days ago

Hi @naarkhoo, the default parameters of the langchain parser assume that you're running Grobid locally at localhost:8070. See: https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.parsers.grobid.GrobidParser.html

If this is the case, then to investigate further we would need to see the Grobid logs. If it's not the case, you should follow the instructions at https://python.langchain.com/v0.2/docs/integrations/document_loaders/grobid/

The best approach is to install Grobid via docker, see https://grobid.readthedocs.io/en/latest/Grobid-docker/.
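
A minimal sketch for confirming that the server is reachable before involving langchain, assuming Grobid listens on the default localhost:8070 and answers its /api/isalive health check. The function names are illustrative and not part of Grobid or langchain:

```python
import urllib.request


def isalive_url(base_url="http://localhost:8070"):
    """Build the URL of Grobid's health-check endpoint."""
    return base_url.rstrip("/") + "/api/isalive"


def grobid_is_alive(base_url="http://localhost:8070", timeout=5):
    """Return True if the server answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(isalive_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, refused connections and timeouts all derive from OSError.
        return False
```

Calling grobid_is_alive() with no arguments checks the default address; pass another base URL if the container is mapped to a different host or port.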

(Note: additional instructions can be found [here](https://python.langchain.com/v0.2/docs/integrations/providers/grobid/).)

Once grobid is up-and-running you can interact as described below.

naarkhoo commented 6 days ago

Thanks for your response. I do have the Grobid server running through docker in the background and can parse other PDF files, but not these two specific ones. I can share the log if needed.

lfoppiano commented 6 days ago

Hi @naarkhoo, I cannot access the document, so for the moment, could you please share the log here?

naarkhoo commented 5 days ago

Sure, thanks for asking.

Here is the log for 8440333:

ERROR [2024-06-26 07:18:39,587] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs. 
! org.grobid.core.exceptions.GrobidException: [NO_BLOCKS] PDF parsing resulted in empty content
! at org.grobid.core.document.Document.addTokenizedDocument(Document.java:417)
! at org.grobid.core.engines.Segmentation.processing(Segmentation.java:95)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:150)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:119)
! at org.grobid.core.engines.Engine.fullTextToTEIDoc(Engine.java:587)
! at org.grobid.core.engines.Engine.fullTextToTEI(Engine.java:577)
! at org.grobid.service.process.GrobidRestProcessFiles.processFulltextDocument(GrobidRestProcessFiles.java:290)
! at org.grobid.service.GrobidRestService.processFulltext(GrobidRestService.java:291)
! at org.grobid.service.GrobidRestService.processFulltextDocument_post(GrobidRestService.java:240)
! at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
! at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
! at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
! at java.base/java.lang.reflect.Method.invoke(Method.java:568)
! at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:134)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:177)
! at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:176)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:81)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81)
! at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:256)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
! at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
! at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:235)
! at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:684)
! at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:394)
! at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:358)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:311)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
! at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:764)
! at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1665)
! at io.dropwizard.servlets.ThreadNameFilter.doFilter(ThreadNameFilter.java:36)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.handle(AllowedMethodsFilter.java:46)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.doFilter(AllowedMethodsFilter.java:40)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:313)
! at org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:267)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
! at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:121)
! at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:133)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:527)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221)
! at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1382)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176)
! at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:484)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174)
! at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1304)
! at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at io.dropwizard.metrics.jetty11.InstrumentedHandler.handle(InstrumentedHandler.java:307)
! at io.dropwizard.jetty.RoutingHandler.handle(RoutingHandler.java:52)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:822)
! at io.dropwizard.jetty.ZipExceptionHandlingGzipHandler.handle(ZipExceptionHandlingGzipHandler.java:26)
! at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:173)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at org.eclipse.jetty.server.Server.handle(Server.java:563)
! at org.eclipse.jetty.server.HttpChannel.lambda$handle$0(HttpChannel.java:505)
! at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:762)
! at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:497)
! at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:282)
! at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:314)
! at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100)
! at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
! at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:416)
! at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:385)
! at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:272)
! at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.lambda$new$0(AdaptiveExecutionStrategy.java:140)
! at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:411)
! at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:936)
! at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1080)
! at java.base/java.lang.Thread.run(Thread.java:833)

I have also attached the PDF files:

18628819.pdf 8440333.pdf

lfoppiano commented 5 days ago

Thanks. I checked them and:

naarkhoo commented 5 days ago

Thank you for looking into them.

So you mean Grobid doesn't have an OCR engine and is only a layout parser?
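
A rough way to check whether a PDF carries an embedded text layer at all, on the assumption that the NO_BLOCKS error above comes from a scanned document with no extractable text. This sketch uses the third-party pypdf package; the helper names and the 20-character threshold are arbitrary choices for illustration:

```python
def page_texts(pdf_path):
    """Return the text that pypdf extracts from each page of the PDF."""
    # Imported lazily so that has_text_layer() can be used without pypdf installed.
    from pypdf import PdfReader

    return [page.extract_text() or "" for page in PdfReader(pdf_path).pages]


def has_text_layer(texts, min_chars=20):
    """Heuristic: True if the pages hold at least min_chars non-whitespace characters."""
    total = sum(len("".join(text.split())) for text in texts)
    return total >= min_chars
```

For example, has_text_layer(page_texts("8440333.pdf")) returning False would suggest the file is image-only and would need OCR before any text-based parser can use it.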

Interesting that you say 18628819 works. I am running it through langchain and it does not return any output, so there must be some issue within langchain then.

loader = GenericLoader.from_filesystem(
    path='/data/all/18628819/',
    suffixes=[".pdf"],
    glob="**/[!.]*",
    parser=GrobidParser(segment_sentences=True),
    show_progress=True,
)

documents = loader.load()

I can open an issue on their repo and refer to this conversation.

lfoppiano commented 3 days ago

@naarkhoo One possibility is that you hit the timeout. Could you please confirm that you are not getting any error message from langchain?

Something like: GROBID server timed out. Return None.?
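
To tell a client-side timeout apart from a server-side failure, one option is to send the PDF to Grobid directly, bypassing langchain, and record the status code and elapsed time. A minimal sketch using only the standard library, assuming the server is at localhost:8070 and that the file goes in the `input` form field of /api/processFulltextDocument; the function names and the 300-second timeout are illustrative:

```python
import os
import time
import urllib.error
import urllib.request
import uuid


def build_multipart(field, filename, payload):
    """Encode one file as a multipart/form-data body; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def post_pdf(pdf_path, base_url="http://localhost:8070", timeout=300):
    """POST a PDF to Grobid and return (status_code, seconds_elapsed, response_text)."""
    with open(pdf_path, "rb") as handle:
        payload = handle.read()
    body, content_type = build_multipart("input", os.path.basename(pdf_path), payload)
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/processFulltextDocument",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    start = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            status, text = resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx answers arrive as exceptions; keep their status and body.
        status, text = err.code, err.read().decode("utf-8", "replace")
    return status, time.monotonic() - start, text
```

A call such as post_pdf("18628819.pdf") that returns TEI XML only after a long wait would point to the client timeout, while an error status would point back to the server logs.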