Passing files directly into Grobid without downloading

kermitt2 / grobid_client_python

Python client for GROBID Web services

Apache License 2.0


Hey Grobid team,

Thanks again for these incredible tools. I've been testing the Python client and ran into an issue when passing a single PDF as the input argument, both from the CLI and from Python: I didn't receive any output.

Sample command:

grobid_client --input ./resource/my.PDF --output ./out processFulltextDocument

While debugging, I realized that L122 of grobid_client.py expects a directory rather than the file itself, as in the request below.

grobid_client --input ./resource/mypdfdir --output ./out processFulltextDocument

On GCP, I was trying to pass files directly to Grobid without downloading them first, which the current setup would require. Is there any way to stream PDFs to Grobid, or to send them as file objects? If not, I'll see if I can pull something together quickly and test it.

Hi @MatthieuMoullecDev !

This client does indeed take a directory as input/output, as documented, because it is aimed at batch processing of many files.

For me this client is a basis that can be adapted to different usage scenarios, so I tried to keep it simple, with zero external dependencies. You can use the client as a package and then call process_batch() or process_pdf(), whichever is more convenient for your set of files and pipeline.
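As one possible adaptation for the "file object" case, here is a minimal sketch that skips the client and posts an in-memory PDF straight to the Grobid REST service, using only the standard library. It assumes a server at http://localhost:8070 and the multipart field name `input` used by the Grobid web API; `build_multipart` and `process_pdf_object` are hypothetical helper names, not part of this client.

```python
import urllib.request
import uuid


def build_multipart(field, filename, data, content_type="application/pdf"):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def process_pdf_object(fileobj, server="http://localhost:8070",
                       service="processFulltextDocument", timeout=120):
    """POST a binary file-like object to a Grobid service; return the response text.

    Assumption: the server address and timeout are placeholders to adjust.
    """
    # The whole file is read into memory before sending.
    body, content_type = build_multipart("input", "document.pdf", fileobj.read())
    request = urllib.request.Request(
        f"{server}/api/{service}",
        data=body,
        headers={"Content-Type": content_type, "Accept": "application/xml"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Any object with a read() method that returns bytes should work here, for example `process_pdf_object(io.BytesIO(pdf_bytes))`. Nothing is written to local disk, but the full PDF still has to be held in memory and uploaded before Grobid does any work.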

You can probably start sending a file to the Grobid server while it is still downloading, but Grobid will only start processing a file once it is entirely uploaded (for stability/robustness and technical reasons). So the easiest approach for your scenario is probably to download a file, add it to an executor, and then delete the file when the result is ready.
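The download, process, delete loop described above could look roughly like the sketch below. It is not code from this client: `download_to_temp`, `handle_one` and `run_pipeline` are hypothetical names, and `process` stands for whatever callable sends a local PDF path to Grobid and returns the result.

```python
import os
import tempfile
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_to_temp(url, timeout=60):
    """Download url into a temporary .pdf file and return its path."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as handle:
            handle.write(response.read())
            return handle.name


def handle_one(url, process, fetch=download_to_temp):
    """Fetch one PDF, run process(path) on it, then always delete the local copy."""
    path = fetch(url)
    try:
        return process(path)
    finally:
        os.remove(path)


def run_pipeline(urls, process, fetch=download_to_temp, max_workers=4):
    """Handle many URLs concurrently; returns {url: result or raised exception}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(handle_one, url, process, fetch): url
                   for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as error:  # one failed document should not stop the rest
                results[url] = error
    return results
```

Because each file is removed as soon as its result comes back, local disk usage stays bounded by roughly `max_workers` PDFs at a time. The value 4 is only a placeholder, to be tuned to the capacity of the Grobid server.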

In my experience, if citation consolidation is not used, Grobid processes a file faster than it takes to download a typical Unpaywall file.

Passing files directly into Grobid without downloading #47