kermitt2 / grobid_client_python

Python client for GROBID Web services
Apache License 2.0

Passing files directly into Grobid without downloading #47

Open matthieu-perso opened 2 years ago

matthieu-perso commented 2 years ago

Hey Grobid team,

Thanks again for these incredible tools. I've been testing the Python client and ran into an issue when passing a single PDF as the input, both from the CLI and from Python: I didn't receive any output.

Sample command:

grobid_client --input ./resource/my.PDF --output ./out processFulltextDocument

While debugging, I realized that L122 of grobid_client.py expects a directory rather than the file itself, as in the request below.

grobid_client --input ./resource/mypdfdir --output ./out processFulltextDocument

On GCP, I was trying to pass files directly to Grobid without downloading them first, which the current setup requires. Is there any way to stream PDFs to Grobid, or to send them as file objects? If not, I'll see if I can pull something together quickly and test it.

kermitt2 commented 2 years ago

Hi @MatthieuMoullecDev !

This client does indeed take a directory as input/output, as documented, because it is aimed at batch processing of many files.

For me this client is a basis that can be adapted to different usage scenarios, so I tried to keep it simple, with zero external dependencies. You can use the client as a package and then call process_batch() or process_pdf(), whichever is more convenient for your set of files and pipeline.
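As an illustration of such an adaptation (not part of the client itself), here is a minimal sketch that posts in-memory PDF bytes straight to the Grobid REST service, so nothing has to be written to disk. It assumes a Grobid server at http://localhost:8070 and uses the third-party requests library; the function name send_pdf_bytes and the post parameter are hypothetical and only there to make the sketch easy to test.

```python
import io


def send_pdf_bytes(pdf_bytes, server="http://localhost:8070", post=None, timeout=60):
    """Post in-memory PDF bytes to Grobid and return the TEI XML as text.

    `server` is an assumed default; `post` is a hypothetical hook that lets
    a caller swap in another HTTP function (for example in tests).
    """
    if post is None:
        # Third-party dependency, imported lazily so the module loads without it.
        import requests
        post = requests.post

    url = server.rstrip("/") + "/api/processFulltextDocument"
    # Grobid reads the PDF from the multipart form field named "input".
    files = {"input": ("document.pdf", io.BytesIO(pdf_bytes), "application/pdf")}
    response = post(url, files=files, timeout=timeout)
    response.raise_for_status()
    return response.text
```

The bytes could come from any source that yields a complete file in memory, such as a cloud storage download. Note that the whole file still has to reach the server before processing begins.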

You can probably start sending a file to the Grobid server while it is still downloading, but Grobid will only start processing a file once it is entirely uploaded (for stability/robustness and technical reasons). So the easiest approach for your scenario is probably to download a file, submit it to an executor, and then delete the file when the result is ready.
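The download, process, delete pattern described above could be sketched roughly as follows. This is only an illustration using the standard library: fetch and process are hypothetical callables supplied by the caller (for example, a storage download and a call into the Grobid client), and process_remote_pdf and process_many are made-up names.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def process_remote_pdf(fetch, process, key):
    """Download one PDF to a temporary file, process it, then delete the file.

    `fetch(key)` must return the PDF as bytes and `process(path)` must return
    the result for the file at `path`; both are hypothetical hooks.
    """
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(fetch(key))
        return process(path)
    finally:
        # Remove the temporary file whether processing succeeded or failed.
        os.remove(path)


def process_many(fetch, process, keys, workers=4):
    """Run process_remote_pdf for every key on a thread pool; return {key: result}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            key: pool.submit(process_remote_pdf, fetch, process, key)
            for key in keys
        }
        return {key: future.result() for key, future in futures.items()}
```

Because each worker deletes its own temporary file, disk usage stays bounded by the number of workers rather than the size of the whole collection.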

In my experience, if no citation consolidation is used, Grobid processes a file faster than it takes to download a typical Unpaywall file.