kermitt2 / grobid_client_python

Python client for GROBID Web services
Apache License 2.0

Error #74

Open NeoH2333 opened 5 months ago

NeoH2333 commented 5 months ago

Hello

I trust you are all well. I've been encountering an error for the past few days while attempting to process full text from a batch using the Python client. Despite my efforts, the error persists. My system specifications include an Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz with 8GB RAM. I've tried adjusting parameters such as concurrency in the grobid.yaml file, but unfortunately, this hasn't resolved the issue. I'm reaching out to see if there are any additional steps I can take to address this problem. Thank you for your assistance.

ERROR [2024-04-13 20:31:33,322] org.grobid.service.process.GrobidRestProcessFiles: Could not get an engine from the pool within configured time. Sending service unavailable.
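[Editorial note: the error above means the GROBID server's pool of engines was exhausted before a slot freed up. The number of engines in that pool is controlled by the `concurrency` setting in the server's grobid.yaml. A minimal sketch of the relevant fragment, assuming the standard GROBID server config layout (the value 10 is the usual default, not taken from this thread):]

```yaml
grobid:
  # Maximum number of engine instances the server keeps in its pool;
  # requests beyond this wait, and time out with "service unavailable".
  concurrency: 10
```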

lfoppiano commented 5 months ago

Hi @NeoH2333, the default client config.json uses a batch_size of 100, which is too big. This number should be consistent with the concurrency setting in grobid.yaml.

If this does not solve the problem, could you share more information, including both config.json and grobid.yaml files?
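[Editorial note: for reference, the client's config.json typically looks like the sketch below. The field names follow the example config shipped with grobid_client_python; treat the exact values as illustrative rather than recommended for this machine:]

```json
{
  "grobid_server": "http://localhost:8070",
  "batch_size": 100,
  "sleep_time": 5,
  "timeout": 60,
  "coordinates": ["persName", "figure", "ref", "biblStruct", "formula", "s"]
}
```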

kermitt2 commented 5 months ago

Hello !

@NeoH2333 8GB is not enough to safely apply processFulltextDocument to more than one PDF at a time, especially if you are using Deep Learning models on CPU only. Consider using 16GB if possible. Otherwise, set the client's --n argument to 1.
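[Editorial note: a sketch of the suggested invocation, assuming the standard grobid_client CLI and illustrative input/output paths:]

```shell
# Limit the client to a single concurrent request (--n 1)
grobid_client --input ./pdfs --output ./out --n 1 processFulltextDocument
```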

@lfoppiano batch_size only manages how files are acquired by the ThreadPoolExecutor; it is not related to server load or to the concurrency setting in grobid.yaml. It can stay at 100 or 1000 without any impact on the server (the client will just use a bit more memory to store the list of paths to the PDFs).
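[Editorial note: the distinction above can be sketched in a few lines of Python. This is a simplified illustration, not the client's actual code: `batch_size` only bounds how many paths are queued per round, while `max_workers` (the client's `--n`) is what caps concurrent requests to the server.]

```python
from concurrent.futures import ThreadPoolExecutor

def process_all(paths, batch_size, n, worker):
    """Process every path with at most n concurrent workers.

    batch_size only controls how many paths are held in the queue
    per round (client-side memory); it never raises server load.
    """
    results = []
    with ThreadPoolExecutor(max_workers=n) as pool:
        for start in range(0, len(paths), batch_size):
            batch = paths[start:start + batch_size]
            # pool.map never runs more than n workers at once,
            # regardless of how large the batch is.
            results.extend(pool.map(worker, batch))
    return results
```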

lfoppiano commented 5 months ago

Ahh, sorry, indeed the batch_size does not impact the number of concurrent requests... 🙏 @NeoH2333 please ignore my comment.