When I try to index PDFs on my development machine, I get frequent failures because the grobid service is overwhelmed by the number of processes allocated for ingesting the PDFs:
ERROR [2023-12-03 11:57:31,564] org.grobid.service.process.GrobidRestProcessFiles: Could not get an engine from the pool within configured time. Sending service unavailable.
The default concurrency setting in grobid is 10. On my machine os.cpu_count() returns 16, so we are creating more processes than there are available engines in the grobid pool.
Whilst this is not an issue in paperetl itself, I think anyone for whom os.cpu_count() returns > 10 will hit this issue. The impact could be mitigated by adding a note to the documentation suggesting users increase the default concurrency limit in grobid if they hit this error: https://grobid.readthedocs.io/en/latest/Configuration/#service-configuration
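For reference, a hedged sketch of what such a note could point to: raising the `concurrency` value in grobid's service configuration (grobid-home/config/grobid.yaml) so the engine pool is at least as large as the number of ingest processes. The exact key placement below is an assumption based on the linked docs and may differ between grobid versions:

```yaml
grobid:
  # Maximum number of parallel engines in the pool (default 10).
  # Setting this to >= the number of client processes (e.g. os.cpu_count())
  # avoids "Could not get an engine from the pool" errors under load.
  concurrency: 16
```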
I am happy to create a PR for this if you agree.