neuml / paperetl

📄 ⚙️ ETL processes for medical and scientific papers
Apache License 2.0
352 stars 27 forks source link

Scaling to create a proccess per cpu core overwhelms grobid service #50

Closed elshimone closed 11 months ago

elshimone commented 11 months ago

When I try and index PDFs on my development machine, I get frequent failures because the grobid service is overwhelmed by the number of processes allocated for ingesting the PDFs:

ERROR [2023-12-03 11:57:31,564] org.grobid.service.process.GrobidRestProcessFiles: Could not get an engine from the pool within configured time. Sending service unavailable.

The default concurrency setting in grobid is 10. On my machine os.cpu_count() returns 16, so we are creating more processes that available engines in the grobid pool.

Whilst this is not an issue in paperetl itself, I think anyone for whom os.cpu_count() returns > 10 will hit this issue. The impact could be mitigated by adding a note to the documentation to suggest users increase the default concurrency limit in grobid if they this error. https://grobid.readthedocs.io/en/latest/Configuration/#service-configuration

I am happy to create a PR for this if you agree.

davidmezzetti commented 11 months ago

Happily would take a PR on this.

A brief note in this section would be great. https://github.com/neuml/paperetl#additional-dependencies