kermitt2 / grobid_client_python

Python client for GROBID Web services
Apache License 2.0

Confused about concurrency #61

Closed bryanyzhu closed 1 year ago

bryanyzhu commented 1 year ago

Hi, I was using the processFulltextDocument service of grobid_client_python on some PDFs, and I noticed something odd about the concurrency setting. I expected that higher concurrency would reduce processing time, but when I increase concurrency from 10 to 20, the processing time increases. My machine has a 12-core CPU and 128 GB of memory, and I wasn't running anything else at the same time.

On 770 pdf documents
Concurrency = 5: 3490.75 seconds
Concurrency = 10: 1528.22 seconds
Concurrency = 20: 2232.55 seconds

Another concern: when I increase concurrency from 10 to 20, I receive more 408 errors (from 7 errors up to 25). I thought concurrency only affected speed, but it seems to also affect the processing outcome. @kermitt2 Can anyone share some insights on this? Thank you.
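For what it's worth, converting the timings above into throughput makes the pattern easier to see (a quick sketch using only the numbers reported above):

```python
# Reported results: 770 PDFs processed at three concurrency levels.
timings = {5: 3490.75, 10: 1528.22, 20: 2232.55}
docs = 770

for n, seconds in sorted(timings.items()):
    throughput = docs / seconds  # documents per second
    print(f"concurrency={n:2d}: {throughput:.3f} docs/s")

# Throughput peaks at concurrency 10 and drops again at 20 -- the classic
# sign of saturating the server (12 cores here), not a client-side issue.
```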

kermitt2 commented 1 year ago

Hi @bryanyzhu

You can have a look at the Grobid documentation about changing the concurrency: https://grobid.readthedocs.io/en/latest/Troubleshooting/#production-configuration

Are you using a GPU and the deep learning models? The GPU might become a bottleneck at some point.

408 usually comes from the PDF parsing part. With too many PDFs being parsed at the same time, each parse becomes slower and more of them hit the timeout. Could it be that your PDF documents are too big to take advantage of concurrency?
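On the client side, one generic mitigation for sporadic 408s is to retry the request with a backoff so the server can drain its queue. This is not part of grobid_client_python itself, just an illustrative sketch; the `(status, payload)` pair loosely mirrors the shape of the client's per-document results:

```python
import time

def retry_on_timeout(call, retries=3, delay=5.0):
    """Invoke `call()` and retry whenever it reports HTTP 408 (request timeout).

    `call` is any zero-argument function returning a (status_code, payload) pair.
    """
    for attempt in range(retries):
        status, payload = call()
        if status != 408:
            return status, payload
        # Back off before retrying so the server can catch up.
        time.sleep(delay * (attempt + 1))
    return status, payload

# Tiny self-contained demonstration with a fake service that times out twice.
responses = iter([(408, None), (408, None), (200, "<TEI>...</TEI>")])
status, payload = retry_on_timeout(lambda: next(responses), retries=3, delay=0.0)
```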

bryanyzhu commented 1 year ago

Thanks, @kermitt2, I think you are right. My PDF documents might be too big. I will set the timeout a bit larger to see if that alleviates the issue.
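For reference, the client-side timeout lives in the client's config.json. The fragment below is an illustrative sketch; exact field names and defaults may differ across versions of grobid_client_python, so check it against your copy:

```json
{
    "grobid_server": "http://localhost:8070",
    "batch_size": 1000,
    "sleep_time": 5,
    "timeout": 180,
    "coordinates": ["persName", "figure", "ref", "biblStruct", "formula", "s"]
}
```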

Another thing: I started the grobid server with `./gradlew clean install` and `./gradlew run`. I didn't use a container, nor did I specify GPU usage, but I saw that grobid used the GPU. Does Grobid detect the GPU first and then decide which model to use? Thank you.

kermitt2 commented 1 year ago

My pdf documents might be too big.

Just as a comment: grobid really targets scholarly publications such as articles and chapters. It does not work well on monographs (books, full conference proceedings, theses) because there is no model ready at that level for the moment (to segment into chapters, etc.) and no training data for these objects. So if you have large PDFs, the results may not be very good.

But I saw that grobid called the GPU. Does Grobid detect GPU first, then decide to use which model?

No, normally it does not use the GPU by default (only CRF models on CPU). The models to be used are defined in the config file. The "Deep Learning ready" docker image, on the contrary, has everything needed to detect a GPU and run the DL models on it. The configuration of that full docker image is already preset to use the 3-4 best DL models, either on CPU or on GPU if one is detected.
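For context, model selection in recent GROBID versions is configured per model in the server's grobid.yaml, roughly along these lines. This is an illustrative fragment, not an exact copy; check grobid-home/config/grobid.yaml in your version for the actual field names:

```yaml
grobid:
  models:
    - name: "segmentation"
      engine: "wapiti"        # CRF on CPU (the default)
    - name: "citation"
      engine: "delft"         # deep learning model; runs on GPU if one is available
      delft:
        architecture: "BidLSTM_CRF"
```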

The DL demo on HuggingFace is actually running only with CPU and it's not that slow (to my surprise): https://huggingface.co/spaces/kermitt2/grobid

bryanyzhu commented 1 year ago

Thanks a lot for your response, it really helps! I will close the issue.