Closed: bryanyzhu closed this issue 1 year ago
Hi @bryanyzhu
You can have a look at the Grobid documentation about changing the concurrency: https://grobid.readthedocs.io/en/latest/Troubleshooting/#production-configuration
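As a pointer, server-side concurrency is controlled in Grobid's configuration file. A sketch of the relevant fragment, assuming a recent Grobid version (check your own `grobid.yaml` for the exact key names):

```yaml
grobid:
  # maximum number of documents processed in parallel by the server
  concurrency: 10
```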
Are you using a GPU and the deep learning models? The GPU might become a bottleneck at some point.
The 408 errors usually come from the PDF parsing step. With too many PDFs being parsed at the same time, each parse becomes slower and more parsing timeouts fire. Could it be that your PDF documents are too big to take advantage of concurrency?
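One common way to absorb occasional 408s on the client side is to retry with backoff. A minimal sketch, with hypothetical helper names (this is not part of grobid_client_python; `send_fn` stands in for whatever performs the HTTP call):

```python
import time

# Statuses worth retrying: 408 (request timeout) and 503 (Grobid's
# "server queue is full" answer).
RETRYABLE = {408, 503}

def backoff_delay(attempt, base=2.0):
    """Exponential backoff: base, 2*base, 4*base, ... seconds."""
    return base * (2 ** attempt)

def process_with_retries(send_fn, max_retries=3, base=2.0):
    """Call send_fn() (which returns an HTTP status code and a payload)
    and retry on retryable statuses, sleeping between attempts."""
    for attempt in range(max_retries + 1):
        status, payload = send_fn()
        if status not in RETRYABLE:
            return status, payload
        if attempt < max_retries:
            time.sleep(backoff_delay(attempt, base))
    return status, payload
```

Raising the client timeout (see below in the thread) and retrying are complementary: the first avoids spurious 408s on big documents, the second absorbs transient ones under load.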
Thanks @kermitt2, I think you are right. My pdf documents might be too big. I will set the timeout a bit larger to see if it alleviates this issue.
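For what it's worth, grobid_client_python reads its settings from a `config.json`; a sketch of raising the client-side timeout (field names and values here are illustrative and may differ across client versions):

```json
{
  "grobid_server": "http://localhost:8070",
  "batch_size": 1000,
  "sleep_time": 5,
  "timeout": 180,
  "coordinates": ["persName", "figure", "ref", "biblStruct", "formula"]
}
```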
Another thing is, I started the grobid server with `./gradlew clean install` and `./gradlew run`. I didn't use a container, nor did I specify GPU usage. But I saw that grobid called the GPU. Does Grobid detect the GPU first and then decide which model to use? Thank you.
My pdf documents might be too big.
Just as a comment, grobid really targets scholarly publications like articles and chapters. It does not work well on monographs (books, full conference proceedings, theses) because there is no model ready at this level for the moment (to segment into chapters, etc.) and no training data for these objects. If you have large PDFs, the performance will likely not be very good.
But I saw that grobid called the GPU. Does Grobid detect GPU first, then decide to use which model?
No, normally it does not use the GPU by default (only CRF models on CPU). The models to be used are defined in the config file. The "Deep Learning ready" Docker image, on the contrary, has everything needed to detect a GPU and run the DL models on it. The configuration of this full Docker image is already preset to use the 3-4 best DL models, either on CPU or on GPU if one is detected.
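If you want a quick check of whether a GPU is even visible on the machine, a rough sketch (a hypothetical helper; Grobid itself does the detection through its deep learning library, not through code like this, so this is only a coarse proxy):

```python
import shutil
import subprocess

def nvidia_gpu_visible():
    """Return True if the NVIDIA driver tools are installed and report
    at least one GPU. A backend like TensorFlow or PyTorch may still
    fail to use the GPU (e.g. CUDA version mismatch), so treat this as
    a first-pass check only."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        out = subprocess.run(
            ["nvidia-smi", "--list-gpus"],
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return out.returncode == 0 and "GPU" in out.stdout
```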
The DL demo on HuggingFace is actually running only with CPU and it's not that slow (to my surprise): https://huggingface.co/spaces/kermitt2/grobid
Thanks a lot for your response, it really helps! I will close the issue.
Hi, I was using the `processFulltextDocument` service of `grobid_client_python` on some pdfs. I found something weird about the concurrency number: I thought that with higher concurrency, the processing time should decrease, but in my case, when I increase the concurrency from 10 to 20, the processing time increases. My machine has a 12-core CPU and 128G of memory, and I wasn't running anything else at the same time.

Another concern I have is that when I increase the concurrency from 10 to 20, I receive more 408 errors (7 errors goes up to 25 errors). I thought concurrency only affected speed, but it seems it also affects the processing outcome. @kermitt2 Can anyone share some insights on this? Thank you.
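The slowdown described above can be illustrated with a toy model (purely illustrative numbers and formula, not Grobid's actual behavior): once the worker count exceeds the number of CPU cores, each request takes longer, and past some point the added contention outweighs the added parallelism.

```python
import math

def total_time(n_docs, workers, cores=12, base_secs=10.0, slowdown=0.15):
    """Toy model: per-document time grows linearly once the number of
    concurrent workers exceeds the number of cores; the batch is
    processed in ceil(n_docs / workers) waves of parallel requests."""
    contention = max(0, workers - cores)
    per_doc = base_secs * (1 + slowdown * contention)
    waves = math.ceil(n_docs / workers)
    return waves * per_doc
```

With these assumed parameters and 100 documents on a 12-core machine, 20 workers come out slower end-to-end than 10 workers (roughly 110s vs 100s), and the longer per-document times also explain why more requests cross the 408 timeout threshold.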