curtkohler opened this issue 6 years ago
Hello @curtkohler
Thanks for the issue!
Your issue made me realize that a change introduced when moving to Dropwizard leads to incorrect status codes in the service responses for PDF processing in versions 0.5.0 and 0.5.1 (I overlooked this change when merging; the text-processing services were not changed and are OK).
Normally GROBID sends a status 503 (service unavailable) when all its threads are in use, so that the client can wait a bit until a thread becomes available before re-sending the query. This is how the service scales while avoiding mega-queues of PDF queries at the server. The change turned this case into a runtime error, and the resulting accumulation of queries is very likely the cause of these problems.
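The intended client-side behavior on 503 can be sketched as follows (a minimal sketch: `post_with_backoff` and its parameters are hypothetical names, and the actual HTTP call is left to the caller as a zero-argument callable):

```python
import time

def post_with_backoff(send_request, max_retries=10, base_delay=1.0):
    """Call send_request() until it stops returning HTTP 503.

    send_request is any zero-argument callable returning an object
    with a .status_code attribute (e.g. a wrapper around requests.post).
    On 503 (all GROBID threads busy), wait and retry instead of
    piling more queries onto the server.
    """
    delay = base_delay
    for attempt in range(max_retries):
        response = send_request()
        if response.status_code != 503:
            return response
        time.sleep(delay)
        delay = min(delay * 2, 30.0)  # exponential backoff, capped at 30s
    return response  # still 503 after max_retries attempts
```

The backoff keeps the queue on the client side, which is exactly what lets the server stay responsive under load.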
I've corrected this in the current master version of GROBID. In addition, I completed the documentation at https://grobid.readthedocs.io/en/latest/Grobid-service/, describing the response status codes for each service, and added some more explanation here: https://grobid.readthedocs.io/en/latest/Grobid-service/#parallel-mode
In addition, I've written 3 clients that use the service in the foreseen scalable way:
Normally, on a machine with 8 CPUs, you should get good performance with the default settings of GROBID and of these clients; you don't need to increase the size of the thread pool or the max wait. 16GB of memory is enough to exploit all your available threads.
In the next weeks, I will perform larger-scale tests with these clients (with millions of PDFs) and will report whether everything is fine.
It's not "official", but I can describe a box setup that has been working OK for some time:
Single large worker host, running in a virtual machine: 30 cores, 2.1 GHz, 50 GByte RAM, slow spinning disk (not SSD).
Python workers GET (HTTP) PDFs from remote storage and POST them to the GROBID worker, then POST the XML response elsewhere, so that as little as possible hits the local disk. We run 50 python worker processes (synchronous, single-threaded; there is plenty of RAM to do so). We are not using the (new) python library, just requests.
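A minimal sketch of one such worker step, assuming GROBID's default port 8070 and its real `processFulltextDocument` endpoint (the storage URLs and the helper name are hypothetical; the `http` argument is injectable so the `requests` module can be passed in, or a fake for testing):

```python
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"  # assumption: default port

def process_one(pdf_url, result_url, http=None):
    """Fetch a PDF, run it through GROBID, store the TEI XML.

    `http` is any object exposing requests-style get/post; if None,
    the requests module itself is used.
    """
    if http is None:
        import requests as http

    pdf = http.get(pdf_url, timeout=60)
    pdf.raise_for_status()

    # GROBID expects the PDF as a multipart form field named "input"
    tei = http.post(
        GROBID_URL,
        files={"input": ("doc.pdf", pdf.content, "application/pdf")},
        timeout=300,
    )
    if tei.status_code == 503:
        return None  # all GROBID threads busy: caller should back off and retry

    tei.raise_for_status()
    http.post(result_url, data=tei.text.encode("utf-8"), timeout=60)
    return tei.text
```

Because the PDF bytes and the TEI result only ever live in memory, the slow spinning disk stays out of the hot path.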
```
org.grobid.max.connections=40
org.grobid.pool.max.wait=1
grobid.temp.path=/run/grobid/tmp
org.grobid.service.is.parallel.execution=true
```

(the last setting is left at its default)
As an environment variable: `TMPDIR=/run/grobid/tmp/`
The process is called as: `./gradlew run --gradle-user-home .`
File logging is disabled; console logging (at WARN level) goes through syslog and does end up on disk. With a slow spinning disk, pdftoxml hitting the disk could be impacting performance, but it doesn't seem too bad.
In aggregate, it takes about 3 core-seconds per PDF to do fulltext extraction, so we can do a million PDFs in about 28 hours.
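Those numbers are self-consistent, assuming "core-seconds" is measured per PDF on the 30-core box described above:

```python
# Sanity check on the reported throughput (all figures from the setup above)
cores = 30
pdfs = 1_000_000
core_seconds_per_pdf = 3

wall_hours = pdfs * core_seconds_per_pdf / cores / 3600
print(round(wall_hours, 1))  # prints 27.8, matching "about 28 hours"
```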
I've been trying to write a quick PDF->XML Spark conversion job leveraging GROBID. I am reading the PDFs remotely, and since the GROBID Java library doesn't support passing in byte arrays, I decided to spin up a separate GROBID server that I can send REST transform requests to, instead of writing and reading tmp files on the Spark nodes. While the code appears to work fine during simple testing, when scaling things up the GROBID REST server quickly bogs down, and I see many Jetty timeouts, very long processing times, etc. The documentation doesn't really address configuring the server in detail, and there are a number of moving parts (Jetty, GROBID, pdf2xml, etc.) in play, so I was hoping you might be able to provide some recommendations based on your experience.
For instance, assuming I have a virtual box with 8 virtual CPUs and 16GB of memory: how many concurrent connections should be set in grobid.properties? Would you lengthen the pool max wait?
Would you modify any Java settings for memory allocation? Any of the Jetty settings? Etc.
Thanks in advance for any insight you can provide.