kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Question - what is a reasonable box configuration for a GROBID REST endpoint #349

Open curtkohler opened 6 years ago

curtkohler commented 6 years ago

I've been trying to write a quick PDF->XML Spark conversion job leveraging GROBID. I am reading the PDFs remotely, and since the GROBID Java library doesn't support passing in ByteArrays, I decided to spin up a separate GROBID server that I can send REST transform requests to, instead of writing and reading tmp files on the Spark nodes. While the code appears to work fine in simple testing, when I scale things up the GROBID REST server bogs down considerably and I see many Jetty timeouts, very long processing times, etc. The documentation doesn't really address many details about configuring the server, and there are a number of moving parts (Jetty, GROBID, pdf2xml, etc.) in play, so I was hoping you might be able to provide some recommendations based on your experience.
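For reference, the kind of call involved looks roughly like this (a simplified sketch, assuming GROBID's standard /api/processFulltextDocument endpoint; the host name and helper function are placeholders for illustration, not the actual job code):

```python
import requests

# Placeholder host; point this at the GROBID server.
GROBID_URL = "http://grobid-host:8070/api/processFulltextDocument"

def pdf_bytes_to_tei(pdf_bytes):
    """Send an in-memory PDF to GROBID and return the TEI XML,
    without touching local disk on the calling node."""
    resp = requests.post(
        GROBID_URL,
        files={"input": ("document.pdf", pdf_bytes, "application/pdf")},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.text
```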

For instance, assuming I have a virtual box with 8 virtual CPUs and 16GB of memory: how many concurrent connections would it be reasonable to set in grobid.properties? Would you lengthen the pool max wait?
Would you modify any Java memory-allocation settings? Any of the Jetty settings? Etc.

Thanks in advance for any insight you can provide.

kermitt2 commented 6 years ago

Hello @curtkohler

Thanks for the issue!

Your issue made me realize that a change introduced when moving to Dropwizard leads to incorrect status codes in the service responses for PDF processing in versions 0.5.0 and 0.5.1 (I overlooked this change when merging; the services for text processing were not changed and are OK).

Normally GROBID sends a status 503 (service unavailable) when all the threads are in use, so that the client can wait a bit for some threads to become available before re-sending the query. This is how the service scales, avoiding mega-queues of PDF queries at the server. The change turned this case into a runtime error, and the resulting accumulation of queries is very likely the reason for these problems.
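For illustration, a rough sketch of what this wait-and-retry behaviour looks like on the client side (the function name, retry count and wait interval are just placeholders, not one of the actual clients):

```python
import time
import requests

def process_pdf(url, pdf_bytes, max_retries=20, wait_s=5):
    """POST a PDF to the GROBID service, backing off and retrying
    whenever the server answers 503 (all threads currently in use)."""
    for _ in range(max_retries):
        resp = requests.post(
            url,
            files={"input": ("doc.pdf", pdf_bytes, "application/pdf")},
            timeout=300,
        )
        if resp.status_code == 503:   # server saturated: wait, then re-send
            time.sleep(wait_s)
            continue
        resp.raise_for_status()
        return resp.text
    raise RuntimeError("GROBID still busy after %d retries" % max_retries)
```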

I've corrected this in the current master version of GROBID. In addition, I completed the documentation: https://grobid.readthedocs.io/en/latest/Grobid-service/ now describes the response status codes for each service. I also added some more explanation here: https://grobid.readthedocs.io/en/latest/Grobid-service/#parallel-mode

In addition, I've written 3 clients that use the service in the intended, scalable way.

Normally, with a machine with 8 CPUs you should get good performance with the default settings of GROBID and of these clients; you don't need to increase the size of the thread pool or the max wait. 16GB of memory is enough to exploit all your available threads.

In the coming weeks, I will perform more large-scale tests with these clients (with millions of PDFs) and I will report whether everything is fine.

bnewbold commented 5 years ago

It's not "official", but I can describe a box setup that has been working OK for some time:

Single large worker host, running in a virtual machine: 30 cores, 2.1 GHz, 50 GB RAM, slow spinning disk (not SSD). Python workers GET (HTTP) PDFs from remote storage and POST them to the GROBID worker, then POST the XML response elsewhere, so that as little as possible hits local disk. We run 50 python worker processes (synchronous, single-threaded processes; plenty of RAM to do so). Not using the (new) python library, just requests.
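Roughly, each worker's per-document loop looks like the following sketch (simplified; the storage URLs and helper name are made up for illustration, not the actual worker code):

```python
import requests

GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

def handle_one(pdf_url, result_url):
    """GET a PDF from remote storage, run it through GROBID, and POST
    the TEI XML elsewhere, so nothing is written to local disk."""
    pdf_bytes = requests.get(pdf_url, timeout=120).content
    grobid_resp = requests.post(
        GROBID_URL,
        files={"input": ("doc.pdf", pdf_bytes, "application/pdf")},
        timeout=300,
    )
    grobid_resp.raise_for_status()
    requests.post(
        result_url,
        data=grobid_resp.content,
        headers={"Content-Type": "application/xml"},
        timeout=120,
    )
```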

org.grobid.max.connections=40
org.grobid.pool.max.wait=1
grobid.temp.path=/run/grobid/tmp
org.grobid.service.is.parallel.execution=true (default)

As an environment variable: TMPDIR=/run/grobid/tmp/

Process called as: ./gradlew run --gradle-user-home .

File logging is disabled; console logging (at WARN level) goes through syslog and does end up on disk. With a slow spinning disk, pdftoxml could be impacting performance, but it doesn't seem too bad.

In aggregate, it takes about 3 core-seconds per PDF to do fulltext extraction, and we can do a million PDFs in about 28 hours.
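(As a rough consistency check, assuming the 30 cores above stay close to saturated: 30 cores × 28 h × 3600 s/h ≈ 3.0 million core-seconds for ~1 million PDFs, i.e. about 3 core-seconds per PDF.)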