kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Memory limit for pdfalto subprocess in Grobid server not working with docker image #1036

Closed sanchay-hai closed 5 months ago

sanchay-hai commented 1 year ago

So we have a long-running Grobid server process that we run with Xmx18G. What we notice is that processing one batch of ~1000 PDFs consumes 7-10GB, but then processing a 2nd batch of ~1000 PDFs consumes another 7-10GB, and eventually the server gets killed with OOM.

This is a consistent finding: the server keeps consuming more and more memory and needs to be restarted. Is there possibly a memory leak? Are there any knobs or workarounds we can play with?

lfoppiano commented 1 year ago

Hi @sanchay-hai,

Could you share the full docker image identifier? For example grobid/grobid:version or lfoppiano/grobid:version.

Could you also share the command you ran it with? Did you use --init as discussed in #1035? (I suppose you did, but I'm just asking to be sure 😅)

The behavior you describe is very strange; let's see what the problem could be. Yesterday, while investigating #1035, I processed about 800 PDF documents twice and did not notice any memory leak or OOM crashes.

Finally, do you by any chance have the log file somewhere? I'd be interested in the latest part of it.

sanchay-hai commented 1 year ago

The docker images I tried were lfoppiano/grobid:0.7.3 and lfoppiano/grobid:0.7.2.

Oh sorry I did not try --init yet. Let me try with that.

Thanks for the quick response!

sanchay-hai commented 1 year ago

@lfoppiano do you have any recommendations for decreasing the memory usage of the server container? Will decreasing batch_size (from 1000) or decreasing concurrency (from 10) help?

lfoppiano commented 1 year ago

The docker images I tried were lfoppiano/grobid:0.7.3 and lfoppiano/grobid:0.7.2.

Oh sorry I did not try --init yet. Let me try with that.

Does this work?

@lfoppiano do you have any recommendations for decreasing the memory usage of the server container? Will decreasing batch_size (from 1000) or decreasing concurrency (from 10) help?

You could perhaps use the same batch_size as the number of concurrent jobs allowed by the server; please check here: https://grobid.readthedocs.io/en/latest/Configuration/#service-configuration. Although this won't make much difference, because Grobid eventually responds with 503 when its capacity is exceeded.

sanchay-hai commented 1 year ago

Still not sure if --init works.

I used a beefier machine and gave the container 50GB: -m 50GB -e JAVA_OPTS=-Xmx48G

The server container still crashed with OOM after a while.

sanchay-hai commented 1 year ago

Let me try with batch_size=10

lfoppiano commented 1 year ago

I'm still not sure I can picture everything you've done (settings, memory and so on).

Meanwhile, could you share the logs? I suppose you can see them in the docker console?

sanchay-hai commented 1 year ago

So here is what I'm trying at the moment

Seems to be holding up. But CPU usage has dropped significantly.

Attaching a snippet of the logs grobid_server_logs_snippet.txt

sanchay-hai commented 1 year ago

Ok this server also crashed with OOM... here are the logs between when it was killed and restarted at 2023-07-19T02:42:26

172.17.0.1 - - [19/Jul/2023:02:42:11 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 54260 "-" "python-requests/2.31.0" 1739
172.17.0.1 - - [19/Jul/2023:02:42:11 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 146897 "-" "python-requests/2.31.0" 2495
      1000 sequences labeled
172.17.0.1 - - [19/Jul/2023:02:42:12 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 417533 "-" "python-requests/2.31.0" 3480
172.17.0.1 - - [19/Jul/2023:02:42:18 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 402404 "-" "python-requests/2.31.0" 9170
172.17.0.1 - - [19/Jul/2023:02:42:21 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 1274210 "-" "python-requests/2.31.0" 11834
172.17.0.1 - - [19/Jul/2023:02:42:22 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 939059 "-" "python-requests/2.31.0" 12973
02:42:27.161 [main] INFO org.grobid.service.main.GrobidServiceApplication - Found conf path: /opt/grobid/grobid-home/config/grobid.yaml
02:42:27.163 [main] WARN org.grobid.service.main.GrobidServiceApplication - Running with default arguments: "server" "/opt/grobid/grobid-home/config/grobid.yaml"
02:42:27.163 [main] INFO org.grobid.service.main.GrobidServiceApplication - Configuration file: /opt/grobid/grobid-home/config/grobid.yaml
172.17.0.1 - - [19/Jul/2023:02:42:30 +0000] "GET /api/isalive HTTP/1.1" 200 4 "-" "python-requests/2.31.0" 52
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/affiliation-address/model.wapiti"
Model path: /opt/grobid/grobid-home/models/affiliation-address/model.wapiti
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/name/header/model.wapiti"
Model path: /opt/grobid/grobid-home/models/name/header/model.wapiti
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/name/citation/model.wapiti"
Model path: /opt/grobid/grobid-home/models/name/citation/model.wapiti
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/header/model.wapiti"
Model path: /opt/grobid/grobid-home/models/header/model.wapiti
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/date/model.wapiti"
Model path: /opt/grobid/grobid-home/models/date/model.wapiti
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/citation/model.wapiti"
Model path: /opt/grobid/grobid-home/models/citation/model.wapiti
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/fulltext/model.wapiti"
Model path: /opt/grobid/grobid-home/models/fulltext/model.wapiti
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/segmentation/model.wapiti"
Model path: /opt/grobid/grobid-home/models/segmentation/model.wapiti
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/reference-segmenter/model.wapiti"
Model path: /opt/grobid/grobid-home/models/reference-segmenter/model.wapiti
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/figure/model.wapiti"
Model path: /opt/grobid/grobid-home/models/figure/model.wapiti
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/table/model.wapiti"
Model path: /opt/grobid/grobid-home/models/table/model.wapiti
WARN  [2023-07-19 02:42:38,073] org.grobid.core.process.ProcessPdfToXml: pdfalto stderr: 
WARN  [2023-07-19 02:42:38,073] org.grobid.core.process.ProcessPdfToXml: pdfalto stderr: You are not supposed to copy this document...
WARN  [2023-07-19 02:42:38,073] org.grobid.core.process.ProcessPdfToXml: pdfalto stderr: 
172.17.0.1 - - [19/Jul/2023:02:42:57 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 58262 "-" "python-requests/2.31.0" 27218
172.17.0.1 - - [19/Jul/2023:02:42:58 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 71625 "-" "python-requests/2.31.0" 27431
172.17.0.1 - - [19/Jul/2023:02:42:58 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 105036 "-" "python-requests/2.31.0" 27777
sanchay-hai commented 1 year ago

So to me it does seem like there is a potential memory leak. Let me know if you have any clues, or want me to try something

lfoppiano commented 1 year ago

Mmmm... I'm not sure what the issue is here. I don't see any errors or exceptions.

When the server crashes, does docker restart it automatically? Do you have the OOM exception?

Also in the log, I saw this:

WARN  [2023-07-19 02:42:38,073] org.grobid.core.process.ProcessPdfToXml: pdfalto stderr: 
WARN  [2023-07-19 02:42:38,073] org.grobid.core.process.ProcessPdfToXml: pdfalto stderr: You are not supposed to copy this document...
WARN  [2023-07-19 02:42:38,073] org.grobid.core.process.ProcessPdfToXml: pdfalto stderr: 

This seems to be a protected PDF document. Could you try to track it down? Maybe this is causing the problem?

How many CPUs do you have allocated to the docker?

Could you also share the command line you use to run the docker image?

sanchay-hai commented 1 year ago

When the server crashes, does docker restart it automatically?

Yes, I have --restart

Do you have the OOM exception?

No, unfortunately I haven't seen an OOM exception in the logs.

Maybe this is causing the problem?

Could this lead to a memory leak? How so? I can try to find it if you think it could cause a memory leak.

How many CPUs do you have allocated to the docker?

16 vCPUs

command line you use to run the docker image?

docker run --name grobidserver --restart unless-stopped -d -t --init -p 8070:8070 -m 50GB -e JAVA_OPTS=-Xmx48G lfoppiano/grobid:0.7.3

sanchay-hai commented 1 year ago

I think I may have figured out the reason for the OOM kills: the pdfalto processes consume a lot of memory. If the main grobid-service process has an Xmx of 48G, it will keep growing and growing. Then, when the pdfalto processes need memory, there isn't much room left, and the entire container gets OOM-killed. That may also be why we don't see an OOM stack trace in the main grobid-service process.
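For example, something like the following should confirm whether it is the container itself being killed by the kernel OOM killer rather than a Java-level OOM (a sketch using standard docker and kernel checks, with the container name from my docker run command above):

# true if the kernel OOM-killed the container
docker inspect --format '{{.State.OOMKilled}}' grobidserver
# kernel OOM-killer entries, run on the host
sudo dmesg | grep -i "killed process"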

Unfortunately, I tried with -m 50GB -e JAVA_OPTS=-Xmx12G and batch_size=50, and even this server got OOM-killed. I monitored the memory: grobid-service was hovering around 12-15G while the pdfalto processes came and went. They do consume a lot of memory and must have spiked to 33-38G, hence the container got OOM-killed.
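To watch the split between grobid-service and the pdfalto children while a batch runs, something like this is enough (a sketch; the ps call assumes procps is available inside the image):

# container-level memory over time
docker stats grobidserver
# per-process resident memory inside the container, largest first
docker exec grobidserver ps -o pid,rss,comm --sort=-rss | head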

Any ideas how to reduce the memory consumption of the pdfalto processes?

lfoppiano commented 1 year ago

@sanchay-hai OK, in order to investigate on my side, I need a sample of your PDF documents. Could you share them via a public link, here, or via email: FOPPIANO.Luca AT nims.go.jp?

kermitt2 commented 1 year ago

Hi @sanchay-hai

There is normally no memory leak with older versions of Grobid, because I tested the server with millions of PDFs over several days with JProfiler specifically for this. However, I did not test more recent versions, and I did not test the server running as a docker container, because all these tests are time-consuming. I don't see anything in particular added in the last 2 years that could lead to memory leaks.

First, you can have a look at this about setting up your server: https://grobid.readthedocs.io/en/latest/Configuration/#service-configuration

You should not allocate memory to the JVM like this: the server needs free memory for other things like pdfalto and JNI, and more generally Java uses extra memory outside this allocation. Most of Grobid's memory usage is outside the JVM (the machine learning parts and PDF parsing with pdfalto). Basically you should not touch JAVA_OPTS (I have never used --restart unless-stopped either, because the server should never crash :), nor -m 50GB, because the container should simply use as much memory as needed).
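Concretely, a plain invocation along these lines should be enough (a sketch reusing the image, port and --init flag already mentioned in this thread, with no JVM or container memory caps):

docker run -d --init --name grobidserver -p 8070:8070 lfoppiano/grobid:0.7.3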

WARN  [2023-07-19 02:42:38,073] org.grobid.core.process.ProcessPdfToXml: pdfalto stderr: 
WARN  [2023-07-19 02:42:38,073] org.grobid.core.process.ProcessPdfToXml: pdfalto stderr: You are not supposed to copy this document...
WARN  [2023-07-19 02:42:38,073] org.grobid.core.process.ProcessPdfToXml: pdfalto stderr: 

These are normal warnings from pdfalto, including the one about the protected PDF (it might still be parsed if I remember well, but you are normally not allowed to); nothing unusual.

In production, use modelPreload: true and wait for the models to be loaded; afaik it does not impact memory usage, but it can make the restart problematic.

About the client: batch_size is just the size of the list of PDFs in the client queue (this is due to a limitation of Python multithreading); it has no impact on memory and should stay very high, like 1000. Use a concurrency parameter --n close to the number of available threads on your machine.
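For example, with the Python client this would look roughly like the following (a sketch; flag names as in grobid_client_python, --n set close to the 16 vCPUs mentioned above):

grobid_client --input ./pdfs --output ./tei_out --n 16 processFulltextDocument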

If you have 50GB of memory, I don't think you should run into any problems with pdfalto, unless you are parsing a lot of exceptionally long and large PDFs with high concurrency on the server. Scholarly PDFs are normally not documents with more than 1000 pages, so this is normally not a problem.

In case the problem is pdfalto memory usage (possible if you have mega-PDF files, but these are normally not supported by Grobid and not so frequent in the scholarly world), there are two parameters to control it (see https://grobid.readthedocs.io/en/latest/Configuration/#pdf-processing).

These 2 parameters are used to protect the server, as circuit breakers. There are unfortunately a lot of problematic PDFs around, and parsing PDFs in an external process protects the JVM from crashing in scenarios with millions of PDFs to process. For me, these default parameters were working fine when processing very large numbers of PDFs.
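To check what your container is currently running with, you can simply dump that part of the configuration (the config path is the one appearing in your server logs; the exact keys around memoryLimitMb may differ by version):

docker exec grobidserver grep -n -B2 -A4 "memoryLimitMb" /opt/grobid/grobid-home/config/grobid.yaml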

sanchay-hai commented 1 year ago

I agree there may not be a memory leak. It is just that the pdfalto processes consume a lot of memory (maybe on large pdfs).

You should not allocate memory to the JVM like this .... nor -m 50GB because the container should simply use as much memory as needed

So I started without setting these at all, but that caused the machines to basically become unresponsive after a while, I guess because they started swapping and thrashing badly. So I would prefer to give the container a max memory and have it killed and restarted. Xmx is also helpful IMO, because otherwise the JVM will keep consuming more and more memory. I've noticed that with Xmx48G, the grobid-service process itself grew to 35GB+, leaving even less memory for the pdfalto processes.

batch_size has no impact on memory and should stay very high like 1000

Cool. I do notice OOM crashes more often with a higher batch_size, but with a lower one throughput is reduced too.

memoryLimitMb: 6096

This helps a lot. So with concurrency: 10, that means in the worst case we can go up to ~60GB of pdfalto memory. If I give grobid-service 12GB, then 72GB should suffice with the default parameters.

Let me try allocating enough memory. Will keep you guys posted. Thanks a lot! Really appreciate the detailed responses!!

sanchay-hai commented 1 year ago

That didn't work either. I tried -m 112GB -e JAVA_OPTS="-Xmx24G" with batch_size=1000, n=10 and the server still crashed with OOM. I saw pdfalto memory grow above 90GB. How is that possible?

kermitt2 commented 1 year ago

I saw pdfalto memory grow above 90GB. How is that possible?

Good question. I've never seen anything like this happen so far - actually, I've never needed to run Grobid on a machine with so much memory, even when running with n=24 :)

To reproduce the problem, would it be possible to give me some more information about the kind of PDFs you are processing (normal scholarly article PDFs, or large PhD theses/books?) and how many PDFs are processed before reaching this amount of memory?

sanchay-hai commented 1 year ago

Here is the zip file of the PDFs that caused the above OOM (the download link is valid for 12h). I was not able to reproduce the issue on my laptop, where the max memory usage was 19GB with --n 10 and 20GB with --n 20.

lfoppiano commented 1 year ago

@sanchay-hai thanks for the files. Meanwhile, thanks to the favorable time zone, I tested my instance with your files.

I processed them all in 12 minutes, without any OOM or any issue. You can see the log of the grobid client here:

Processing of /Users/lfoppiano/Downloads/libgen.scimag64609000-64609999/10.2475/ajs.279.6.632.pdf failed with error 500 , [NO_BLOCKS] PDF parsing resulted in empty content
Processing of /Users/lfoppiano/Downloads/libgen.scimag64609000-64609999/10.4324/9781315733814.pdf failed with error 500 , [GENERAL] An exception occurred while running Grobid.
Processing of /Users/lfoppiano/Downloads/libgen.scimag64609000-64609999/10.4324/9781315773827.pdf failed with error 500 , [GENERAL] An exception occurred while running Grobid.
Processing of /Users/lfoppiano/Downloads/libgen.scimag64609000-64609999/10.4324/9781315849522.pdf failed with error 408 , None
Processing of /Users/lfoppiano/Downloads/libgen.scimag64609000-64609999/10.4324/9781315772219.pdf failed with error 408 , None
Processing of /Users/lfoppiano/Downloads/libgen.scimag64609000-64609999/10.4324/9781315757728.pdf failed with error 408 , None
Processing of /Users/lfoppiano/Downloads/libgen.scimag64609000-64609999/10.4324/9781315660806.pdf failed with error 408 , None
Processing of /Users/lfoppiano/Downloads/libgen.scimag64609000-64609999/10.4324/9781315722610.pdf failed with error 408 , None
Processing of /Users/lfoppiano/Downloads/libgen.scimag64609000-64609999/10.4324/9781315643007.pdf failed with error 408 , None
Processing of /Users/lfoppiano/Downloads/libgen.scimag64609000-64609999/10.4324/9781315709536.pdf failed with error 408 , None
Processing of /Users/lfoppiano/Downloads/libgen.scimag64609000-64609999/10.4324/9781315765822.pdf failed with error 408 , None
Processing of /Users/lfoppiano/Downloads/libgen.scimag64609000-64609999/10.4324/9781315885681.pdf failed with error 408 , None
Processing of /Users/lfoppiano/Downloads/libgen.scimag64609000-64609999/10.4324/9780203121993.pdf failed with error 408 , None
Processing of /Users/lfoppiano/Downloads/libgen.scimag64609000-64609999/10.4324/9781315885087.pdf failed with error 408 , None
Processing of /Users/lfoppiano/Downloads/libgen.scimag64609000-64609999/10.4324/9781315796772.pdf failed with error 408 , None
Processing of /Users/lfoppiano/Downloads/libgen.scimag64609000-64609999/10.4324/9781315558653.pdf failed with error 408 , None
Processing of /Users/lfoppiano/Downloads/libgen.scimag64609000-64609999/10.4324/9781315744759.pdf failed with error 408 , None
Processing of /Users/lfoppiano/Downloads/libgen.scimag64609000-64609999/10.4324/9780203754214.pdf failed with error 502 , <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Proxy Error</title>
</head><body>
<h1>Proxy Error</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
The proxy server could not handle the request <em><a href="/grobid/stable/api/processFulltextDocument">POST&nbsp;/grobid/stable/api/processFulltextDocument</a></em>.<p>
Reason: <strong>Error reading from remote server</strong></p></p>
</body></html>

Processing of /Users/lfoppiano/Downloads/libgen.scimag64609000-64609999/10.4324/9780203641828.pdf failed with error 500 , [GENERAL] An exception occurred while running Grobid.
Processing of /Users/lfoppiano/Downloads/libgen.scimag64609000-64609999/10.4324/9781315680019.pdf failed with error 408 , None
runtime: 760.024 seconds 

As a side note, the data under /Users/lfoppiano/Downloads/libgen.scimag64609000-64609999/10.4324 is composed of PDF documents with more than 100 pages.

I made a test with lfoppiano/grobid:0.7.3 deployed on a 12-CPU workstation via docker, running with Grobid's normal memory settings. The machine has 62GB of RAM. I don't have any particular settings limiting the memory of the docker service. Together with Grobid, I ran several other services on the same machine (including at least 3-4 Grobid-and-modules instances).

My Grobid service had already been running for 2-3 weeks and I had been using it to process other things, so it's not a fresh container. While doing my test, I kept an eye on the pdfalto processes and they were all coming and going without any problem.

sanchay-hai commented 1 year ago

Thank you, I also tried and was unable to reproduce on my laptop.

One thing to try to reproduce it might be to run the same zip in a loop, maybe shuffling the PDF list each time.

kermitt2 commented 1 year ago

Here is the zip file of the pdfs that caused the above OOM (download link is valid for 12h).

I was too slow, the token has expired!

sanchay-hai commented 1 year ago

Here it is again. Feel free to download the zip.

kermitt2 commented 1 year ago

Here it is again. Feel free to download the zip.

Thank you @sanchay-hai, I got it !

sanchay-hai commented 1 year ago

OK, I did a little more debugging.

In a couple of instances, I checked that when the container went OOM, it was a single pdfalto process hogging all the memory (e.g. growing above 90GB).

I know you said that pdfalto is limited to 6GB (and I see the code here, is that the right place?), but I'm not sure the ulimit is actually getting applied. For example, I see this in the docker container (notice that "Max address space" is unlimited)

Any ideas?

root@c77d2e482220:/opt/grobid# cat /proc/597/cmdline 
/opt/grobid/grobid-home/pdfalto/lin-64/pdfalto-fullFontName-noLineNumbers-noImage-annotation-filesLimit2000/opt/grobid/grobid-home/tmp/origin15935559128104841724.pdf/opt/grobid/grobid-home/tmp/cWdXpjHuZf.lxmlroot@c77d2e482220:/opt/grobid#

root@c77d2e482220:/opt/grobid# cat /proc/597/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            10485760             10485760             bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             unlimited            unlimited            processes 
Max open files            65536                65536                files     
Max locked memory         unlimited            unlimited            bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       30446                30446                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
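For reference, the limits of whichever pdfalto children happen to be alive can be grabbed in one go from a shell inside the container (a sketch; it assumes pgrep is available in the image):

for p in $(pgrep -f pdfalto); do echo "== PID $p"; grep "address space" /proc/$p/limits; done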
kermitt2 commented 1 year ago

@sanchay-hai Thanks a lot for this, I think you're totally right - this is the problem: the memory limit for the pdfalto process is not being applied in this case. I need to find the time to dive into this and explore a fix.

In principle, in server mode, we use the following shell scripts to better control the child process and its termination:
https://github.com/kermitt2/grobid/blob/master/grobid-home/pdfalto/lin-64/pdfalto_server
https://github.com/kermitt2/grobid/blob/master/grobid-home/pdfalto/mac-64/pdfalto_server
This is likely the place to add the memory ulimit.
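The idea would be roughly the following in that wrapper script (a minimal sketch, not the actual content of pdfalto_server nor of the eventual fix; the value would come from the memoryLimitMb setting in grobid.yaml, converted to KB for ulimit):

#!/bin/bash
# cap the virtual address space of this shell and its children (in KB), then run pdfalto
ulimit -Sv $((6096 * 1024))
exec /opt/grobid/grobid-home/pdfalto/lin-64/pdfalto "$@"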

lfoppiano commented 1 year ago

@sanchay-hai I've given some thought to your problem:

kermitt2 commented 1 year ago

So it appears that the memory limit mechanism was not set in server mode. I pushed a fix with PR #1038 (adding the ulimit in the pdfalto_server script, as mentioned above), and from my tests it is now working correctly both with the java server and the docker server. Setting the pdfalto memory limit to something very low in the Grobid config file results, as expected, in the subprocess failing on large PDFs.

@sanchay-hai could you maybe test your set of PDFs again with the following updated docker image:

docker pull grobid/grobid-crf:0.8.0-SNAPSHOT
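One way to double-check that the limit is enforced (a sketch; the in-container config path is the one from your logs, and it assumes you keep a local grobid.yaml with a very low memoryLimitMb to mount over the default):

docker run --rm --init -p 8070:8070 \
  -v "$PWD/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro" \
  grobid/grobid-crf:0.8.0-SNAPSHOT

With that in place, a large PDF sent to /api/processFulltextDocument should fail in pdfalto instead of growing the container without bound.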
lfoppiano commented 1 year ago

I've updated the docker images: lfoppiano/grobid:0.7.3 and lfoppiano/grobid:0.7.3-arm

sanchay-hai commented 1 year ago

Sorry, I've been busy with other things. Thank you for fixing the issue; feel free to close it. I will get back on this thread if it doesn't work for me the next time I manage to test.

kermitt2 commented 1 year ago

No problem @sanchay-hai, thanks a lot for this issue! We normally let the person who opened the issue close it at their convenience.