tleyden / open-ocr

Run your own OCR-as-a-Service using Tesseract and Docker
Apache License 2.0

memory consumption #109

Open xf0e opened 5 years ago

xf0e commented 5 years ago

Hey tleyden!

I'm looking into memory consumption by both the worker and the HTTP daemon, and it can be very high. Starting from a low memory profile of just under 1 MB, it grows with every request. In my test the memory consumption grew to over 2 GB for 10 requests with a TIFF image about 30 MB in size. I tried to debug a little, but lacking the experience I only found out that most of the memory is allocated on the heap by bytes.makeSlice. This example shows consumption after one request with a 30 MB payload total.

File: cli_httpd
Type: inuse_space
Time: Dec 14, 2018 at 3:32pm (CET)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 83.50MB, 100% of 83.50MB total
Showing top 20 nodes out of 25
      flat  flat%   sum%        cum   cum%
      83MB 99.40% 99.40%       83MB 99.40%  bytes.makeSlice
    0.50MB   0.6%   100%     0.50MB   0.6%  net/http.init.2
         0     0%   100%       32MB 38.32%  bytes.(*Buffer).ReadFrom
         0     0%   100%       51MB 61.08%  bytes.(*Buffer).Write
         0     0%   100%       83MB 99.40%  bytes.(*Buffer).grow
         0     0%   100%       51MB 61.08%  encoding/base64.(*encoder).Write
         0     0%   100%       51MB 61.08%  encoding/json.(*encodeState).marshal
         0     0%   100%       51MB 61.08%  encoding/json.(*encodeState).reflectValue
         0     0%   100%       51MB 61.08%  encoding/json.(*structEncoder).encode
         0     0%   100%       51MB 61.08%  encoding/json.(*structEncoder).encode-fm
         0     0%   100%       51MB 61.08%  encoding/json.Marshal
         0     0%   100%       51MB 61.08%  encoding/json.encodeByteSlice
         0     0%   100%       83MB 99.40%  github.com/xf0e/open-ocr.(*OcrHttpHandler).ServeHTTP
         0     0%   100%       32MB 38.32%  github.com/xf0e/open-ocr.(*OcrRequest).downloadImgUrl
         0     0%   100%       83MB 99.40%  github.com/xf0e/open-ocr.(*OcrRpcClient).DecodeImage
         0     0%   100%       83MB 99.40%  github.com/xf0e/open-ocr.HandleOcrRequest
         0     0%   100%       32MB 38.32%  github.com/xf0e/open-ocr.url2bytes
         0     0%   100%       32MB 38.32%  io/ioutil.ReadAll
         0     0%   100%       32MB 38.32%  io/ioutil.readAll
         0     0%   100%     0.50MB   0.6%  main.init
(pprof) list makeSlice
Total: 83.50MB
ROUTINE ======================== bytes.makeSlice in /usr/lib/go/src/bytes/buffer.go
      83MB       83MB (flat, cum) 99.40% of Total
         .          .    226:   defer func() {
         .          .    227:           if recover() != nil {
         .          .    228:                   panic(ErrTooLarge)
         .          .    229:           }
         .          .    230:   }()
      83MB       83MB    231:   return make([]byte, n)
         .          .    232:}
         .          .    233:
         .          .    234:// WriteTo writes data to w until the buffer is drained or an error occurs.
         .          .    235:// The return value n is the number of bytes written; it always fits into an
         .          .    236:// int, but it is int64 to match the io.WriterTo interface. Any error
(pprof) list DecodeMessage
Total: 83.50MB

I'm trying to understand where the additional 50 MB came from, but I can't figure it out. I assume the memory load comes from storing the payload as []byte in the function url2bytes.
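For what it's worth, the 51 MB under encoding/json in the profile is consistent with json.Marshal base64-encoding the []byte payload: base64 inflates the data by 4/3, and the encoder's internal bytes.Buffer grows by doubling, which is exactly the bytes.makeSlice allocation at the top. A minimal sketch (the struct is a hypothetical stand-in for the real request type):

package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical stand-in for the request struct: json.Marshal encodes
// []byte fields as base64, so the marshaled form is ~4/3 the raw size.
type ocrRequest struct {
	ImgBytes []byte `json:"img_bytes"`
}

func main() {
	raw := make([]byte, 30<<20) // a 30 MB payload, like the test image
	out, err := json.Marshal(ocrRequest{ImgBytes: raw})
	if err != nil {
		panic(err)
	}
	// Prints "raw: 30 MB, marshaled: 40 MB" -- the raw copy plus the
	// base64 text already account for ~70 MB before buffer doubling.
	fmt.Printf("raw: %d MB, marshaled: %d MB\n", len(raw)>>20, len(out)>>20)
}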

Is it possible to rewrite the function to store the payload on a file system and pass the path to the file over rabbitmq instead of the whole payload?

The memory consumption by the worker is even higher, I think for the same reason, but I can't prove it because I was not able to get the pprof data for this part of open-ocr.
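Presumably the worker could expose pprof the same way; a minimal sketch, assuming a side listener can be added (the port is arbitrary, not part of open-ocr):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Profiling endpoint served alongside the worker.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the OCR worker as usual ...
	select {} // placeholder for the worker's main loop
}

With that in place, go tool pprof http://localhost:6060/debug/pprof/heap should produce the same kind of inuse_space profile as above.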

I would try to rewrite the code; I just need to know which parts are involved (I don't understand whether the rabbitmq part will be involved).

Are you interested in working through this problem together and providing me some hints?

Regards!

tleyden commented 5 years ago

Is it possible to rewrite the function to store the payload on a file system and passing over rabbitmq the path to the file instead of the whole payload?

I don't see how that would work, since typically the processes will be running on different machines. It would only work if there were a shared/common filesystem, which won't exist on every system.

tleyden commented 5 years ago

On my test the memory consumption grew to over 2GB for 10 requests with TIFF image about 30Mb of size.

Can you give details on how you are determining the memory usage?

Also, are you sure these requests are being serviced? E.g., that they aren't just getting stuck in rabbitmq without any workers to process them? In that case, I'd expect the memory to increase in the way you observed.

I did a quick pass over the code and no obvious memory leaks jumped out at me.

xf0e commented 5 years ago

Hm, you're right, I forgot to take this into account... I need to rethink the idea; there must be a solution for this.

Can you help me find where (in the case above) the additional 50 MB comes from? This alone would drop memory consumption by two thirds. I tried to investigate further and found out that memory is returned to the OS 8-10 minutes after the OCR text is delivered to the client.
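(The 8-10 minute delay matches the Go runtime's behaviour: freed heap is only returned to the OS by the background scavenger several minutes later. If the resident footprint between requests is the concern, it can be handed back eagerly; a sketch, assuming a call can be added after each delivered response:)

package main

import "runtime/debug"

func main() {
	// ... handle a request and deliver the OCR result ...

	// Force a GC and return as much freed memory to the OS as possible,
	// instead of waiting for the scavenger. This trades some CPU for a
	// lower resident footprint; it does not fix an actual leak.
	debug.FreeOSMemory()
}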

xf0e commented 5 years ago

Sorry, I accidentally closed the issue; it was not intended.

My testing procedure:

All requests got a valid reply from open-ocr, and the memory usage went down to just 40 MB and 9 MB about 10 minutes after delivery.

tleyden commented 5 years ago

All requests got a valid reply from open-ocr, and the memory usage went down to just 40 MB and 9 MB about 10 minutes after delivery.

So it sounds like you are saying the memory is eventually being returned to the OS? Are you just trying to avoid the memory spikes from happening at all? Or are you seeing what appears to be a leak, where it just keeps increasing? You mentioned you saw it grow to 2 GB.

tleyden commented 5 years ago

The gctrace section here might shed some more light on it: https://dave.cheney.net/tag/godebug
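For example, running either binary with GODEBUG=gctrace=1 makes the runtime print a summary line after every collection (heap sizes before and after, plus the scvgN lines showing what the scavenger returns to the OS), which should show whether the heap itself keeps growing or is just slow to be handed back.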

xf0e commented 5 years ago

There are two problems with the memory footprint.

You can see how the memory footprint looks over time (thanks for the gctrace link!). The memory used by cli_worker won't be released completely.

[images: cli_httpd and cli_worker memory graphs]

Thanks!

tleyden commented 5 years ago

The non-successful requests (with timeout) are still holding the memory.

I think the problem is here:

https://github.com/tleyden/open-ocr/blob/4fd4a2ac2a09c30684729d9c41d145ebafa5d4cc/ocr_rpc_client.go#L215

If it times out waiting for a worker response, then nothing will be listening on that rpcResponseChan and so rpcResponseChan <- ocrResult will block forever.
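The pattern in miniature (a sketch, not the actual open-ocr code): once the handler gives up, an unbuffered send has no receiver, pinning both the goroutine and the result it references, image bytes included. A buffer of one lets the late sender complete and exit:

package main

import (
	"fmt"
	"time"
)

type ocrResult struct{ Text string }

func main() {
	// Buffered with capacity 1: the late send below completes even after
	// the handler has timed out, so the goroutine exits and the result
	// becomes garbage. With an unbuffered channel the send would block
	// forever, leaking the goroutine and everything it references.
	rpcResponseChan := make(chan ocrResult, 1)

	go func() {
		time.Sleep(2 * time.Second) // worker reply arrives too late
		rpcResponseChan <- ocrResult{Text: "Lorem ipsum"}
	}()

	select {
	case res := <-rpcResponseChan:
		fmt.Println("got:", res.Text)
	case <-time.After(1 * time.Second):
		fmt.Println("Timeout waiting for RPC response")
	}
}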

Do you see "send result to rpcResponseChan" messages without corresponding "sent result to rpcResponseChan" messages?

Alternatively, do you see extra goroutines stuck there? (here are some ways to dump the goroutine stacks)

xf0e commented 5 years ago

This is the output. Last succeeded ocr request and the rest what didn't make in time:

22:57:28.143282 OCR_CLIENT: got 20356B delivery: [1] "Lorem ipsum"
22:57:28.143302 OCR_CLIENT: send result to rpcResponseChan
22:57:28.143304 OCR_CLIENT: sent result to rpcResponseChan
22:57:28.144554 OCR_HTTP: ocrResult: {Lorem Ipsum }
22:57:34.550229 ERROR: Timeout waiting for RPC response -- open-ocr.HandleOcrRequest() at ocr_http_handler.go:80
22:57:34.550242 ERROR: Unable to perform OCR decode. Error: Timeout waiting for RPC response -- open-ocr.(OcrHttpHandler).ServeHTTP() at ocr_http_handler.go:40
22:57:36.157883 ERROR: Timeout waiting for RPC response -- open-ocr.HandleOcrRequest() at ocr_http_handler.go:80
22:57:36.157894 ERROR: Unable to perform OCR decode. Error: Timeout waiting for RPC response -- open-ocr.(OcrHttpHandler).ServeHTTP() at ocr_http_handler.go:40
22:57:37.794640 ERROR: Timeout waiting for RPC response -- open-ocr.HandleOcrRequest() at ocr_http_handler.go:80
22:57:37.794651 ERROR: Unable to perform OCR decode. Error: Timeout waiting for RPC response -- open-ocr.(OcrHttpHandler).ServeHTTP() at ocr_http_handler.go:40
22:57:39.532304 ERROR: Timeout waiting for RPC response -- open-ocr.HandleOcrRequest() at ocr_http_handler.go:80
22:57:39.532317 ERROR: Unable to perform OCR decode. Error: Timeout waiting for RPC response -- open-ocr.(OcrHttpHandler).ServeHTTP() at ocr_http_handler.go:40

For the requests that timed out, there is no "send result to rpcResponseChan" at all.

tleyden commented 5 years ago

I wonder if the memory leak would be reproducible in a simpler case -- what if there is no worker running? Do you see the memory just keep increasing as you add jobs that are never processed?

xf0e commented 5 years ago

There are at least two cases of leaking memory. Both can be reproduced really easily.

The first case:

what if there is no worker running?

only cli_httpd and rabbitmq running:

Then all requests time out, and cli_httpd will release the memory at some point: [image: memory graph]

21:26:27.431007 OCR_HTTP: Starting listener on :8080
21:26:58.560339 OCR_HTTP: serveHttp called
21:26:58.561176 OCR_CLIENT: dialing "amqp://guest:guest@localhost:5672/"
21:26:58.569238 OCR_CLIENT: callbackQueue name: amq.gen-ZpRyOndmB4PozCcbS6ZOGg
21:26:58.569582 OCR_CLIENT: looping over deliveries..
21:26:58.648990 OCR_CLIENT: ocrRequest before: ImgUrl: , EngineType: ENGINE_TESSERACT, Preprocessors: []
21:26:58.649003 OCR_CLIENT: publishing with routing key "decode-ocr"
21:26:58.649008 OCR_CLIENT: ocrRequest after: ImgUrl: , EngineType: ENGINE_TESSERACT, Preprocessors: []
21:27:50.653017 OCR_HTTP: serveHttp called
21:27:50.653074 OCR_CLIENT: dialing "amqp://guest:guest@localhost:5672/"
21:27:50.657004 OCR_CLIENT: callbackQueue name: amq.gen-gfWBqIeFQ6RMFHzS0KoSXA
21:27:50.657382 OCR_CLIENT: looping over deliveries..
21:27:50.684407 OCR_CLIENT: ocrRequest before: ImgUrl: , EngineType: ENGINE_TESSERACT, Preprocessors: []
21:27:50.684417 OCR_CLIENT: publishing with routing key "decode-ocr"
21:27:50.684431 OCR_CLIENT: ocrRequest after: ImgUrl: , EngineType: ENGINE_TESSERACT, Preprocessors: []
21:28:01.923367 OCR_HTTP: serveHttp called
21:28:01.923432 OCR_CLIENT: dialing "amqp://guest:guest@localhost:5672/"
21:28:01.928467 OCR_CLIENT: callbackQueue name: amq.gen-lIBccn0D6zWnrXaBaedItw
21:28:01.928843 OCR_CLIENT: looping over deliveries..
21:28:01.966831 OCR_CLIENT: ocrRequest before: ImgUrl: , EngineType: ENGINE_TESSERACT, Preprocessors: []
21:28:01.966842 OCR_CLIENT: publishing with routing key "decode-ocr"
21:28:01.966846 OCR_CLIENT: ocrRequest after: ImgUrl: , EngineType: ENGINE_TESSERACT, Preprocessors: []
21:28:58.778786 ERROR: Timeout waiting for RPC response -- open-ocr.HandleOcrRequest() at ocr_http_handler.go:80
21:28:58.778826 ERROR: Unable to perform OCR decode. Error: Timeout waiting for RPC response -- open-ocr.(*OcrHttpHandler).ServeHTTP() at ocr_http_handler.go:40
21:29:50.780741 ERROR: Timeout waiting for RPC response -- open-ocr.HandleOcrRequest() at ocr_http_handler.go:80
21:29:50.780766 ERROR: Unable to perform OCR decode. Error: Timeout waiting for RPC response -- open-ocr.(*OcrHttpHandler).ServeHTTP() at ocr_http_handler.go:40
GC forced
21:30:02.062150 ERROR: Timeout waiting for RPC response -- open-ocr.HandleOcrRequest() at ocr_http_handler.go:80
21:30:02.062169 ERROR: Unable to perform OCR decode. Error: Timeout waiting for RPC response -- open-ocr.(*OcrHttpHandler).ServeHTTP() at ocr_http_handler.go:40
GC forced
GC forced
scvg2: 0 MB released
GC forced
GC forced
scvg3: 382 MB released

Rabbitmq will preserve the messages: [image: rabbitmq queue]

What we are missing at this point is deleting the corresponding messages from the rabbitmq queue.
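(One possible fix, not in the current code: publish each request with a per-message TTL matching the HTTP timeout, so rabbitmq discards messages no handler is waiting for anymore. A sketch against the streadway/amqp API, with a placeholder exchange name:)

package ocr

import "github.com/streadway/amqp"

// publishWithTTL is a sketch: rabbitmq drops the message once the TTL
// expires, so timed-out requests no longer pile up in the queue.
func publishWithTTL(ch *amqp.Channel, body []byte) error {
	return ch.Publish(
		"open-ocr-exchange", // exchange (placeholder, not the real name)
		"decode-ocr",        // routing key, as seen in the logs above
		false,               // mandatory
		false,               // immediate
		amqp.Publishing{
			ContentType: "application/json",
			Body:        body,
			Expiration:  "120000", // TTL in milliseconds, matching the 120 s timeout
		},
	)
}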

When we then start cli_worker, it immediately gets all the messages from the queue and performs OCR on them. This is bad, since no one will ever receive the results, but there is no memory leakage in this case: [image: memory graph]

21:43:44.248453 OCR_WORKER: Creating new OCR Worker
21:43:44.248482 OCR_WORKER: Run() called...
21:43:44.248496 OCR_WORKER: dialing "amqp://guest:guest@localhost:5672/"
21:43:44.253650 OCR_WORKER: got Connection, getting Channel
21:43:44.254781 OCR_WORKER: binding to: decode-ocr
21:43:44.255152 OCR_WORKER: Queue bound to Exchange, starting Consume (consumer tag "foo")
21:43:44.354363 OCR_WORKER: got 40648128 byte delivery: [1]. Routing key: decode-ocr Reply to: amq.gen-ZpRyOndmB4PozCcbS6ZOGg
21:43:44.841885 OCR_TESSERACT: Use tesseract with bytes image
21:43:44.851449 OCR_TESSERACT: cmdArgs: [/tmp/23e0f1dd-85ca-4a31-772a-91960491d9db /tmp/23e0f1dd-85ca-4a31-772a-91960491d9db]
21:44:09.915722 OCR_TESSERACT: checking if exists: /tmp/23e0f1dd-85ca-4a31-772a-91960491d9db.txt
21:44:09.917860 OCR_WORKER: Sending rpc response: {Lorem Ipsum}
21:44:09.917870 OCR_WORKER: sendRpcResponse to: amq.gen-ZpRyOndmB4PozCcbS6ZOGg
21:44:09.917975 OCR_WORKER: sendRpcResponse succeeded
21:44:09.917982 OCR_WORKER: got 40648128 byte delivery: [2]. Routing key: decode-ocr Reply to: amq.gen-gfWBqIeFQ6RMFHzS0KoSXA
21:44:10.349823 OCR_TESSERACT: Use tesseract with bytes image
21:44:10.359666 OCR_TESSERACT: cmdArgs: [/tmp/8752631a-1be0-4905-74de-8f262e405049 /tmp/8752631a-1be0-4905-74de-8f262e405049]
21:44:36.289734 OCR_TESSERACT: checking if exists: /tmp/8752631a-1be0-4905-74de-8f262e405049.txt
21:44:36.292069 OCR_WORKER: Sending rpc response: {Lorem Ipsum}
21:44:36.292081 OCR_WORKER: sendRpcResponse to: amq.gen-gfWBqIeFQ6RMFHzS0KoSXA
21:44:36.293107 OCR_WORKER: sendRpcResponse succeeded
GC forced
GC forced
scvg1: 0 MB released
GC forced
scvg2: 231 MB released
GC forced

The second case

This is the case described before this post. We have cli_httpd, cli_worker and rabbitmq running. Now we push just enough requests into the queue that at least one won't make it in time. If we have enough RAM on the cli_httpd and rabbitmq machines, we won't crash here. Now the worker kicks in... All requests which time out will not release the memory belonging to them. It is actually really easy to reproduce:

  1. Start one cli_httpd, one cli_worker and rabbitmq. The standard timeout is 120 seconds in the sources.
  2. Now, on one or more ttys, call curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://localhost:8000/testimage.tif","engine":"tesseract"}' http://localhost:8080/ocr X times, where X depends on the test image. In my case it is a 30 MB image with 6 pages on it. On my test machine, the 4th or 5th request and onwards will receive a timeout, and cli_worker will leak memory for such requests (a Go driver for this step is sketched below).
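A minimal Go driver for step 2, in case repeated curl calls are awkward (the URLs match the repro above; the request count is arbitrary):

package main

import (
	"bytes"
	"fmt"
	"net/http"
	"sync"
)

func main() {
	payload := []byte(`{"img_url":"http://localhost:8000/testimage.tif","engine":"tesseract"}`)
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ { // X = 8: enough that the later requests hit the 120 s timeout
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			resp, err := http.Post("http://localhost:8080/ocr", "application/json", bytes.NewReader(payload))
			if err != nil {
				fmt.Println(n, "error:", err)
				return
			}
			resp.Body.Close()
			fmt.Println(n, "status:", resp.Status)
		}(i)
	}
	wg.Wait()
}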
xf0e commented 5 years ago

I got a workaround for the leaking memory in cli_worker, but I think we should address the actual issue, since this workaround would not help if the request which is being processed times out.

w.channel.Qos(1, 0, true) in ocr_rpc_worker.go in the Run() function will prevent the worker from getting more than one message at a time from the queue.
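For reference, a sketch of where that lands in the streadway/amqp API (the function wrapper is mine, not open-ocr's):

package ocr

import "github.com/streadway/amqp"

// limitPrefetch applies the workaround: with a prefetch count of 1,
// rabbitmq holds back further deliveries until the current message is
// acknowledged, so a slow OCR job no longer pulls every queued 30 MB
// payload into the worker's memory at once.
func limitPrefetch(ch *amqp.Channel) error {
	return ch.Qos(
		1,    // prefetchCount: at most one unacknowledged message
		0,    // prefetchSize: no byte-count limit
		true, // global, as in the workaround above
	)
}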

tleyden commented 5 years ago

Ok thanks for the explanations and the steps to repro.