nsqio / go-nsq

The official Go package for NSQ
MIT License

consumer: memory use (possible leak?) in readLoop and ReadResponse #183

Closed bctsui closed 8 years ago

bctsui commented 8 years ago

This is similar to #174 but I feel it's different enough to warrant its own issue.

We have a consumer in production running with client v1.0.6 whose memory usage increases steadily over time -- it starts at under 100 MB, and I've seen it exceed 100 GB within 24 hours.

It seems that the vast majority of memory in use (93%) originates from github.com/bitly/go-nsq.(*Conn).readLoop.

Here is a profile:

$ go tool pprof --inuse_space bin-consumer mem.pprof
Entering interactive mode (type "help" for commands)
(pprof) top2
7.58GB of 8.14GB total (93.09%)
Dropped 143 nodes (cum <= 0.04GB)
Showing top 2 nodes out of 13 (cum >= 2.49GB)
      flat  flat%   sum%        cum   cum%
    5.09GB 62.46% 62.46%     7.58GB 93.10%  github.com/bitly/go-nsq.(*Conn).readLoop
    2.49GB 30.63% 93.09%     2.49GB 30.63%  github.com/bitly/go-nsq.ReadResponse

Here's readLoop (I've removed irrelevant lines for brevity):

(pprof) list readLoop
Total: 8.14GB
ROUTINE ======================== github.com/bitly/go-nsq.(*Conn).readLoop in /home/benny/apsalar.master/go/src/github.com/bitly/go-nsq/conn.go
    5.09GB     7.58GB (flat, cum) 93.10% of Total
         .          .    476:   for {
         .          .    477:       if atomic.LoadInt32(&c.closeFlag) == 1 {
         .          .    478:           goto exit
         .          .    479:       }
         .          .    480:
         .     2.49GB    481:       frameType, data, err := ReadUnpackedResponse(c)
         .          .    482:       if err != nil {
         .          .    504:           c.delegate.OnResponse(c, data)
         .          .    505:       case FrameTypeMessage:
    5.09GB     5.09GB    506:           msg, err := DecodeMessage(data)
         .          .    507:           if err != nil {
         .          .    518:
         .   512.01kB    519:           c.delegate.OnMessage(c, msg)

And here's ReadResponse:

(pprof) list ReadResponse
Total: 8.14GB
ROUTINE ======================== github.com/bitly/go-nsq.ReadResponse in /home/benny/apsalar.master/go/src/github.com/bitly/go-nsq/protocol.go
    2.49GB     2.49GB (flat, cum) 30.63% of Total
         .          .     46://    |  (int32) || (binary)
         .          .     47://    |  4-byte  || N-byte
         .          .     48://    ------------------------...
         .          .     49://        size       data
         .          .     50:func ReadResponse(r io.Reader) ([]byte, error) {
         .          .     51:   var msgSize int32
         .          .     52:
         .          .     53:   // message size
         .   512.01kB     54:   err := binary.Read(r, binary.BigEndian, &msgSize)
         .          .     55:   if err != nil {
         .          .     56:       return nil, err
         .          .     57:   }
         .          .     58:
         .          .     59:   // message binary data
    2.49GB     2.49GB     60:   buf := make([]byte, msgSize)

In our nsq consumer:

  1. We set max-in-flight to 12800.
  2. Counters based on incoming messages are incremented in-memory and flushed to the db once every 10 seconds.
  3. Messages are not marked finished until the db flush has been done.

Each nsq message body is only 12 bytes plus a 26-byte message header, and peak throughput is about 6,000 messages per second. The math would indicate less than 3 MB of message data held in memory during those 10 seconds while waiting for the messages to be marked finished.

Any thoughts on what might be going on here?

bctsui commented 8 years ago

This ended up being an issue outside of go-nsq (not sure how the profiling didn't pick that up). Closing the issue.