uber / tchannel-node

MIT License
203 stars 40 forks source link

Memory usage spikes after connections get in bad state #83

Open markyen opened 9 years ago

markyen commented 9 years ago

After the error in uber/hyperbahn#22 occurs, the memory on several workers spike to several gigabytes:

image

A summary of core dumps taken from affected and normal workers are here: https://gist.github.com/markyen/67286b06b68d1b0b7685

TODO:

/cc @Raynos @ShanniLi

Raynos commented 9 years ago

Reap out requests every 30 seconds

Add a special case to operations._sweepOps to deal with out requests and purge anything where now - req.start < configurable sanity sweep ttl

The sanity sweep ttl should probably 30s; if any out req is longer then 30s get rid of it from memory.

This has to be configurable for hyperbahn and might be 2minutes; 5minutes; 10minutes; ???

Raynos commented 9 years ago

Remove arg2 and arg3 from req/res

On out_request, streaming_out_request, request.js and v2/out_request we want to ensure that the arg3 and arg2 fields are removed.

Holding on to buffers leaks slow buffers.

Leaking OutRequest should not leak buffers. At some point we should do the same for InRequest, InResponse and OutResponse but those will be a lot harder.

It's not save to just remove arg3 and arg2. We will have to audit the entire path from req.send() to conn.socket.write() to make sure we do not leak the buffers.

Raynos commented 9 years ago

Verify operations after error frame

There might be a weird race condition where:

I have no idea if this is true or what. But we need to verify that out of order error frames and continue frames do not leak a request in an operations data structure.

Raynos commented 8 years ago

Write to socket after close

In the heap we saw a whole bunch of writable streams that are backed up and had too much stuff in the queue when the stream was closed.

We need to add logging and defensive checks about stream.write() when it's closed.

Raynos commented 8 years ago

Search for "Writable" or "Duplex"

There were far too many streams in the core. Streams are hard to reason about because they have inbuild queuing, we should get rid of all streams except for streaming mode and TCP sockets.

This means we may have to look at how request() is implemented when you send a arg3 > 65kb.

I want to make sure that we really, really do not create any ArgStream instances unless the user very explicitely opts into streaming.

We should also audit ringpop for whether it opts into streaming.

Raynos commented 8 years ago

Must purge connection when reset

There should never be closed sockets in the core dump.

I don't know why or how we have closed sockets but we had a lot of them. We need to get rid of connections in memory when the socket closes.

Raynos commented 8 years ago

Forcibly clear socket write queue

There looks like there was a lot of junk queued up in socket._writableState.buffer.

We should be naughty and flush the writable stream queue when a socket closes or we resetAll() a connection.

Raynos commented 8 years ago

Remove closed connections from memory

Once again we have closed connections, we never should have them lying around.

Let's make sure they are really really gone and we are not leaking them.

Raynos commented 8 years ago

write helper in hyperbahn for verifying empty operations

We saw an out request operation leak. To gaurd against this we should make all the hyperbahn tests assert that operations on all sockets are empty.

We really also need to write a test for the timed out call request.

Raynos commented 8 years ago

write test in hyperbahn for streaming call requests timing out.

We really also need to write a test for the timed out call request.

What we saw was that making a streaming out request and have the in request time out in the server would leak an out request.

Sounds like a boog.

Raynos commented 8 years ago

here is an example log from production:

{
  "_type": "logs",
  "_id": "AVCXdO8QYKR_7X5_NCJ3",
  "_score": null,
  "_source": {
    "level": "error",
    "msg": "responseAlreadyDone detected!!",
    "fields": "",
    "@version": "1",
    "@timestamp": "2015-10-24T01:27:13.528Z",
    "fields_": {
      "currentState": 3,
      "codeString": "Timeout",
      "errMessage": "request timed out after 1005ms (limit was 1000ms)",
      "callerName": "ringpop",
      "serviceName": "ringpop",
      "listener": false,
      "listeners": 0,
      "_hostname": "myhost.com",
      "_pid": 26408,
      "_processTitle": "my-app-1"
    }
  },
  "sort": [
    1445650033528
  ]
}
weikai77 commented 8 years ago

This happened again today in geospatial. I didn't take core dump, but the heap dump is available at: https://people.uberinternal.com/~kwei/

image

Raynos commented 8 years ago

This happened again today in geospatial.

We took a heapsnapshot.

We are definitely leaking: