Open markyen opened 9 years ago
Add a special case to operations._sweepOps to deal with out requests: purge anything where now - req.start > a configurable sanity sweep ttl.
The sanity sweep ttl should probably be 30s; if any out req lives longer than 30s, get rid of it from memory.
This has to be configurable for hyperbahn and might be 2 minutes; 5 minutes; 10 minutes; ???
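A minimal sketch of what such a sweep could look like. The names (`sweepOutOps`, `sanitySweepTtl`) and the shape of the ops map are assumptions for illustration, not TChannel's actual internals:

```javascript
// Hypothetical sketch: purge out requests that have been in flight
// longer than a configurable sanity sweep TTL.
// `ops` maps request id -> { start: timestampMs, req: OutRequest }.
var DEFAULT_SANITY_SWEEP_TTL = 30 * 1000; // 30s default; hyperbahn may want minutes

function sweepOutOps(ops, now, sanitySweepTtl) {
    sanitySweepTtl = sanitySweepTtl || DEFAULT_SANITY_SWEEP_TTL;
    var purged = [];
    Object.keys(ops).forEach(function (id) {
        var op = ops[id];
        // Anything older than the TTL is assumed stuck and gets dropped.
        if (now - op.start > sanitySweepTtl) {
            delete ops[id];
            purged.push(id);
        }
    });
    return purged;
}
```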
On out_request, streaming_out_request, request.js and v2/out_request we want to ensure that the arg2 and arg3 fields are removed.
Holding on to buffers leaks slow buffers.
Leaking OutRequest should not leak buffers. At some point we should do the same for InRequest, InResponse and OutResponse but those will be a lot harder.
It's not safe to just remove arg2 and arg3. We will have to audit the entire path from req.send() to conn.socket.write() to make sure we do not leak the buffers.
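One way this could be made safe, sketched below: only drop the arg buffers once the socket write callback fires, i.e. after nothing downstream can still need them. The `writeAndReleaseArgs` helper and the frame/socket shapes are assumptions for illustration:

```javascript
// Hypothetical sketch: release req.arg2/arg3 only after the frame has
// been handed off to the socket, so no code on the write path still
// holds or needs the buffers.
function writeAndReleaseArgs(socket, req, frame, callback) {
    socket.write(frame, function onFlushed(err) {
        // Safe point: the frame is flushed; nobody reads the args now.
        req.arg2 = null;
        req.arg3 = null;
        callback(err);
    });
}
```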
There might be a weird race condition here.
I have no idea if this is true or not, but we need to verify that out-of-order error frames and continue frames do not leak a request in an operations data structure.
In the heap we saw a whole bunch of writable streams that are backed up and had too much stuff in the queue when the stream was closed.
We need to add logging and defensive checks about stream.write() when it's closed.
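A defensive wrapper along these lines could log and drop writes that arrive after close instead of queueing them into a dead writable. This is a sketch; the `guardedWrite` name and the logger shape are assumptions:

```javascript
// Hypothetical sketch: refuse to write into a closed/destroyed stream,
// and log it so we can find the caller, instead of silently queueing
// into a writable that will never drain.
function guardedWrite(stream, chunk, logger) {
    if (stream.destroyed || !stream.writable) {
        logger.warn('write after close dropped', { bytes: chunk.length });
        return false;
    }
    return stream.write(chunk);
}
```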
There were far too many streams in the core. Streams are hard to reason about because they have built-in queuing; we should get rid of all streams except for streaming mode and TCP sockets.
This means we may have to look at how request() is implemented when you send an arg3 > 65kb.
I want to make sure that we really, really do not create any ArgStream instances unless the user very explicitly opts into streaming.
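One way to enforce that, as a sketch: fail loudly on a large arg3 unless the caller has explicitly opted in. The `checkArgs` helper and the `streamed` option name are assumptions; the 0xffff limit matches a v2 frame's maximum arg chunk:

```javascript
// Hypothetical guard: never silently fall into streaming mode for a
// large arg3; the caller must explicitly opt in.
var MAX_ARG_CHUNK = 0xffff; // largest arg chunk that fits in one v2 frame

function checkArgs(options, arg3) {
    if (arg3.length > MAX_ARG_CHUNK && !options.streamed) {
        throw new Error('arg3 larger than ' + MAX_ARG_CHUNK +
            ' bytes requires explicitly opting into streaming');
    }
    return true;
}
```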
We should also audit ringpop for whether it opts into streaming.
There should never be closed sockets in the core dump.
I don't know why or how we have closed sockets but we had a lot of them. We need to get rid of connections in memory when the socket closes.
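The fix presumably looks something like this sketch: the owner drops its reference to the connection the moment the underlying socket closes, so closed sockets can never accumulate. The `trackConnection` name and the `connections` map shape are assumptions:

```javascript
// Hypothetical sketch: drop the connection from memory as soon as its
// socket emits 'close', so closed sockets never linger in the pool.
function trackConnection(connections, id, socket) {
    connections[id] = socket;
    socket.on('close', function onClose() {
        delete connections[id];
    });
}
```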
It looks like there was a lot of junk queued up in socket._writableState.buffer.
We should be naughty and flush the writable stream queue when a socket closes or we resetAll() a connection.
Once again we have closed connections, we never should have them lying around.
Let's make sure they are really really gone and we are not leaking them.
We saw an out request operation leak. To guard against this we should make all the hyperbahn tests assert that operations on all sockets are empty.
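Such an assertion could be a small shared test helper, roughly like the sketch below. The `assertNoPendingOps` name and the connection/ops shapes are assumptions:

```javascript
// Hypothetical test helper: after each hyperbahn test, assert that no
// connection still holds pending operations.
function assertNoPendingOps(assert, connections) {
    connections.forEach(function (conn) {
        var pending = Object.keys(conn.ops || {});
        assert.strictEqual(pending.length, 0,
            'expected no pending ops, found: ' + pending.join(','));
    });
}
```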
We really also need to write a test for the timed out call request.
What we saw was that making a streaming out request and having the in request time out on the server would leak an out request.
Sounds like a boog.
Here is an example log from production:

```json
{
  "_type": "logs",
  "_id": "AVCXdO8QYKR_7X5_NCJ3",
  "_score": null,
  "_source": {
    "level": "error",
    "msg": "responseAlreadyDone detected!!",
    "fields": "",
    "@version": "1",
    "@timestamp": "2015-10-24T01:27:13.528Z",
    "fields_": {
      "currentState": 3,
      "codeString": "Timeout",
      "errMessage": "request timed out after 1005ms (limit was 1000ms)",
      "callerName": "ringpop",
      "serviceName": "ringpop",
      "listener": false,
      "listeners": 0,
      "_hostname": "myhost.com",
      "_pid": 26408,
      "_processTitle": "my-app-1"
    }
  },
  "sort": [
    1445650033528
  ]
}
```
This happened again today in geospatial. I didn't take a core dump, but the heap dump is available at: https://people.uberinternal.com/~kwei/
This happened again today in geospatial.
We took a heapsnapshot.
We are definitely leaking:
After the error in uber/hyperbahn#22 occurs, the memory on several workers spike to several gigabytes:
A summary of core dumps taken from affected and normal workers are here: https://gist.github.com/markyen/67286b06b68d1b0b7685
TODO:
/cc @Raynos @ShanniLi