ejona86 opened 6 years ago
Oh, and apparently something between 1.6.1 and 1.11.0 could have made this more noticeable for some workloads (as that's how @jpitz hit it). It's unclear what that change could have been.
I guess there's also a question of whether users benefit from disabling the thread-local cache for non-Netty threads, or whether limiting the number of arenas is the major benefit.
One thing to note is that I saw this issue both in our Mesos cluster and on my local machine (13" OS X laptop). So the heap arenas didn't actually change that much when testing locally.
I'll still test on Monday to verify my hypothesis, but I'm pretty confident that it's due to the disabling of the cache.
I think another question here is whether I'm doing something unexpected in our integration that's causing this problem, rather than it being an issue in the core framework. That's another thing that can easily be tested.
@ejona86 : verified that as soon as I flipped that boolean back to true, we started leaking memory again:
No change in arena allocation, only in thread cache use.
So this should be trivial to reproduce: create lots of threads and do a single RPC on each. I am questioning whether there is a memory leak, so we should probably check whether the memory usage stabilizes or continues to grow as old threads die and new ones are created.
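A minimal sketch of that repro pattern (the RPC itself is stubbed out here, since the interesting part is the thread churn; with a caching pooled allocator, each short-lived thread would build a thread-local cache that's only reclaimed after the thread dies and is GC'd):

```java
import java.util.concurrent.CountDownLatch;

public class ThreadChurnRepro {
    // Spins up numThreads short-lived threads, each doing one unit of
    // work before dying; returns the number that completed.
    static int runOnce(int numThreads) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(numThreads);
        for (int i = 0; i < numThreads; i++) {
            new Thread(() -> {
                doSingleRpc(); // hypothetical stand-in for one gRPC call
                done.countDown();
            }).start();
        }
        done.await();
        return numThreads;
    }

    // Hypothetical stand-in; a real repro would issue one RPC here and
    // then watch whether allocator memory stabilizes as threads churn.
    static void doSingleRpc() {}

    public static void main(String[] args) throws InterruptedException {
        System.out.println("completed " + runOnce(200) + " one-shot threads");
    }
}
```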
So to add onto this, after making those changes, I'm still finding my application is running out of memory (just at a much slower rate). After profiling my application locally for several hours, a thread-local implementation crept up to the top of the list. The thread local appears to come through the use of the PooledByteBufAllocator (despite disabling the ThreadLocalCache).
My understanding is that the MessageFramer in the gRPC implementation runs on threads supplied via the provided .executor(Executor) method. During profiling, I noticed that many of these threads are periodically shut down and replaced with new threads with new IDs. As far as I can tell, we aren't de-allocating that buffer anywhere, so this slowly leaks over time (depending on what your thread TTL is set to).
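One way to take thread TTL out of the picture would be to hand gRPC a fixed-size executor whose threads never retire, so any allocator thread-locals are bounded by the pool size rather than growing with thread churn. A sketch (the factory name is mine; wiring it in via ServerBuilder.executor(...) is left to the application):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public class FixedGrpcExecutor {
    // A fixed-size pool whose threads never time out. Threads are
    // created once and reused, so per-thread caches stop accumulating.
    public static ExecutorService create(int size) {
        AtomicInteger id = new AtomicInteger();
        ThreadFactory tf = r -> {
            Thread t = new Thread(r, "grpc-app-exec-" + id.incrementAndGet());
            t.setDaemon(true);
            return t;
        };
        return Executors.newFixedThreadPool(size, tf);
    }
}
```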
To be sure no errors were propagating and then being swallowed, I wrote a quick little thread factory that wrapped the runnable with a try/catch to prevent unintended thread death.
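For reference, such a wrapping thread factory might look like this (class and message names are illustrative, not from the actual code):

```java
import java.util.concurrent.ThreadFactory;

public class CatchingThreadFactory implements ThreadFactory {
    private final ThreadFactory delegate;

    public CatchingThreadFactory(ThreadFactory delegate) {
        this.delegate = delegate;
    }

    @Override
    public Thread newThread(Runnable r) {
        // Wrap the runnable so any throwable is logged instead of
        // silently killing the thread.
        return delegate.newThread(() -> {
            try {
                r.run();
            } catch (Throwable t) {
                System.err.println("task died unexpectedly: " + t);
            }
        });
    }
}
```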
I'm actually pretty sure the last leak I described above is coming from a Jetty server that runs in the application. Feel free to ignore my last comment.
Scratch that. While my memory continued to grow, it eventually GC'd and properly cleaned up the naming pool that I saw growing unexpectedly.
The only way I was able to solve the small leak above was by swapping over to using the UnpooledByteBufAllocator.DEFAULT implementation.
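As an aside, Netty can also be switched to the unpooled allocator globally via the io.netty.allocator.type system property, provided it's set before any Netty classes load (Netty reads it once at class initialization). A sketch of setting it programmatically, as an alternative to wiring UnpooledByteBufAllocator.DEFAULT through directly:

```java
public class UseUnpooledAllocator {
    public static void main(String[] args) {
        // Must run before any Netty class that touches the allocator
        // config is loaded; Netty reads this property only once.
        System.setProperty("io.netty.allocator.type", "unpooled");
        System.out.println(System.getProperty("io.netty.allocator.type"));
    }
}
```

The same effect can be had with `-Dio.netty.allocator.type=unpooled` on the JVM command line, which avoids the class-loading-order concern entirely.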
I didn't see a drop in the number of requests a client issues, but I did see a gap open up between the number of requests a client sent and the number of requests the server had processed.
Some years ago Netty changed its default to no longer cache on non-Netty threads: https://github.com/netty/netty/commit/4797b88d89304763fd92eeff73d077effa53685e . gRPC upgraded to that Netty version in 1.48.
gRPC uses ByteBufs from many non-Netty threads for message serialization. When the application has many threads, the per-thread caches can get out of hand.
Start of relevant discussion: https://gitter.im/grpc/grpc?at=5ac698877c3a01610db0f27b
User solution: https://gitter.im/grpc/grpc?at=5ac796016d7e07082bc71e0a
It references: netty/netty#5930
#1758 is related, as it seems allocating the per-thread cache has a noticeable cost. I thought there may be another related issue, but couldn't find it; maybe I'm thinking of that Netty one.
@jpitz had success just outright disabling caching for non-Netty threads. I'm slightly surprised that didn't cause a CPU increase, as that would seem to mean our direct buffers are no longer cached. If our benchmarks show similar behavior, then SGTM.
However, I also wonder if we should swap to heap buffers for message serialization and framing. Especially for small messages. That might hide most of the cost of removing the cache, since for larger messages the cost of direct buffer allocation is likely dwarfed by any copying.
CC @carl-mastrangelo