Closed swenson closed 7 years ago
As I recall, the reason we moved to `Hk2ThreadLocal` was that it solved at least one class of memory leak: it lets you call `removeAll` from any thread, which wipes out all threads' memory. Do you have a simple test somewhere that demonstrates the memory leak you are seeing?
I don't have a simple test at the moment -- this typically causes our applications to crash after hours of heavy usage.
I assume the GC roots of all of those `HashMap`s go back to the `Hk2ThreadLocal`. Are you using many, many threads? I'm just trying to get a handle on the circumstances under which this leak manifests itself. Can you describe why you think this might be happening?
After processing very heavy load on a Jersey server -- I believe many, many, many threads are used over hours or days to serve requests, and their entries accumulate in the `Hk2ThreadLocal`'s hash map. Since `Hk2ThreadLocal` uses a plain `HashMap` instead of a `ThreadLocalMap` (which uses weak references, I believe), I would guess that the `Long` keys and values accumulate over time.
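For illustration, here is a minimal sketch (not the actual hk2 source) of why a thread-id-keyed map leaks when threads come and go:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified illustration only -- not the real Hk2ThreadLocal implementation.
// Values are keyed by thread id, so entries for threads that have exited are
// never removed unless removeAll() happens to be called.
class ThreadIdKeyedLocal<V> {
    private final Map<Long, V> perThread = new ConcurrentHashMap<>();

    public V get() {
        return perThread.get(Thread.currentThread().getId());
    }

    public void set(V value) {
        // The entry outlives the thread: once the thread dies, the map still
        // strongly references the key and value, so the GC cannot reclaim them.
        perThread.put(Thread.currentThread().getId(), value);
    }

    public void removeAll() {
        perThread.clear();
    }
}
```

By contrast, `java.lang.ThreadLocal` stores the value inside the `Thread` object's own `ThreadLocalMap` (with weakly referenced keys), so everything becomes unreachable once the thread terminates.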
I'm not getting it. There would be one or two of these maps, each with a single entry per thread. Unless you actually have 7.7 million threads, I am not understanding where these things come from. Or that many `ServiceLocator`s (which is where the `WeakHashMap`s could be coming from). How many `ServiceLocator`s do you see? Can you find one and print out its ID (for example, as in the snippet below)? That would give some information about how many there are.
Also if there are 7.7 million threads... well... that sounds like a Jersey thread leak...
Unless... maybe if there HAVE been 7.7 million threads over time, the hk2 thread local might be holding on to dead threads' data. I'll at least look into that.
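A rough example of the kind of diagnostic meant above; the helper name is made up, but `ServiceLocator#getName()` and `ServiceLocator#getLocatorId()` are part of the hk2 API:

```java
import org.glassfish.hk2.api.ServiceLocator;

// Hypothetical one-off diagnostic: log the identity of whatever locator the
// application can reach (e.g. one injected into a resource). Calling this from
// a few places shows how many distinct locator IDs are actually in play.
public final class LocatorDiagnostics {

    private LocatorDiagnostics() {}

    public static void dump(ServiceLocator locator) {
        System.out.println("ServiceLocator name=" + locator.getName()
                + " id=" + locator.getLocatorId());
    }
}
```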
I think that there HAVE been 7.7 million threads over time. In one of our services, it takes weeks of high load to produce this behavior. (And a version of hk2 with the old `ThreadLocal` does not exhibit the same memory leak.)
We are handling the requests using the async `@Suspended AsyncResponse resp` model, and the work inside the resource methods happens in `CompletableFuture` chains, which run on the default fork-join pool. The resumption, `resp.resume()`, is also called from that chain.
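Roughly, the request handling pattern looks like this (an illustrative resource, not our real service code):

```java
import java.util.concurrent.CompletableFuture;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.container.AsyncResponse;
import javax.ws.rs.container.Suspended;

// The resource method suspends the request; a CompletableFuture chain running
// on ForkJoinPool.commonPool() eventually calls resp.resume().
@Path("/work")
public class AsyncWorkResource {

    @GET
    public void doWork(@Suspended final AsyncResponse resp) {
        CompletableFuture
                .supplyAsync(() -> expensiveComputation()) // default fork-join pool
                .thenApply(result -> "done: " + result)
                .whenComplete((result, error) -> {
                    if (error != null) {
                        resp.resume(error);
                    } else {
                        resp.resume(result);
                    }
                });
    }

    private String expensiveComputation() {
        return "42";
    }
}
```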
To @swenson's note: indeed, the number of requests this service handles is in the tens of millions. We can also reproduce it under very heavy load in around 4-6 hours; I will re-run the crash test tomorrow to get exact numbers.
Also, yes, the heap graphs are night and day between the version with this change and the version without it. We have now had a stable run with the previous (non-`Hk2ThreadLocal`) version for around a month without restarts, and heap usage is running between 0.36 and 0.46 percent of total.
Yes, this was a bug and a pretty bad memory leak. I just pushed a fix for it; it should be available on GitHub shortly. The fix will be in the next release of hk2.
I also added a test, so I am closing this issue, but do tell me whether it helps with your use case!
Thank you so much!
@jwells131313 do you know when the next release will be?
hk2 version 2.5.0-b31 was released today; it has the memory leak fix. I don't know when Jersey will officially pick it up. It should just be a drop-in replacement for b30, which is what the latest Jersey is using today.
We (myself and @laazik and others) are tracking down a memory leak, and it seems to be related to the switch from `ThreadLocal` to `Hk2ThreadLocal` in `PerLocatorUtilities`: https://github.com/hk2-project/hk2/commit/4fa6c59581b5d8dbdf9fbd1da0cd643169734ba9

`Hk2ThreadLocal` appears to have fewer protections, such as the `ThreadLocalMap` used in `ThreadLocal`.

Would it be possible to switch back to `ThreadLocal` while we work on a more complete `Hk2ThreadLocal`?

Here is an example heap histogram from one of our crashed instances. You can see the numbers of `WeakHashMap`, `WeakHashMap.Entry`, `HashMap.Node`, and `Long` instances are highly correlated and massive.

I would be happy to work with you on this. Please advise.