Closed swenson closed 7 years ago
As I recall, the reason we moved to `Hk2ThreadLocal` was that it solved at least one class of memory leak: it lets you call `removeAll` from any thread, which wipes out all threads' memory. Do you have a simple test somewhere that demonstrates the memory leak you are seeing?
I don't have a simple test at the moment -- this typically causes our applications to crash after hours of heavy usage.
I assume the GC roots of all of those `HashMap`s go back to the `Hk2ThreadLocal`. Are you using many, many threads? I'm just trying to get a handle on the circumstances under which this leak manifests itself. Can you describe why you think this might be happening?
After processing very heavy load on a Jersey server -- I believe many, many, many threads are used over hours or days to serve requests, and their entries accumulate in the `Hk2ThreadLocal`'s hash map. Since `Hk2ThreadLocal` uses a plain `HashMap` instead of a `ThreadLocalMap` (which uses weak references, I believe), I would guess that the `Long` keys and values accumulate over time.
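For illustration, here is a minimal sketch (not the actual hk2 source) of why a thread-id-keyed map leaks when threads come and go:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified illustration only -- not the real Hk2ThreadLocal implementation.
// Values are keyed by thread id, so entries for threads that have exited are
// never removed unless removeAll() happens to be called.
class ThreadIdKeyedLocal<V> {
    private final Map<Long, V> perThread = new ConcurrentHashMap<>();

    public V get() {
        return perThread.get(Thread.currentThread().getId());
    }

    public void set(V value) {
        // The entry outlives the thread: once the thread dies, the map still
        // strongly references the key and value, so the GC cannot reclaim them.
        perThread.put(Thread.currentThread().getId(), value);
    }

    public void removeAll() {
        perThread.clear();
    }
}
```

By contrast, `java.lang.ThreadLocal` stores the value inside the `Thread` object's own `ThreadLocalMap` (with weakly referenced keys), so everything becomes unreachable once the thread terminates.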
I'm not getting it. There would be one or two of these maps, each with a single entry per thread. Unless you actually have 7.7 million threads, I am not understanding where these things come from. Or that many `ServiceLocator`s (which is where the `WeakHashMap`s could be coming from). How many `ServiceLocator`s do you see? Can you find one and print out its ID (for example, as in the snippet below)? That would give some information about how many there are.
Also if there are 7.7 million threads... well... that sounds like a Jersey thread leak...
Unless... maybe if there HAVE been 7.7 million threads over time, the hk2 thread local might be holding on to dead threads' data. I'll at least look into that.
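A rough example of the kind of diagnostic meant above; the helper name is made up, but `ServiceLocator#getName()` and `ServiceLocator#getLocatorId()` are part of the hk2 API:

```java
import org.glassfish.hk2.api.ServiceLocator;

// Hypothetical one-off diagnostic: log the identity of whatever locator the
// application can reach (e.g. one injected into a resource). Calling this from
// a few places shows how many distinct locator IDs are actually in play.
public final class LocatorDiagnostics {

    private LocatorDiagnostics() {}

    public static void dump(ServiceLocator locator) {
        System.out.println("ServiceLocator name=" + locator.getName()
                + " id=" + locator.getLocatorId());
    }
}
```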
I think that there HAVE been 7.7 million threads over time. In one of our services, it takes weeks of high load to produce this behavior. (And a version of hk2 with the old `ThreadLocal` does not exhibit the same memory leak.)
We are handling the requests using the async `@Suspended AsyncResponse resp` model, and the work inside the resource methods happens in `CompletableFuture` chains, which run on the default fork-join pool. The resumption, `resp.resume()`, is also called from that chain.
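Roughly, the request handling pattern looks like this (an illustrative resource, not our real service code):

```java
import java.util.concurrent.CompletableFuture;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.container.AsyncResponse;
import javax.ws.rs.container.Suspended;

// The resource method suspends the request; a CompletableFuture chain running
// on ForkJoinPool.commonPool() eventually calls resp.resume().
@Path("/work")
public class AsyncWorkResource {

    @GET
    public void doWork(@Suspended final AsyncResponse resp) {
        CompletableFuture
                .supplyAsync(() -> expensiveComputation()) // default fork-join pool
                .thenApply(result -> "done: " + result)
                .whenComplete((result, error) -> {
                    if (error != null) {
                        resp.resume(error);
                    } else {
                        resp.resume(result);
                    }
                });
    }

    private String expensiveComputation() {
        return "42";
    }
}
```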
To @swenson's note: indeed, the number of requests this service handles is in the tens of millions. We can also reproduce it under very heavy load in around 4-6 hours; I will re-run the crash test tomorrow to get exact numbers.
Also, yes, the heap graphs are night and day between the version with this change and the version without it. We have now had a stable run with the previous (non-`Hk2ThreadLocal`) version for around a month without restarts, and heap usage is running between 0.36 and 0.46 percent of total.
Yes, this was a bug and a pretty bad memory leak. I just pushed a fix for it; it should be available on GitHub shortly. The fix will be in the next release of hk2.
I also added a test, so I am closing this issue, but do tell me whether it helps with your use case!
Thank you so much!
@jwells131313 do you know when the next release will be?
hk2 version 2.5.0-b31 was released today; it has the memory leak fix. I don't know when Jersey will officially pick it up. It should just be a drop-in replacement for b30, which is what the latest Jersey is using today.
We (myself and @laazik and others) are tracking down a memory leak, and it seems to be related to the switch from `ThreadLocal` to `Hk2ThreadLocal` in `PerLocatorUtilities`: https://github.com/hk2-project/hk2/commit/4fa6c59581b5d8dbdf9fbd1da0cd643169734ba9

`Hk2ThreadLocal` appears to have fewer protections, such as the `ThreadLocalMap` used in `ThreadLocal`.

Would it be possible to switch back to `ThreadLocal` while we work on a more complete `Hk2ThreadLocal`?

Here is an example heap histogram from one of our crashed instances. You can see the numbers of `WeakHashMap`, `WeakHashMap.Entry`, `HashMap.Node`, and `Long` instances are highly correlated and massive.

I would be happy to work with you on this. Please advise.