spring-projects / spring-integration-extensions

The Spring Integration Extensions project provides extension components for Spring Integration
http://www.springintegration.org/
280 stars 266 forks source link

Fix a memory leak in hazelcast LeaderInitiator #236

Closed mleguevel closed 3 years ago

mleguevel commented 3 years ago

According to the hazelcast team: "The logic assumes that locks are generally acquired & released in a fairly short time or hold a very long time without unlocking. But in this case, is a bit different, it holds the lock for a long time but also does lock/unlock very frequently". The previous implementation used the described logic, first acquiring a lock and then doing frequent tryLock/unlock. Doing this leads to com.hazelcast.cp.internal.datastructures.lock.Lock#ownerInvocationRefUids to grow without ever being cleaned thus leading to an OutOfMemory error eventually.

pivotal-issuemaster commented 3 years ago

@mleguevel Please sign the Contributor License Agreement!

Click here to manually synchronize the status of this Pull Request.

See the FAQ for frequently asked questions.

pivotal-issuemaster commented 3 years ago

@mleguevel Thank you for signing the Contributor License Agreement!

artembilan commented 3 years ago

Hi, @mleguevel !

Would you mind to explain the fix another way, so it is easy to understand what is going on? And why have you changed the test? Don't we need to hon or that Raft algorithm any more?

Thanks

mleguevel commented 3 years ago

Hello @artembilan,

About the fix: Currently, hazelcast has a bug when doing an initial tryLock on the FencedLock and then doing a lot of tryLock/unlock. In other words, we can't hold a lock for a long time while also calling tryLock and unlock frequently on it. This bug causes an internal map to grow indefinitely and thus leading to an OOM. They have acknowledged that it isn't a normal behavior but they don't expect to fix it before version 4.2 (the current version being 4.0.3). The idea behind this fix is to stop doing the frequent tryLock / unlock. As I said in the bug report, I'm not sure that the algorithm is totally equivalent. I'm not sure about the "else" of the original algorithm in particular:

else {
    this.locked = false;
    // We were not able to acquire it, therefore not leading any more
    handleRevoked();
    if (isRunning()) {
        // Try again quickly in case the lock holder dropped it
        Thread.sleep(LeaderInitiator.this.busyWaitMillis);
    }
}

Because I don't know in which case this is invoked. I was hoping you could help me improve the fix if you think that it's not equivalent :)

About the modifications in the test: As I said in the bug report, I assumed that you initialized 3 HazelcastInstance in the tests to circumvent the fact that hazelcast 3.12.x does not start in unsafe mode with less than 3 nodes. If my assumptions are not correct, I can set the other 2 instances back.

artembilan commented 3 years ago

Merged as https://github.com/spring-projects/spring-integration-extensions/commit/b568662cdbd78f6cda084995a2cf0fd7478971ff.

@mleguevel , thank you for contribution; looking forward for more!