hazelcast / hazelcast

Hazelcast is a unified real-time data platform combining stream processing with a fast data store, allowing customers to act instantly on data-in-motion for real-time insights.
https://www.hazelcast.com
Other
6.08k stars 1.83k forks source link

[lock] lease lock not released under cluster shutdown test, 3.8-SANPSHOT #9256

Closed Danny-Hazelcast closed 7 years ago

Danny-Hazelcast commented 7 years ago

nightly builds

https://hazelcast-l337.ci.cloudbees.com/view/shutdown/job/shutdown-lease-lock/ https://hazelcast-l337.ci.cloudbees.com/view/rollup/job/rollup-lease-lock/

some times fail, and the lease locks are not released. members are being shutdown and restarted.

https://hazelcast-l337.ci.cloudbees.com/view/rollup/job/rollup-lease-lock/8/console http://54.87.52.100/~jenkins/workspace/rollup-lease-lock/3.7.3/2016_11_15-03_08_36

HzMember1HZ validate hz.lock.validate.LeaseLockAssert threadId=0 global.AssertionException: lease-lock292.isLocked()==true 

however this fail is rare,

pveentjer commented 7 years ago

Can you provide a better title because the current title doesn't provide any insights what the problem is

jerrinot commented 7 years ago

@Danny-Hazelcast can you link the test scenario please?

mdogan commented 7 years ago

Base class: https://github.com/Danny-Hazelcast/hzCmd-bench/blob/master/src/main/java/hz/lock/validate/base/LockValidate.java

Lease Lock: https://github.com/Danny-Hazelcast/hzCmd-bench/blob/master/src/main/java/hz/lock/validate/LeaseLock.java

Assertion: https://github.com/Danny-Hazelcast/hzCmd-bench/blob/master/src/main/java/hz/lock/validate/LockAssert.java

Main timestep function is:

    public void timeStep() {

        int i = random.nextInt(count);
        ILock lock = getLock(i);

        lock.lock(seconds, TimeUnit.SECONDS);
        incMapIdx(i);
        increments[i]++;
    }

Assertion function is:

    public void timeStep() throws AssertionException {
        ...
        for (int i=0; i<count; i++) {
            if( getIncrementFor(i) != getMapIdx(i) ){
                throw new AssertionException(getInfoFor(i));
            }
        }

        for (int i=0; i<count; i++) {
            ILock lock = getLock(i);
            if(lock.isLocked()){
                throw new AssertionException(lock.getName()+" isLocked()==true");
            }
        }
        setRunning(false);
    }

As far as I see, incMapIdx(i) can be called concurrently, if locking thread is de-scheduled long enough to lock lease expire.

    public void incMapIdx(int i){
        Integer val = lockedMap.get(i);
        if(val==null){
            val=new Integer(0);
        }
        utils.Utils.sleep(sleepMs);
        lockedMap.put(i, ++val);
    }

Since incMapIdx(int i) function is not atomic, threads calling that may overwrite others' increments.

To me, best way to test lock lease is to show progress without calling unlock ever. So, each thread calling lock.lock(seconds, TimeUnit.SECONDS) should progress, no one gets stuck ever.

Danny-Hazelcast commented 7 years ago

Last Fails of the lease lock shutdown test

1) https://hazelcast-l337.ci.cloudbees.com/view/shutdown/job/shutdown-lease-lock/120/consoleFull

HzClient4HZ leaselock hz.lock.validate.LeaseLock threadId=9 com.hazelcast.spi.exception.TargetDisconnectedException: Connection closed by the other side

HzMember1HZ validate hz.lock.validate.LeaseLockAssert threadId=0 global.AssertionException: atomic[lease-lock0=0] map[lease-lock.get(0)=72] 

2) https://hazelcast-l337.ci.cloudbees.com/view/shutdown/job/shutdown-lease-lock/121/consoleFull

HzClient2HZ leaselock hz.lock.validate.LeaseLock threadId=1 com.hazelcast.spi.exception.TargetDisconnectedException: Connection closed by the other side 
05:12:21 HzClient2HZ leaselock hz.lock.validate.LeaseLock threadId=10 com.hazelcast.spi.exception.TargetDisconnectedException: Connection closed by the other side 

HzMember1HZ validate hz.lock.validate.LeaseLockAssert threadId=0 global.AssertionException: atomic[lease-lock0=0] map[lease-lock.get(0)=82] 

3) https://hazelcast-l337.ci.cloudbees.com/view/shutdown/job/shutdown-lease-lock/122/consoleFull

HzMember1HZ validate hz.lock.validate.LeaseLockAssert threadId=0 global.AssertionException: atomic[lease-lock18=92] map[lease-lock.get(18)=91] 

4) https://hazelcast-l337.ci.cloudbees.com/view/shutdown/job/shutdown-lease-lock/123/consoleFull

HzMember1HZ validate hz.lock.validate.LeaseLockAssert threadId=0 global.AssertionException: atomic[lease-lock8=177] map[lease-lock.get(8)=176] 
Danny-Hazelcast commented 7 years ago

yes i agree that,

lock.lock(seconds, TimeUnit.SECONDS);
incMapIdx(i);

is a race, and could be causing the assert issues

but the next question would be, what particle use is there for the lease lock feature, if after calling

lock.lock(seconds, TimeUnit.SECONDS);

you can not gantry that you have the lock.

Danny-Hazelcast commented 7 years ago

the 2 options to change the test are

1) remove validation of concurrent assess

2) increase the lock.lock(seconds, TimeUnit.SECONDS); from 10 seconds to a large number.

Danny-Hazelcast commented 7 years ago

also i note that the original issue was about, the fact that the lease lock remained locked in the validation stage, (which has not been reproduced recently)

and not about the concurrent access issue.

mmedenjak commented 7 years ago

@Danny-Hazelcast I believe the rollup test hangs on wait and that it is just waiting for the bench to finish. But the bench could be stuck because because there is some lock is never released. The assertion is after the wait so we never get any exception.

Can you modify the test so that you have just one run with one HZ and client version, don't wait but rather stop the bench somehow and sleep for more than the lease duration (e.g. if the lease is 10 sec, sleep for 30 sec)?

mmedenjak commented 7 years ago

As for the test, you can modify it so that the increments for a specific value of i are under a java lock - so that the increments never overwrite each other. Perhaps something like this :

public class LeaseLock extends LockValidate {

    public long seconds = 10;
    private int[] increments;
    private Lock[] incrementUpdateLocks;

    public void init() throws Exception {
        super.init();
        increments = new int[count];
        incrementUpdateLocks = new Lock[count];

        for (int i = 0; i < incrementUpdateLocks.length; i++) {
            incrementUpdateLocks[i] = new ReentrantLock();
        }
    }

    public void timeStep() {
        int i = random.nextInt(count);
        ILock lock = getLock(i);
        lock.lock(seconds, TimeUnit.SECONDS);
        try{
            incrementUpdateLocks[i].lock();
            incMapIdx(i);
            increments[i]++;
        } finally {
            incrementUpdateLocks[i].unlock();
        }
    }

    public void postPhase() {
        for (int i = 0; i < count; i++) {
            System.out.println("increments[" + i + "]=" + increments[i]);
            addIncrementFor(i, increments[i]);
        }
    }
}
mmedenjak commented 7 years ago

And the lock lease method is still useful even though you cannot guarantee that you hold the lock. It ensures fairness and that you don't get deadlocks since the lock should always get unlocked even though the thread is stuck/not responding/taking too long.

mmedenjak commented 7 years ago

@Danny-Hazelcast The https://hazelcast-l337.ci.cloudbees.com/view/shutdown/job/shutdown-lease-lock/ test seems to be passing so I suggest just to make some changes to the https://hazelcast-l337.ci.cloudbees.com/view/rollup/job/rollup-lease-lock/ test. It currently goes through these versions : 3.7 3.7.1 3.7.2 3.7.3 3.7.4. I propose to :

mmedenjak commented 7 years ago

Moved to 3.9 since the fix is not trivial and might introduce unforeseen regressions. Once tested we can backport to 3.8.2.

Danny-Hazelcast commented 7 years ago

another instance of this fail 3.9-SANPSHOT

05:49:03  leaselock HzClient2HZ timeout=3900 running bench
05:49:03  leaselock HzClient1HZ timeout=3900 running bench
05:49:03  leaselock HzClient4HZ timeout=3900 running bench
05:49:03  leaselock HzClient3HZ timeout=3900 running bench
05:49:05 validate@class=hzcmd.lock.validate.UnLockedAssert [driver=Member, threads=1, warmup=0, duration=120, bench=Metrics, interval=0, recordException=true, count=500, name=lease-lock]
05:49:36 HzMember1HZ validate hzcmd.lock.validate.UnLockedAssert threadId=0 global.AssertionException: lease-lock156.isLocked()==true 

https://hazelcast-l337.ci.cloudbees.com/view/shutdown/job/shutdown-lease-lock/206/console

Danny-Hazelcast commented 7 years ago

3.8.1 failed with this issue

https://hazelcast-l337.ci.cloudbees.com/view/shutdown/job/shutdown-lease-lock/215/console

/disk1/jenkins/workspace/shutdown-lease-lock/3.8.1/2017_03_26-04_42_05 04:42:05 http://54.87.52.100/~jenkins/workspace/shutdown-lease-lock/3.8.1/2017_03_26-04_42_05

HzMember1HZ validate hzcmd.lock.validate.UnLockedAssert threadId=0 global.AssertionException: lease-lock45.isLocked()==true 

stack trace available at

05:49:37 output/HZ/HzClient3HZ/timeout_stack.txt
05:49:37 output/HZ/HzMember2HZ/exception.txt
05:49:37 output/HZ/HzMember2HZ/timeout_stack.txt
05:49:37 output/HZ/HzMember4HZ/exception.txt
05:49:37 output/HZ/HzMember4HZ/timeout_stack.txt
05:49:37 output/HZ/HzClient4HZ/timeout_stack.txt
05:49:37 output/HZ/HzMember1HZ/exception.txt
05:49:37 output/HZ/HzMember1HZ/timeout_stack.txt
05:49:37 output/HZ/HzMember3HZ/exception.txt
05:49:37 output/HZ/HzMember3HZ/timeout_stack.txt
05:49:37 output/HZ/HzClient2HZ/timeout_stack.txt
05:49:37 output/HZ/HzClient1HZ/timeout_stack.txt
mmedenjak commented 7 years ago

@Danny-Hazelcast the fix has been merged, you can retest this scenario

Danny-Hazelcast commented 7 years ago

@mmedenjak i ran this test a few times https://hazelcast-l337.ci.cloudbees.com/view/shutdown/job/shutdown-lease-lock/

and 2 fails out of 5

mmedenjak commented 7 years ago

@Danny-Hazelcast It uses 3.9-SNAPSHOT from the maven repo? We then need to wait for the fix to be released or you can run on master.

Danny-Hazelcast commented 7 years ago

not using jar from maven repo.

this is the version used INFO: [10.0.0.167]:5701 [HZ] [3.9-SNAPSHOT] Hazelcast Enterprise 3.9-SNAPSHOT (20170808 - 890590d, 63442be) starting at [10.0.0.167]:5701

mmedenjak commented 7 years ago

@Danny-Hazelcast The first hash is from EE and the second hash is from OS, right? So it seems that you are in fact working with hazelcast 3.9-SNAPSHOT as provided from the snapshot repository : https://repository-hazelcast-l337.forge.cloudbees.com/snapshot/com/hazelcast/hazelcast-enterprise-all/3.9-SNAPSHOT/ The commit hash corresponds to a commit from yesterday :

commit 63442be1ac9992483a6f675ebec0a81d4c1d87c3
Author: donnerbart <github@thunderphreak.de>
Date:   Tue Aug 8 14:51:10 2017 +0200

    Cleanup of AbstractNearCacheBasicTest

While my commit was today:

commit 4cd1fa886ece06f34639db17f24b4b47884c4877
Merge: e3c4a18 a8b2df6
Author: Matko Medenjak <medenjak.matko@gmail.com>
Date:   Wed Aug 9 11:13:33 2017 +0200

    Merge pull request #9875 from mmedenjak/unlock-lease-retry

    Retry UnlockIfLeaseExpiredOperation if exception is retryable

So we can wait until there is a new 3.9-SNAPSHOT jar released in the snapshot repository, including my commit, or you can run with the master branch.

Danny-Hazelcast commented 7 years ago

ok so the jar did not contain your fix.

we are not using the jar from maven repo, the jar is built on local box.

Danny-Hazelcast commented 7 years ago

@mmedenjak i ran this test a few times https://hazelcast-l337.ci.cloudbees.com/view/shutdown/job/shutdown-lease-lock/ passed 6 out of 6 using [3.9-SNAPSHOT] Hazelcast Enterprise 3.9-SNAPSHOT (20170810 - b6bcd1d, 311a5e75) starting

so looks like its fixed