Hazelcast LeaderInitiator inconsistent yielding

spring-projects / spring-integration-extensions

The Spring Integration Extensions project provides extension components for Spring Integration

http://www.springintegration.org/

280 stars 266 forks source link

Hazelcast LeaderInitiator inconsistent yielding #239

Closed al81-ru closed 3 years ago

al81-ru commented 3 years ago

Sometimes (more often than not) after yielding a role the same node gets promoted again. Also there are warning logs about InterruptedException error - is that expected behaviour?

sample code for reproduction

sample output

artembilan commented 3 years ago

Thanks for reporting this.

We probably should suppress that java.lang.InterruptedException: sleep interrupted because it is normal behavior when we yield, so we just really interrupt the current active thread.

The behavior with obtaining a leadership back is really an expected behavior, although we might think about some sleep just after yield to let other nodes to take it from us.

Feel free to open Pull Request with some possible solution!

artembilan commented 3 years ago

Well, actually the logic there is like this:

if (isRunning()) {
                                logger.warn("Restarting LeaderSelector for " + this.context + " because of error.", e);
                                LeaderInitiator.this.future =
                                        LeaderInitiator.this.executorService.submit(
                                                () -> {
                                                    // Give it a chance to elect some other leader.
                                                    Thread.sleep(LeaderInitiator.this.busyWaitMillis);
                                                    return LeaderSelector.this.call();
                                                });
                            }

So, we indeed have that Thread.sleep(LeaderInitiator.this.busyWaitMillis); in a new initiated task before we really start selecting from the currently "broken" node. Or that this.busyWaitMillis is too small or you are so unlucky that the same node is selected back because the other one is so slow to take a lock for leadership...

al81-ru commented 3 years ago

I wonder if ICountDownLatch can provide a solution: after yielding set up countdown and await with configured timeout, until other node acquired leadership and finished countdown

artembilan commented 3 years ago

until other node acquired leadership

Now imaging that we don't have other nodes at all, well some transient state when we lost them. So, who is going to fulfill that ICountDownLatch for us? I we still need to be able to re-enter the leadership even if we have just yield.

al81-ru commented 3 years ago

Yes, there is no point in this approach, increasing busyWaitMillis is enough. And for more complicated user logic it's better to use stop() and start()