hseeberger / constructr

Coordinated (etcd, ...) cluster construction for dynamic (cloud, containers) environments
Apache License 2.0

Restart FSM process on seed node failures (fixes #168) #173

Closed nick-nachos closed 6 years ago

nick-nachos commented 6 years ago

As discussed in #168, I have implemented the ability to recover from join timeouts and re-initiate the whole process from scratch. The behavior is configuration driven (see abort-on-join-timeout) and enabled by default; users who want the previous behavior (i.e. failure of the FSM actor) need to set abort-on-join-timeout to true.
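As a hedged sketch of how this flag would be set (the key name comes from this PR; the `constructr` config root is assumed from the project name and may differ in the actual reference.conf):

```hocon
constructr {
  // true restores the pre-PR behavior: the FSM actor fails on join timeout.
  // The default is false, i.e. restart the process from GettingNodes.
  abort-on-join-timeout = true
}
```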

The original issue occurred when new nodes started the joining process while all of the registered seed nodes were down, i.e. the GettingNodes state returned a list of seed nodes that were not online at that moment. This led to a join timeout, which in turn led to an FSM failure.

The proposed solution is quite straightforward: if a join timeout occurs, go back to the GettingNodes state. Doing so creates a loop that will end in one of the two following cases:

For the first case to work correctly, the FSM was enhanced with the following logic: if the cluster member is notified that it has joined the cluster (MemberJoined, MemberUp) while in any of the pre-joining states (BeforeGettingNodes, GettingNodes, Locking), it immediately transitions to the AddingSelf state. This was necessary because executing joinSeedNodes after the node has already joined the cluster has NO effect (Akka ignores the command and logs a warning), which means the FSM would be stuck in the Joining state waiting for a cluster event that had already been delivered before it entered that state, until the state timed out, triggering an endless loop.

This implementation also gave me the opportunity to actually implement the BeforeGettingNodes state, which is depicted in the README, in order to introduce some latency between the Locking and GettingNodes states when a lock request is rejected (returns false). This was particularly useful for reducing the number of lock requests and lock-rejection log messages that occurred in a handful of cases (applied in the scope of #168 as well).
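The transition rules described above can be sketched as a plain-Scala state function. State and event names mirror the PR, but this is an illustrative model, not constructr's actual Akka FSM code:

```scala
// Illustrative model of the transition rules from this PR (not the real FSM).
sealed trait State
case object BeforeGettingNodes extends State
case object GettingNodes extends State
case object Locking extends State
case object Joining extends State
case object AddingSelf extends State

sealed trait Event
case object JoinTimeout extends Event
case object MemberUp extends Event // stands in for both MemberJoined and MemberUp
case object LockRejected extends Event

// None models the old behavior: the FSM actor fails and stops.
def transition(state: State, event: Event, abortOnJoinTimeout: Boolean): Option[State] =
  (state, event) match {
    case (Joining, JoinTimeout) =>
      if (abortOnJoinTimeout) None else Some(GettingNodes) // restart the loop
    case (BeforeGettingNodes | GettingNodes | Locking, MemberUp) =>
      Some(AddingSelf) // already joined: joinSeedNodes would be a no-op
    case (Locking, LockRejected) =>
      Some(BeforeGettingNodes) // back off before querying nodes again
    case (s, _) =>
      Some(s)
  }
```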

I have added a couple of single-jvm integration tests that demonstrate the scenarios this fix applies to.

nick-nachos commented 6 years ago

There is one broken unit test according to the CI; however, it seems to have broken for the Scala 2.12 build only, and the failure was never replicated locally (I have run the tests many times). It is also a test that should not be affected by the changed behavior, which makes the failure seem even more bizarre.

@hseeberger is it possible to trigger a re-run of the CI somehow, in case the failure was due to flakiness?

nick-nachos commented 6 years ago

Status update on the broken unit test: I have confirmed that the tests of ConstructrMachineSpec are indeed flaky. All tests include the following code sequence:

```scala
val machine = system.actorOf( /* ... */ )
machine ! FSM.SubscribeTransitionCallBack(monitor.ref)
```

The issue is that subscription to FSM events does not support any sort of replay: the subscriber only receives events occurring from the moment of subscription onward. This implies that if the FSM's transitions are very fast, the subscription might take place after some of them have already happened; notifications for those transitions will never be received, which effectively breaks the tests.

Locally these tests have never broken, so to provoke the race I introduced a 5 ms delay between the two lines of code mentioned above, i.e.:

```scala
val machine = system.actorOf( /* ... */ )
Thread.sleep(5)
machine ! FSM.SubscribeTransitionCallBack(monitor.ref)
```

After doing so, the first and third tests consistently fail. This is because these two tests transition out of the GettingNodes state very quickly, as the mocked getNodes method introduces no delay at all in the cases that end up breaking.
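The race can be reproduced with a toy event bus that, like an FSM transition subscription, delivers only events published after the subscriber registers (an illustrative sketch, not Akka's implementation):

```scala
import scala.collection.mutable.ListBuffer

// A minimal publish/subscribe bus with no replay buffer: subscribers only
// see events published after they register, just like FSM transition events.
class TransitionBus {
  private val subscribers = ListBuffer.empty[String => Unit]
  def subscribe(f: String => Unit): Unit = subscribers += f
  def publish(transition: String): Unit = subscribers.foreach(_(transition))
}

def demo(): List[String] = {
  val bus  = new TransitionBus
  bus.publish("GettingNodes -> Locking") // fast transition, fires before subscribe
  val seen = ListBuffer.empty[String]
  bus.subscribe(t => seen += t)
  bus.publish("Locking -> Joining")
  seen.toList // only the second transition was observed; the first is lost
}
```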

Given the above, I conclude that the broken unit test in the Travis build (the third one) did in fact fail due to a race condition.

dmsquires commented 6 years ago

It would be great to see this pull request merged.

hseeberger commented 6 years ago

The tests have failed. Please fix.

nick-nachos commented 6 years ago

@hseeberger I introduced a small delay into the first calls of the mocked getNodes method in the tests that failed (see my analysis above), and the issue seems to be fixed.
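A hedged sketch of that kind of fix: the mock delays only its first invocation so the transition monitor has time to subscribe before GettingNodes completes. The 50 ms figure and the shape of the mock are assumptions for illustration, not the values used in the PR:

```scala
import java.util.concurrent.atomic.AtomicInteger

val calls = new AtomicInteger(0)

// Stand-in for the mocked getNodes: the first call sleeps briefly so the
// test's transition monitor can subscribe before the first transition fires.
def getNodes(): List[String] = {
  if (calls.incrementAndGet() == 1) Thread.sleep(50) // assumed delay value
  Nil // scenario under test: no live seed nodes returned
}
```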

hseeberger commented 6 years ago

Thanks!

hseeberger commented 6 years ago

Released 0.19.0.