Closed: nick-nachos closed this pull request 6 years ago
There is one broken unit-test according to the CI, however it seems to have broken for the scala 2.12 build only, and the failure was never replicated locally (I've run the tests a lot of times). It is also a test that shouldn't be affected by the changed behavior, which makes the failure seem even more bizarre.
@hseeberger is it possible to trigger a re-run of the CI somehow, in case the failure was due to flakiness?
Status update on the broken unit test: I have confirmed that the tests of `ConstructrMachineSpec` are indeed flaky. All tests include the following code sequence:
```scala
val machine = system.actorOf( /* ... */ )
machine ! FSM.SubscribeTransitionCallBack(monitor.ref)
```
The issue with that is that subscribing to FSM transition events has no replay semantics whatsoever: the subscriber only starts receiving events that occur from the moment of subscription onward. This means that if the FSM transitions very quickly, the subscription may happen after some transitions have already taken place; notifications for those transitions will never be received, which effectively breaks the tests.
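The race can be modeled without Akka at all. Here is a minimal, illustrative sketch (the names are hypothetical, not ConstructR or Akka code) of a publisher with no replay, like FSM transition notifications: any event published before the subscription lands is silently lost.

```scala
import scala.collection.mutable.ListBuffer

// Illustrative model: subscribers only see events published after they subscribe.
class TransitionPublisher {
  private var subscribers = List.empty[String => Unit]
  def subscribe(f: String => Unit): Unit = subscribers ::= f
  // No replay: only subscribers registered at publish time see the event.
  def publish(transition: String): Unit = subscribers.foreach(_(transition))
}

val publisher = new TransitionPublisher
val seen = ListBuffer.empty[String]

publisher.publish("GettingNodes -> Locking") // fires before subscription: lost
publisher.subscribe(t => seen += t)
publisher.publish("Locking -> Joining")      // fires after subscription: observed

// Only the second transition is ever delivered to the subscriber.
```

This is exactly the shape of the test failure: the monitor never sees the transitions that happened between actor creation and subscription.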
Locally these tests have never broken, so I introduced a 5 ms delay between the two lines of code mentioned above, i.e.:
```scala
val machine = system.actorOf( /* ... */ )
Thread.sleep(5)
machine ! FSM.SubscribeTransitionCallBack(monitor.ref)
```
After doing so, the first and third tests consistently fail. This is because these two tests transition out of the `GettingNodes` state very quickly: the mocked `getNodes` method introduces no delay at all in the cases that end up breaking.
Given the above, I conclude that the unit test that broke in the Travis build (the third one) actually failed due to a race condition.
It would be great if this pull request could be merged; it is highly desired.
The tests have failed. Please fix.
@hseeberger I introduced a small delay at the first calls of the mocked `getNodes` method of the tests that failed (see my analysis above), and the issue seems to be fixed.
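A hedged sketch of what such a fix could look like (the names and the 50 ms value are illustrative assumptions, not the actual spec code): only the first invocation of the mock is delayed, giving the test time to subscribe before the FSM leaves `GettingNodes`.

```scala
import java.util.concurrent.atomic.AtomicInteger

val calls = new AtomicInteger(0)

// Hypothetical mock: delay only the first call so the transition
// subscription is in place before the first state change fires.
def getNodes(): Set[String] = {
  if (calls.incrementAndGet() == 1)
    Thread.sleep(50) // assumed value: long enough for the subscription to land
  Set.empty[String]
}
```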
Thanks!
Released 0.19.0.
As discussed in #168, I have implemented the ability to recover from join timeouts and re-initiate the whole process from scratch. This behavior is configuration driven (see `abort-on-join-timeout`) and enabled by default. Users who want the previous behavior (i.e. failure of the FSM actor) need to set `abort-on-join-timeout` to true.

The original issue occurred when new nodes started the process of joining the cluster while all of the registered seed nodes were down; i.e. the `GettingNodes` state returned a list of seed nodes that were not online at that moment. This led to a `join-timeout`, which in turn led to an FSM failure.

The proposed solution is quite straightforward: if a `join-timeout` occurs, go back to the `GettingNodes` state. Doing so creates a loop that will end in either of the two following cases:

- joining the returned seed nodes eventually succeeds (e.g. because one of them comes back online), or
- `GettingNodes` eventually returns an empty list (which triggers a "happy path" execution for the newly joining nodes).

For the first case to work correctly, the FSM was enhanced with the following logic: if the cluster member is notified that it has joined the cluster (`MemberJoined`, `MemberUp`) while being in any of the pre-joining states (`BeforeGettingNodes`, `GettingNodes`, `Locking`), it immediately transitions to the `AddingSelf` state. This was required because executing `joinSeedNodes` after having already joined the cluster has NO effect (Akka will ignore the command and log a warning message), which means the FSM would be stuck in the `Joining` state waiting for a cluster event that was received before entering that state (until the state timed out, thus triggering an endless loop).

This implementation also gave me the opportunity to actually implement the `BeforeGettingNodes` state, which is depicted in the README file, in order to introduce some latency between the `Locking` and `GettingNodes` states in the cases when a lock request is rejected (returns false). This was particularly useful for reducing the number of lock requests and lock-acquisition-rejection log messages that occurred in a handful of cases (applied in the scope of #168 as well).

I have added a couple of single-jvm integration tests that demonstrate the scenarios this fix applies to.
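The transition logic described above can be sketched as a plain function over the states and events involved. This is an illustrative model under my own reading of the description, not the actual `ConstructrMachine` code; the `next` function and the `Option[State]` encoding (where `None` stands for FSM failure) are assumptions for the sketch.

```scala
// States and events named in the pull-request description.
sealed trait State
case object BeforeGettingNodes extends State
case object GettingNodes       extends State
case object Locking            extends State
case object Joining            extends State
case object AddingSelf         extends State

sealed trait Event
case object JoinTimeout  extends Event
case object MemberJoined extends Event
case object MemberUp     extends Event

// None models the old behavior: the FSM actor fails on a join timeout.
def next(state: State, event: Event, abortOnJoinTimeout: Boolean): Option[State] =
  (state, event) match {
    // New behavior: loop back and retry the whole process from scratch.
    case (Joining, JoinTimeout) if !abortOnJoinTimeout => Some(GettingNodes)
    case (Joining, JoinTimeout)                        => None
    // Already joined while still pre-joining: skip straight to AddingSelf,
    // since a second joinSeedNodes would be ignored by Akka.
    case (BeforeGettingNodes | GettingNodes | Locking, MemberJoined | MemberUp) =>
      Some(AddingSelf)
    case _ => Some(state) // all other combinations left unchanged in this sketch
  }
```

For example, `next(Joining, JoinTimeout, abortOnJoinTimeout = false)` yields `Some(GettingNodes)`, while setting the flag to true yields `None`, i.e. the previous fail-fast behavior.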