real-logic / aeron

Efficient reliable UDP unicast, UDP multicast, and IPC message transport
Apache License 2.0

[Java] Process standby snapshot control signals like normal snapshot signals. #1634

Closed ZachBray closed 2 months ago

ZachBray commented 3 months ago

Previously, the "take a standby snapshot" control signal could be dropped due to back-pressure when appending to the log.

This behaviour, particularly with small term buffer lengths, makes it harder to construct tests in which a specific number of standby snapshots is taken. The only option is to wait for the standby snapshot to appear (for how long?) and retry if it doesn't, i.e., you can guarantee that at least one standby snapshot will be taken, but not exactly one. This makes it harder to write reliable tests for the Aeron Data Retention Regulator (DRR), where we test the DRR behaviour with a certain number of different kinds of snapshots (local vs remote).

This behaviour also means that from the PremiumClusterTool's perspective, the "take a standby snapshot" signal appears to have been processed by the ConsensusModule, but it has actually been dropped, which might make it harder to write operational tooling.

Now, the "take a standby snapshot" control signal behaves like the "take a snapshot" control signal. It "completes", i.e., the control toggle counter is reset, once the action has been appended to the log.
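To illustrate the difference, here is a minimal, self-contained sketch of the two behaviours as I read them from the description above. It is not the ConsensusModule code: `ControlToggleSketch`, `FakeLog`, `tryAppend`, and both `poll...` methods are hypothetical names invented for this example, and the "log" is a stand-in that simply rejects a fixed number of appends to simulate back-pressure.

```java
// Hypothetical sketch only: none of these types are Aeron classes.
final class ControlToggleSketch
{
    enum ToggleState { NEUTRAL, STANDBY_SNAPSHOT }

    /** Stand-in for the log: rejects the first N appends to simulate back-pressure. */
    static final class FakeLog
    {
        private int rejectionsRemaining;
        private int appendedActions;

        FakeLog(final int rejectionsRemaining)
        {
            this.rejectionsRemaining = rejectionsRemaining;
        }

        boolean tryAppend()
        {
            if (rejectionsRemaining > 0)
            {
                rejectionsRemaining--;
                return false; // back-pressured, nothing was appended
            }

            appendedActions++;
            return true;
        }

        int appendedActions()
        {
            return appendedActions;
        }
    }

    private ToggleState toggle = ToggleState.NEUTRAL;

    void requestStandbySnapshot()
    {
        toggle = ToggleState.STANDBY_SNAPSHOT;
    }

    ToggleState toggle()
    {
        return toggle;
    }

    /** Previous behaviour (assumed): one append attempt, toggle reset regardless of the result. */
    void pollDroppingOnBackPressure(final FakeLog log)
    {
        if (ToggleState.STANDBY_SNAPSHOT == toggle)
        {
            log.tryAppend(); // result ignored, so the signal can be lost
            toggle = ToggleState.NEUTRAL;
        }
    }

    /** New behaviour (assumed): toggle is reset only once the action has been appended. */
    void pollRetryingUntilAppended(final FakeLog log)
    {
        if (ToggleState.STANDBY_SNAPSHOT == toggle && log.tryAppend())
        {
            toggle = ToggleState.NEUTRAL;
        }
    }
}
```

In this model, a caller that watches the toggle return to `NEUTRAL` can only conclude that the action is in the log when the second method is used; with the first, the toggle resets even though `appendedActions()` may still be zero.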

@mikeb01 suspected that the previous behaviour was chosen because tests were getting stuck in the STANDBY_SNAPSHOT toggle state. I noticed something similar when writing DRR tests [0], but in that case it was due to a deadlock-like situation. I have run the Cluster Standby tests several times with this change, and there were no failures.


0: In my case, the root cause was that the default test clustered service keeps trying to echo messages on egress indefinitely when it experiences back-pressure. This back-pressure propagated to appending a snapshot action to the log. A test waited for the snapshot control toggle to complete but did not poll the client egress, so it got stuck in a deadlock-like scenario.
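
The deadlock-like scenario in this footnote can be modelled with a small single-threaded simulation. This is only a sketch under simplifying assumptions: `EgressPollingSketch` and all of its members are hypothetical, the egress is reduced to a bounded counter, and the way back-pressure propagates inside a real cluster is not modelled beyond "the snapshot action cannot be appended while echoes are still pending".

```java
// Hypothetical model of the footnote's scenario; not Aeron test code.
final class EgressPollingSketch
{
    private final int egressCapacity;
    private int egressQueued;
    private int pendingEchoes;
    private boolean snapshotRequested;

    EgressPollingSketch(final int egressCapacity, final int pendingEchoes)
    {
        this.egressCapacity = egressCapacity;
        this.pendingEchoes = pendingEchoes;
    }

    void requestSnapshot()
    {
        snapshotRequested = true;
    }

    /** The client drains everything currently queued on egress. */
    void pollEgress()
    {
        egressQueued = 0;
    }

    /** One unit of cluster work: echo a message if possible, else append the snapshot action. */
    private void clusterTick()
    {
        if (pendingEchoes > 0)
        {
            if (egressQueued < egressCapacity)
            {
                egressQueued++;
                pendingEchoes--;
            }
            // While echoes are pending, the snapshot action is held back (assumed propagation).
            return;
        }

        snapshotRequested = false; // action appended, toggle completes
    }

    /** Returns true if the toggle completed within maxTicks, false if the wait gave up. */
    boolean awaitSnapshotToggle(final boolean pollEgressWhileWaiting, final int maxTicks)
    {
        for (int i = 0; i < maxTicks; i++)
        {
            if (pollEgressWhileWaiting)
            {
                pollEgress();
            }

            clusterTick();

            if (!snapshotRequested)
            {
                return true;
            }
        }

        return false;
    }
}
```

With an egress capacity of 2 and 5 pending echoes, waiting without polling never completes in this model, because the egress fills up and stays full. Polling the egress inside the wait loop lets the echoes drain so the snapshot action can be appended.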