real-logic / aeron

Efficient reliable UDP unicast, UDP multicast, and IPC message transport
Apache License 2.0

Cluster cannot start if some nodes lost their files and log has been truncated in remaining nodes #1533

Closed pcdv closed 11 months ago

pcdv commented 11 months ago

Let's consider a cluster with 3 nodes and the following scenario:

The cluster correctly restarts if the log hasn't been truncated, i.e. the nodes rebuild their missing state from the leader.

But if before stopping the cluster, a snapshot is generated and the log is truncated to the position of the snapshot, then the cluster fails to start with the following error (in a loop):

io.aeron.archive.client.ArchiveException: ERROR - response for correlationId=4406, error: requested replay start position=0 is less than recording start position=1048576 for recording 0
    at io.aeron.archive.client.AeronArchive.pollForResponse(AeronArchive.java:2301)
    at io.aeron.archive.client.AeronArchive.startReplay(AeronArchive.java:1023)
    at io.aeron.cluster.LogReplay.<init>(LogReplay.java:51)
    at io.aeron.cluster.ConsensusModuleAgent.newLogReplay(ConsensusModuleAgent.java:1581)
    at io.aeron.cluster.Election.followerReplay(Election.java:915)
    at io.aeron.cluster.Election.doWork(Election.java:211)
    at io.aeron.cluster.ConsensusModuleAgent.doWork(ConsensusModuleAgent.java:343)
    at org.agrona.concurrent.AgentRunner.doWork(AgentRunner.java:304)
    at org.agrona.concurrent.AgentRunner.workLoop(AgentRunner.java:296)
    at org.agrona.concurrent.AgentRunner.run(AgentRunner.java:162)
    at java.lang.Thread.run(Thread.java:748)
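
The error matches what the purge does to the archive: once the segments preceding the snapshot are removed, recording 0 no longer starts at position 0, so a replay requested from position 0 is rejected. This can be checked directly against a node's archive with something like the sketch below (the default AeronArchive.Context is only a placeholder for the node's real control channel settings, and getStartPosition is assumed to be available in the archive client version in use):

    import io.aeron.archive.client.AeronArchive;

    public final class CheckLogRecordingStart
    {
        public static void main(final String[] args)
        {
            // Connect to the node's archive; adjust the Context for the actual deployment.
            try (AeronArchive archive = AeronArchive.connect(new AeronArchive.Context()))
            {
                final long recordingId = 0; // the cluster log recording in this repro
                final long startPosition = archive.getStartPosition(recordingId);

                // After purging to the snapshot this reports 1048576 rather than 0, which is
                // why the follower's replay from position 0 fails during the election.
                System.out.println("recording " + recordingId + " starts at " + startPosition);
            }
        }
    }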

This is reproduced by adding the following test to io.aeron.cluster.StartFromTruncatedRecordingLogTest:

    @Test
    @InterruptAfter(30)
    void shouldBeAbleToStartClusterFromTruncatedRecordingLogAndEmptyFollowers()
    {
        cluster = aCluster().withStaticNodes(3).withSegmentFileLength(512 * 1024).start();
        systemTestWatcher.cluster(cluster);

        final TestNode leader = cluster.awaitLeader();
        cluster.connectClient();

        final int messageCount = 1024;
        cluster.sendLargeMessages(messageCount);
        cluster.awaitResponseMessageCount(messageCount);
        cluster.awaitServicesMessageCount(messageCount);

        cluster.takeSnapshot(leader);
        cluster.awaitSnapshotCount(1);
        cluster.purgeLogToLastSnapshot(); // test passes without this

        final int leaderMemberId = leader.index();
        final int followerMemberIdA = cluster.followers().get(0).index();
        final int followerMemberIdB = cluster.followers().get(1).index();

        cluster.stopAllNodes();
        cluster.startStaticNode(leaderMemberId, false);
        cluster.startStaticNode(followerMemberIdA, true);
        cluster.startStaticNode(followerMemberIdB, true);

        assertClusterIsOperational();
    }
mjpt777 commented 11 months ago

This is deliberate corruption of the state and not supported. If this happens you would need to recover from backup or use our warm standby premium feature.

pcdv commented 11 months ago

Not sure I would call that corruption since the state fully exists in one node: it just needs to be replicated to the other nodes that don't have any state at all. It feels like not much is missing to make it work (but I don't know the cluster internals well enough to be sure).

you would need to recover from backup

Actually, this problem occurred while recovering from backup, but it boils down to the test above.

The initial scenario was:

This scenario works only if the ingress log is never truncated.

An alternative would be to start 3 ClusterBackup nodes, but it seems overkill to replicate the state 3 times.
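
For context, a backup node can be launched along these lines (a minimal sketch only; the directory names are placeholders, and the endpoint configuration needed to reach the live cluster is omitted because the exact Context setters vary between Aeron versions):

    import io.aeron.cluster.ClusterBackup;
    import org.agrona.concurrent.ShutdownSignalBarrier;

    public final class BackupNodeMain
    {
        public static void main(final String[] args)
        {
            // Minimal sketch: one ClusterBackup instance pulling the cluster's log and
            // snapshots into its own directories (paths are placeholders; endpoint
            // configuration for reaching the live cluster is required but omitted here).
            try (ClusterBackup backup = ClusterBackup.launch(
                new ClusterBackup.Context()
                    .clusterDirectoryName("backup/cluster")
                    .aeronDirectoryName("backup/aeron")))
            {
                new ShutdownSignalBarrier().await(); // run until the process is signalled
            }
        }
    }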

Is ClusterBackup actually a viable solution, or do only the premium features allow a reliable backup?

mikeb01 commented 11 months ago

Not sure I would call that corruption since the state fully exists in one node

This is where the example breaks down. Raft and other similar consensus algorithms that handle fail-stop type faults can only handle 1 failure in a 3 node cluster. You have a scenario where the 3 node cluster has 2 failed nodes. This is beyond what the algorithm has the ability to correctly and automatically recover from. Therefore you would need to fall back to manually fixing the system.
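
To make the arithmetic concrete (generic quorum maths, not Aeron-specific code):

    // Quorum arithmetic for a fail-stop consensus algorithm such as Raft.
    public final class QuorumMath
    {
        public static void main(final String[] args)
        {
            final int clusterSize = 3;
            final int majority = (clusterSize / 2) + 1;           // votes needed to elect a leader / commit
            final int toleratedFailures = clusterSize - majority; // members that may fail or lose state

            // clusterSize=3 -> majority=2, toleratedFailures=1. Two members without any
            // state therefore exceed what the algorithm can recover from automatically.
            System.out.println("majority=" + majority + ", toleratedFailures=" + toleratedFailures);
        }
    }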

An alternative would be to start 3 ClusterBackup nodes, but it seems overkill to replicate the state 3 times.

The premium Cluster Standby would create 3 replicated copies of the data for scenarios where the user wants another cluster that they can fail over to. It has some functionality to support daisy-chain style replication to reduce load on the primary cluster and potential WAN bandwidth consumption. However, we would not see having 3 replicated copies of the state as overkill.

pcdv commented 11 months ago

Thank you for your detailed answer.

This is beyond what the algorithm has the ability to correctly and automatically recover from.

In my initial tests with ClusterBackup, this scenario worked perfectly (until I truncated the log). My mistake was probably to assume it was a supported use case. There is not much literature around ClusterBackup.

You have a scenario where the 3 node cluster has 2 failed nodes.

Actually, if I modify the test to have only one failed node (i.e. replace one true with false), it fails just the same.

Therefore you would need to fall back to manually fixing the system.

Yes, I could detect the absence of the archive + cluster dirs in one node, wait a bit for the other nodes to be ready to start, and automate the download of the state from another node before starting the cluster. Not trivial, but doable. Probably easier to update the failover procedure so that data is copied manually into the extra nodes :)
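
Something along these lines is what I have in mind (purely illustrative; the paths and both helper methods are hypothetical placeholders for real implementations):

    import java.io.File;

    // Hypothetical pre-start hook: if this node has lost its state, copy it from a
    // healthy peer before launching the cluster node.
    public final class RecoverStateBeforeStart
    {
        public static void main(final String[] args) throws Exception
        {
            final File archiveDir = new File("aeron-archive");  // placeholder path
            final File clusterDir = new File("aeron-cluster");  // placeholder path

            if (!archiveDir.exists() || !clusterDir.exists())
            {
                Thread.sleep(5_000);                        // crude "wait for peers" stand-in
                copyStateFromPeer(archiveDir, clusterDir);  // hypothetical helper
            }

            // ...then start the media driver, archive, consensus module and services as usual.
        }

        private static void copyStateFromPeer(final File archiveDir, final File clusterDir)
        {
            // hypothetical: rsync/scp the archive and cluster directories from a node
            // that still has them, before this node joins the cluster.
        }
    }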

we would not see having 3 replicated copies of the state as overkill.

Agree. I only meant that transmitting the same data 3 times over the network would not be optimal.

The premium Cluster Standby sure looks interesting!

mjpt777 commented 11 months ago

As @mikeb01 has pointed out, this goes beyond the Raft algorithm. The reason the purge causes issues is that the leader must be log complete under the spec. Also consider that without the purge the other nodes have to recover the whole log. For a long-running system this would not be practical as an alternative, even if it works in a simple test.

Snapshots are an optimisation, but again there is very little explanation of how they are implemented in the Raft paper or PhD thesis. Purging an old log needs to be coordinated, and you need to know the other nodes are up to date, so you cannot introduce more than one failure in a 3 node system, as Mike points out.
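
As a conceptual illustration of that coordination constraint (a sketch only, not how Aeron implements it):

    import java.util.stream.LongStream;

    // Conceptual sketch: the log may only be purged up to a position that every
    // member is known to have safely captured (e.g. in an acknowledged snapshot).
    public final class SafePurgePosition
    {
        public static void main(final String[] args)
        {
            final long[] memberSnapshotPositions = { 1_048_576L, 1_048_576L, 524_288L };

            final long safePurgePosition =
                LongStream.of(memberSnapshotPositions).min().orElse(0L);

            // Purging beyond this position can leave a lagging member, or a member
            // that has lost its state, without the log it still needs to replay.
            System.out.println("log may be purged up to position " + safePurgePosition);
        }
    }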

You can use a combination of Cluster Backup and some scripting to make this work. We have to make a living, so we provide commercial support and a premium offering that makes this much easier. Many open core offerings do not even provide basic replication or fault tolerance in the open offering. We think we have gone pretty far with what we offer openly, given the years of engineering effort that has gone into Aeron.