real-logic / aeron

Efficient reliable UDP unicast, UDP multicast, and IPC message transport
https://aeron.io
Apache License 2.0

[cluster] A log replication exception results in unlimited election retries #1657

Open WorkingChen opened 2 months ago

WorkingChen commented 2 months ago

If log replication fails, or another exception causes the election to fail, the current code resets the Election state to INIT. If the underlying exception cannot be resolved, the election fails over and over, which affects the normal operation of the entire cluster.

Would it make sense to add a delay when an exception occurs during an election, to avoid frequent elections in a short period?
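A minimal sketch of the kind of delay being suggested, assuming a capped exponential backoff between failed election attempts. `ElectionRetryBackoff` and its methods are hypothetical names for illustration; they are not part of Aeron's `Election` class or any Aeron API, and the delay values are placeholders.

```java
/**
 * Hypothetical helper (not an Aeron class): tracks how long to wait before
 * the next election attempt after a failure, doubling the delay each time
 * up to a fixed maximum.
 */
final class ElectionRetryBackoff
{
    private final long minDelayNs;
    private final long maxDelayNs;
    private long nextDelayNs;
    private long deadlineNs;
    private boolean isWaiting;

    ElectionRetryBackoff(final long minDelayNs, final long maxDelayNs)
    {
        if (minDelayNs <= 0 || maxDelayNs < minDelayNs)
        {
            throw new IllegalArgumentException("require 0 < minDelayNs <= maxDelayNs");
        }

        this.minDelayNs = minDelayNs;
        this.maxDelayNs = maxDelayNs;
        this.nextDelayNs = minDelayNs;
    }

    /** Record a failed election at nowNs and return the delay that now applies. */
    long onFailure(final long nowNs)
    {
        final long delayNs = nextDelayNs;
        deadlineNs = nowNs + delayNs;
        isWaiting = true;

        // Double the delay for the next failure without overflowing or exceeding the cap.
        nextDelayNs = delayNs > maxDelayNs / 2 ? maxDelayNs : delayNs * 2;

        return delayNs;
    }

    /** True if no backoff is pending or the pending delay has elapsed. */
    boolean canAttempt(final long nowNs)
    {
        // Subtraction keeps the comparison valid for wrapping clocks such as System.nanoTime().
        return !isWaiting || nowNs - deadlineNs >= 0;
    }

    /** Reset the backoff once an election completes successfully. */
    void onSuccess()
    {
        nextDelayNs = minDelayNs;
        isWaiting = false;
    }
}
```

With something like this, the code path that resets the election to INIT after an exception would call `onFailure(nowNs)`, and the election would only leave INIT again once `canAttempt(nowNs)` returns true. An unresolvable error would then produce a retry every `maxDelayNs` at most instead of back-to-back elections.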

exception cluster node log:

io.aeron.archive.client.ArchiveException: ERROR - requested replay start position=214368015799296 is less than recording start position=214673337090048 for recording 0
    at io.aeron.archive.ReplicationSession.hasResponse(ReplicationSession.java:742) ~[aeron-archive-1.44.1.jar!/:1.44.1]
    at io.aeron.archive.ReplicationSession.replay(ReplicationSession.java:576) ~[aeron-archive-1.44.1.jar!/:1.44.1]
    at io.aeron.archive.ReplicationSession.doWork(ReplicationSession.java:220) ~[aeron-archive-1.44.1.jar!/:1.44.1]
    at io.aeron.archive.SessionWorker.doWork(SessionWorker.java:64) ~[aeron-archive-1.44.1.jar!/:1.44.1]
    at io.aeron.archive.ArchiveConductor.doWork(ArchiveConductor.java:303) ~[aeron-archive-1.44.1.jar!/:1.44.1]
    at io.aeron.archive.DedicatedModeArchiveConductor.doWork(DedicatedModeArchiveConductor.java:58) ~[aeron-archive-1.44.1.jar!/:1.44.1]
    at org.agrona.concurrent.AgentRunner.doWork(AgentRunner.java:304) ~[agrona-1.21.1.jar!/:1.21.1]
    at org.agrona.concurrent.AgentRunner.workLoop(AgentRunner.java:296) ~[agrona-1.21.1.jar!/:1.21.1]
    at org.agrona.concurrent.AgentRunner.run(AgentRunner.java:162) ~[agrona-1.21.1.jar!/:1.21.1]
    at java.base/java.lang.Thread.run(Thread.java:898) [?:?]

leader node log:

io.aeron.exceptions.AeronException: ERROR - Driver events adapter is invalid
    at io.aeron.ClientConductor.service(ClientConductor.java:1368) ~[aeron-client-1.44.1.jar!/:1.44.1]
    at io.aeron.ClientConductor.doWork(ClientConductor.java:196) ~[aeron-client-1.44.1.jar!/:1.44.1]
    at org.agrona.concurrent.AgentInvoker.invoke(AgentInvoker.java:147) ~[agrona-1.21.1.jar!/:1.21.1]
    at io.aeron.cluster.ConsensusModuleAgent.slowTickWork(ConsensusModuleAgent.java:2114) ~[aeron-cluster-1.44.1.jar!/:1.44.1]
    at io.aeron.cluster.ConsensusModuleAgent.doWork(ConsensusModuleAgent.java:346) ~[aeron-cluster-1.44.1.jar!/:1.44.1]
    at org.agrona.concurrent.AgentRunner.doWork(AgentRunner.java:304) ~[agrona-1.21.1.jar!/:1.21.1]
    at org.agrona.concurrent.AgentRunner.workLoop(AgentRunner.java:296) ~[agrona-1.21.1.jar!/:1.21.1]
    at org.agrona.concurrent.AgentRunner.run(AgentRunner.java:162) ~[agrona-1.21.1.jar!/:1.21.1]
    at java.base/java.lang.Thread.run(Thread.java:898) [?:?]
Caused by: java.lang.IllegalStateException: unable to keep up with broadcast
    at org.agrona.concurrent.broadcast.CopyBroadcastReceiver.receive(CopyBroadcastReceiver.java:97) ~[agrona-1.21.1.jar!/:1.21.1]
    at io.aeron.DriverEventsAdapter.receive(DriverEventsAdapter.java:68) ~[aeron-client-1.44.1.jar!/:1.44.1]
    at io.aeron.ClientConductor.service(ClientConductor.java:1349) ~[aeron-client-1.44.1.jar!/:1.44.1]
    ... 8 more