If log replication or another exception causes an election to fail, the current code resets the Election state to INIT. If the underlying exception cannot be resolved, the election fails again and again, which disrupts the normal operation of the entire cluster.
Could a delay be added when an exception occurs during an election, to avoid repeated elections within a short period?
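To make the suggestion concrete, here is a minimal sketch of a capped exponential backoff between election attempts. It is not Aeron code: `ElectionRetryBackoff`, its methods, and the delay bounds are hypothetical names chosen for illustration, and it assumes the caller passes in a monotonic nanosecond timestamp on each call.

```java
/**
 * Hypothetical helper (not part of Aeron): tracks how long to wait before the
 * next election attempt after a failure, doubling the delay up to a cap.
 */
final class ElectionRetryBackoff
{
    private final long minDelayNs;
    private final long maxDelayNs;
    private long nextDelayNs;
    private long deadlineNs;
    private boolean waiting;

    ElectionRetryBackoff(final long minDelayNs, final long maxDelayNs)
    {
        if (minDelayNs <= 0 || maxDelayNs < minDelayNs)
        {
            throw new IllegalArgumentException("require 0 < minDelayNs <= maxDelayNs");
        }

        this.minDelayNs = minDelayNs;
        this.maxDelayNs = maxDelayNs;
        this.nextDelayNs = minDelayNs;
    }

    /** Record a failed election at nowNs and return the delay that now applies. */
    long onFailure(final long nowNs)
    {
        final long delayNs = nextDelayNs;
        deadlineNs = nowNs + delayNs;
        waiting = true;

        // Double the delay for the next failure, capped at maxDelayNs without overflowing.
        nextDelayNs = delayNs > maxDelayNs / 2 ? maxDelayNs : delayNs * 2;

        return delayNs;
    }

    /** True when no delay is pending or the pending delay has elapsed. */
    boolean canStartElection(final long nowNs)
    {
        // Subtraction keeps the comparison valid if the nanosecond clock wraps.
        return !waiting || nowNs - deadlineNs >= 0;
    }

    /** Reset after a successful election so the next failure starts at the minimum delay. */
    void onSuccess()
    {
        nextDelayNs = minDelayNs;
        waiting = false;
    }
}
```

With something along these lines, the code path that resets the election to INIT after an exception could call `onFailure(nowNs)` and only start a new election once `canStartElection(nowNs)` returns true. A node stuck on an unrecoverable error would then retry at the capped interval instead of continuously.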
exception cluster node log
io.aeron.archive.client.ArchiveException: ERROR - requested replay start position=214368015799296 is less than recording start position=214673337090048 for recording 0
    at io.aeron.archive.ReplicationSession.hasResponse(ReplicationSession.java:742) ~[aeron-archive-1.44.1.jar!/:1.44.1]
    at io.aeron.archive.ReplicationSession.replay(ReplicationSession.java:576) ~[aeron-archive-1.44.1.jar!/:1.44.1]
    at io.aeron.archive.ReplicationSession.doWork(ReplicationSession.java:220) ~[aeron-archive-1.44.1.jar!/:1.44.1]
    at io.aeron.archive.SessionWorker.doWork(SessionWorker.java:64) ~[aeron-archive-1.44.1.jar!/:1.44.1]
    at io.aeron.archive.ArchiveConductor.doWork(ArchiveConductor.java:303) ~[aeron-archive-1.44.1.jar!/:1.44.1]
    at io.aeron.archive.DedicatedModeArchiveConductor.doWork(DedicatedModeArchiveConductor.java:58) ~[aeron-archive-1.44.1.jar!/:1.44.1]
    at org.agrona.concurrent.AgentRunner.doWork(AgentRunner.java:304) ~[agrona-1.21.1.jar!/:1.21.1]
    at org.agrona.concurrent.AgentRunner.workLoop(AgentRunner.java:296) ~[agrona-1.21.1.jar!/:1.21.1]
    at org.agrona.concurrent.AgentRunner.run(AgentRunner.java:162) ~[agrona-1.21.1.jar!/:1.21.1]
    at java.base/java.lang.Thread.run(Thread.java:898) [?:?]
leader node log
io.aeron.exceptions.AeronException: ERROR - Driver events adapter is invalid
    at io.aeron.ClientConductor.service(ClientConductor.java:1368) ~[aeron-client-1.44.1.jar!/:1.44.1]
    at io.aeron.ClientConductor.doWork(ClientConductor.java:196) ~[aeron-client-1.44.1.jar!/:1.44.1]
    at org.agrona.concurrent.AgentInvoker.invoke(AgentInvoker.java:147) ~[agrona-1.21.1.jar!/:1.21.1]
    at io.aeron.cluster.ConsensusModuleAgent.slowTickWork(ConsensusModuleAgent.java:2114) ~[aeron-cluster-1.44.1.jar!/:1.44.1]
    at io.aeron.cluster.ConsensusModuleAgent.doWork(ConsensusModuleAgent.java:346) ~[aeron-cluster-1.44.1.jar!/:1.44.1]
    at org.agrona.concurrent.AgentRunner.doWork(AgentRunner.java:304) ~[agrona-1.21.1.jar!/:1.21.1]
    at org.agrona.concurrent.AgentRunner.workLoop(AgentRunner.java:296) ~[agrona-1.21.1.jar!/:1.21.1]
    at org.agrona.concurrent.AgentRunner.run(AgentRunner.java:162) ~[agrona-1.21.1.jar!/:1.21.1]
    at java.base/java.lang.Thread.run(Thread.java:898) [?:?]
Caused by: java.lang.IllegalStateException: unable to keep up with broadcast
    at org.agrona.concurrent.broadcast.CopyBroadcastReceiver.receive(CopyBroadcastReceiver.java:97) ~[agrona-1.21.1.jar!/:1.21.1]
    at io.aeron.DriverEventsAdapter.receive(DriverEventsAdapter.java:68) ~[aeron-client-1.44.1.jar!/:1.44.1]
    at io.aeron.ClientConductor.service(ClientConductor.java:1349) ~[aeron-client-1.44.1.jar!/:1.44.1]
    ... 8 more