real-logic / aeron

Efficient reliable UDP unicast, UDP multicast, and IPC message transport
Apache License 2.0
7.37k stars 888 forks source link

Exception setting cluster time #576

Closed epickrram closed 5 years ago

epickrram commented 5 years ago

Not easily reproducible I'm afraid, but looks as though a schedule timer message was received on the ConsensusModuleAdapter before the time was set by an incoming application/lifecycle message.

java.lang.IllegalStateException: can not set startTime with active timers
  at org.agrona.DeadlineTimerWheel.resetStartTime(DeadlineTimerWheel.java:190)
  at io.aeron.cluster.TimerService.resetStartTime(TimerService.java:66)
  at io.aeron.cluster.ConsensusModuleAgent.clusterTimeMs(ConsensusModuleAgent.java:2349)
  at io.aeron.cluster.ConsensusModuleAgent.doWork(ConsensusModuleAgent.java:261)
  at org.agrona.concurrent.AgentRunner.doDutyCycle(AgentRunner.java:268)
  at org.agrona.concurrent.AgentRunner.run(AgentRunner.java:161)
  at java.lang.Thread.run(Thread.java:748)

I came across this by chance, so not really sure of what set of steps got the system into this state.

epickrram commented 5 years ago

Could this be a problem with a Timer being hydrated from the snapshot:

https://github.com/real-logic/aeron/blob/master/aeron-cluster/src/main/java/io/aeron/cluster/ConsensusModuleSnapshotLoader.java#L122

Presumably the snapshot needs to be loaded before replay can begin, and it seems as though the replay messages/ConsensusModuleAgent duty cycle are responsible for setting cluster time. If that's the case, then the timer will be added to the TimerService before its current time has been set.

epickrram commented 5 years ago

@JPWatson has pointed out that we're scheduling a timer in ClusteredService.onStart, which I guess isn't the correct time. Will fix that up and keep an eye on the logs.