16:52:09 - The ZK session has expired CuratorEventImpl
16:52:09 - Curator connection state changed to LOST and Runtime halter is called probably on an unrecoverable error. Stopping the VM.
The bug: when ZK is already in a bad state, the Astra close protocol should not attempt a ZK fetch, as it does today. The ZK operation fails, and the resulting exception ends up calling the Runtime halter again.
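A minimal sketch of the kind of guard that could avoid this, assuming a flag that is cleared when the connection state goes to LOST. The class and method names here (`GuardedCloser`, `onConnectionLost`, `requestHalt`) are hypothetical and are not Astra's or Curator's actual API; the counters stand in for the real ZK read and the real VM halt.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not Astra's actual close protocol.
class GuardedCloser {
    // Cleared when the ZK connection is reported LOST (assumed to be wired
    // to a connection-state listener in the real system).
    private final AtomicBoolean zkUsable = new AtomicBoolean(true);
    // Ensures the halt path runs at most once, even if close() fails.
    private final AtomicBoolean haltRequested = new AtomicBoolean(false);

    // Counters standing in for the real side effects.
    final AtomicInteger zkFetches = new AtomicInteger();
    final AtomicInteger halts = new AtomicInteger();

    void onConnectionLost() {
        zkUsable.set(false);
        requestHalt();
    }

    void requestHalt() {
        // Only the first caller proceeds; repeated calls are ignored.
        if (haltRequested.compareAndSet(false, true)) {
            halts.incrementAndGet(); // stand-in for stopping the VM
            close();
        }
    }

    void close() {
        // Skip ZK reads entirely when the session is known to be unusable.
        if (zkUsable.get()) {
            zkFetches.incrementAndGet(); // stand-in for a ZK fetch
        }
        // Local, non-ZK cleanup would go here.
    }
}
```

With a guard like this, a second halt request during shutdown becomes a no-op, and close never touches ZK after the session is lost.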
As a side note, in the 20 seconds after this happens we see 450+ log messages with a CuratorCache error like this:
java.lang.IllegalStateException: Expected state [STARTED] was [STOPPED]
at org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:821)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkState(CuratorFrameworkImpl.java:457)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.getData(CuratorFrameworkImpl.java:491)
at org.apache.curator.framework.recipes.cache.CuratorCacheImpl.nodeChanged(CuratorCacheImpl.java:266)
To Reproduce
I noticed this in our production cluster