slackhq / astra

Astra is a structured log search and analytics engine developed by Slack and Salesforce
https://slackhq.github.io/astra/
MIT License

ZK session expiry shuts down Astra. Don't make ZK updates in the shutdown path #653

Open vthacker opened 1 year ago

vthacker commented 1 year ago

To Reproduce

I noticed this in our production cluster:

16:52:09 - The ZK session has expired CuratorEventImpl
16:52:09 - Curator connection state changed to LOST and Runtime halter is called probably on an unrecoverable error. Stopping the VM.

The bug: when ZK is already in a bad state, the Astra close protocol should not attempt a ZK fetch, as it does today. The ZK operation fails, and the resulting exception ends up calling the runtime halter again.
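To illustrate the idea, here is a minimal sketch of a close path that skips ZK calls once the session is known to be lost and makes a second close a no-op. It is not Astra's code: `MetadataStore`, `GuardedCloser`, `onConnectionLost` and the event strings are hypothetical names, and the Curator client is replaced by a plain interface so the example stands on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

public class ShutdownGuardSketch {

    /** Hypothetical stand-in for a ZK-backed metadata store. */
    interface MetadataStore {
        void deregister() throws Exception;
    }

    /** Hypothetical closer that avoids ZK work when the session is gone. */
    static final class GuardedCloser {
        private final AtomicBoolean zkUsable = new AtomicBoolean(true);
        private final AtomicBoolean closing = new AtomicBoolean(false);
        private final MetadataStore store;

        // Records what the close path did; not thread-safe, for illustration only.
        final List<String> events = new ArrayList<>();

        GuardedCloser(MetadataStore store) {
            this.store = store;
        }

        /** Assumed to be called when the connection state changes to LOST. */
        void onConnectionLost() {
            zkUsable.set(false);
            close();
        }

        void close() {
            // Only the first caller runs the close path; re-entry is a no-op.
            if (!closing.compareAndSet(false, true)) {
                events.add("close-skipped");
                return;
            }
            if (zkUsable.get()) {
                try {
                    store.deregister();
                    events.add("deregistered");
                } catch (Exception e) {
                    // A failed ZK call must not trigger another halt.
                    events.add("deregister-failed");
                }
            } else {
                events.add("zk-skipped");
            }
            // Local cleanup that does not depend on ZK goes here.
            events.add("local-closed");
        }
    }

    public static void main(String[] args) {
        // Simulate a store whose client has already stopped.
        GuardedCloser closer = new GuardedCloser(() -> {
            throw new IllegalStateException("Expected state [STARTED] was [STOPPED]");
        });
        closer.onConnectionLost();
        closer.close(); // second call is ignored
        System.out.println(closer.events);
    }
}
```

The two flags are kept separate on purpose: one records whether ZK is still worth talking to, the other ensures the halt path cannot re-enter itself when a late exception surfaces.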

As a side note, in the 20 seconds after this happens we see 450+ log messages with a CuratorCache error like this:

java.lang.IllegalStateException: Expected state [STARTED] was [STOPPED]
    at org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:821)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkState(CuratorFrameworkImpl.java:457)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.getData(CuratorFrameworkImpl.java:491)
    at org.apache.curator.framework.recipes.cache.CuratorCacheImpl.nodeChanged(CuratorCacheImpl.java:266)

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 30 days.