stacks-network / stacks-core

The Stacks blockchain implementation
https://docs.stacks.co
GNU General Public License v3.0

Stalled burnchain download on a valid block #4180

Open wileyj opened 8 months ago

wileyj commented 8 months ago
WARN [1702606757.464505] [src/chainstate/coordinator/mod.rs:2211] [chains-coordinator-0.0.0.0:20443] ChainsCoordinator: could not retrieve  block burnhash=0000000000000000000132c46480ad55396584035a721ce428d88d5bd4223642
WARN [1702606757.464549] [src/chainstate/coordinator/mod.rs:434] [chains-coordinator-0.0.0.0:20443] Error processing new burn block: NonContiguousBurnchainBlock(UnknownBlock(0000000000000000000132c46480ad55396584035a721ce428d88d5bd4223642))

these messages scroll until the node is shut down. block in question: https://mempool.space/block/0000000000000000000132c46480ad55396584035a721ce428d88d5bd4223642

the burnchain host is at chain tip, and a second stacks node using the same burnchain host is at the stacks chain tip.

suspicions: this instance is running as a k8s pod, and i strongly suspect the pod was moved to another VM as part of the k8s scheduler rebalancing a VM's workload. i also don't think the container is being shut down gracefully (essentially a kill -9 when the pod is moved).
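If the move really was scheduler-driven rebalancing, a PodDisruptionBudget is one way to block those voluntary evictions. A minimal sketch, assuming the node runs as a single replica with a hypothetical `app: stacks-node` label (this won't help with node failures or spot reclamations, only with drain/rebalance-style moves):

```yaml
# Sketch only: blocks eviction-API-driven moves (drains, descheduler rebalancing)
# for a single-replica stacks node. The label selector is an assumption.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: stacks-node-pdb
spec:
  maxUnavailable: 0          # never allow a voluntary eviction of the only replica
  selector:
    matchLabels:
      app: stacks-node       # hypothetical label; match it to the actual pod labels
```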

this seems related to: https://github.com/stacks-network/stacks-core/pull/2773 and possibly: https://github.com/stacks-network/stacks-core/issues/1933

cc @CharlieC3 as the k8s expert

wileyj commented 8 months ago

in the meantime, i've recommended they look into Hiro's helm chart to try to mitigate this.

CharlieC3 commented 8 months ago

I've seen this issue in the past, albeit infrequently. When it happened, it was usually with a Stacks node configured to emit events to a Stacks Blockchain API, where one or both of those services underwent a restart. It's not consistent, and I don't recall whether it was resolved by an additional restart or by restoring the chainstate from a backup.

Either way, your suspicion that this could result from a non-graceful shutdown is very plausible; it's important to configure a reasonable [terminationGracePeriodSeconds](https://github.com/hirosystems/charts/blob/main/hirosystems/stacks-blockchain/values.yaml#L355) for the pod if the default (30 seconds) is not enough.
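As a minimal sketch of what that looks like at the pod-template level (the 600-second value and container name below are placeholders, not chart defaults; the linked values.yaml exposes the equivalent setting):

```yaml
# Pod template fragment: kubelet sends SIGTERM, waits this long, then SIGKILLs.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 600   # default is 30s; placeholder value
      containers:
        - name: stacks-blockchain          # placeholder container name
          # ... image, ports, volumes as in the chart
```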

Also, if the pod is running on a preemptible or spot VM, it may not have enough time to shut down properly when the underlying VM is reclaimed by the cloud provider, in which case it would be force-killed. So it's important that Stacks core nodes run on on-demand VMs when possible.
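One way to enforce that at scheduling time is a node-affinity rule that excludes spot capacity. A sketch, assuming the EKS managed-node-group label convention (`eks.amazonaws.com/capacityType`); other providers use different labels, e.g. `cloud.google.com/gke-spot` on GKE:

```yaml
# Pod template fragment: require the pod to land on on-demand capacity only.
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: eks.amazonaws.com/capacityType   # provider-specific label
                    operator: NotIn
                    values: ["SPOT"]
```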

wileyj commented 8 months ago

thanks @CharlieC3 - not sure there's anything actionable in the source code here; it seems more like a deployment issue. leaving it open until a core eng can chime in.