CosmosFullNode: Rollouts intermittently take down more pods than configured

strangelove-ventures / cosmos-operator

Cosmos Operator is a Kubernetes operator for managing Cosmos nodes

Apache License 2.0

77 stars, 18 forks

CosmosFullNode: Rollouts intermittently take down more pods than configured #339

Closed DavidNix closed 10 months ago

DavidNix commented 1 year ago

It's gotten better, but I still see cases where more than one pod is deleted at once when only one should be deleted at a time.

I think this happens more on sentries where we've disabled readiness probes. But I've also seen it once on a deployment where readiness probes were active.

I have yet to find a way to duplicate the issue reliably.

DavidNix commented 1 year ago

You know what, sometimes I think it's the ScheduledVolumeSnapshot taking down the pod. If there's been a problem for a while, the ScheduledVolumeSnapshot stays pending. Then, as soon as the minimum number of pods is ready, it quickly deletes one to take the snapshot.
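To make that suspicion concrete, here is a minimal Go sketch of how two uncoordinated controllers could each decide, from the same observed ready count, that deleting a pod is safe. This is not the operator's actual code: `rolloutBudget`, `snapshotMayDelete`, `podsDown`, and the `maxUnavailable` / `minAvailable` parameters are hypothetical names chosen for illustration, and the real reconcile logic may differ.

```go
package main

import "fmt"

// rolloutBudget is a hypothetical model of a rolling update: it returns how
// many pods the rollout may delete right now without dropping below
// desired - maxUnavailable ready pods.
func rolloutBudget(desired, ready, maxUnavailable int) int {
	minReady := desired - maxUnavailable
	budget := ready - minReady
	if budget < 0 {
		return 0
	}
	if budget > maxUnavailable {
		budget = maxUnavailable
	}
	return budget
}

// snapshotMayDelete is a hypothetical model of the snapshot check: a pending
// snapshot proceeds (and deletes its candidate pod) once at least
// minAvailable pods are ready.
func snapshotMayDelete(ready, minAvailable int) bool {
	return ready >= minAvailable
}

// podsDown counts the pods taken down in one pass when both controllers act
// on the same observed ready count and neither accounts for the other.
func podsDown(desired, ready, maxUnavailable, minAvailable int, snapshotPending bool) int {
	down := rolloutBudget(desired, ready, maxUnavailable)
	if snapshotPending && snapshotMayDelete(ready, minAvailable) {
		down++
	}
	return down
}

func main() {
	// 3 replicas, all ready, rollout allowed to take down 1 at a time.
	fmt.Println(podsDown(3, 3, 1, 2, false)) // prints 1: rollout alone
	fmt.Println(podsDown(3, 3, 1, 2, true))  // prints 2: rollout plus pending snapshot
}
```

If something like this is what's happening, it would explain why the extra deletion shows up right after pods recover: that's the moment both checks pass at once.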