mozilla-it / sumo-infra

Infrastructure for support.mozilla.org

Monitor for stuck terminating pods #59

Open ziegeer opened 4 years ago

ziegeer commented 4 years ago

As per the 2020-05-18 Sumo Incident Report, we had pods stuck in a terminating state for ~2 days, which isn't right. It's not clear to me how we'd monitor this, since the service was running fine and had the desired number of pods, but let's try to find a way!
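
One possible starting point, offered as a rough sketch rather than a proposed implementation: a pod that is terminating has `metadata.deletionTimestamp` set, so a check could flag any pod whose deletion timestamp is older than some threshold, independent of whether the service has its desired replica count. The function names and the 30-minute default below are assumptions made for illustration, not anything that exists in this repo.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(value):
    """Parse a Kubernetes RFC 3339 timestamp such as '2020-05-16T12:00:00Z'."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def find_stuck_terminating(pod_list, now, threshold=timedelta(minutes=30)):
    """Return (namespace, name, overdue) for pods terminating longer than threshold.

    pod_list is the parsed JSON of a Kubernetes PodList.
    The threshold default is an arbitrary assumption; tune it to the workload.
    """
    stuck = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        deletion = meta.get("deletionTimestamp")
        if not deletion:
            # No deletion timestamp means the pod is not terminating.
            continue
        overdue = now - parse_k8s_time(deletion)
        if overdue > threshold:
            stuck.append((meta.get("namespace", ""), meta.get("name", ""), overdue))
    return stuck
```

The input could come from `kubectl get pods --all-namespaces -o json` parsed with `json.load`, with `now` set to `datetime.now(timezone.utc)`. A non-empty result would be the signal to alert on. How this would be wired into our existing monitoring is left open here.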