Closed: erkolson closed this issue 3 years ago
I haven't seen "stuck" pods during 0.5.x load tests on stage, but I see something somewhat similar.
Part of the challenge here is that the stage cluster scales significantly up and down for load testing, e.g. from 1-2 nodes when idle to 5-6 under load, then back down when the test finishes.
However, the canary node tends to stick around throughout, and a couple of different load tests against 0.5.x show the following:
__lbheartbeat__ or __heartbeat__ (Canary isn't "stuck" here, but it's taking seconds for do-nothing health checks.)
Zooming out a bit you can see the pattern reflected in the Uptime Check:
It's especially easy to see the 4 days of lengthy health checks (the 23rd - 27th), when the cluster was mostly idle in between a few load tests.
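For reference, here's a minimal sketch of how these measurements could be reproduced from outside the cluster: it just times GET requests to the two do-nothing health check endpoints. The base URL is a placeholder, not the actual stage host.

```python
# Rough sketch: time the do-nothing health check endpoints to confirm the
# multi-second responses seen on the canary. BASE_URL is a placeholder.
import time
import urllib.request

BASE_URL = "https://syncstorage.stage.example.com"  # placeholder host

for endpoint in ("/__lbheartbeat__", "/__heartbeat__"):
    start = time.monotonic()
    with urllib.request.urlopen(BASE_URL + endpoint, timeout=30) as resp:
        status = resp.status
        resp.read()
    elapsed = time.monotonic() - start
    print(f"{endpoint}: HTTP {status} in {elapsed:.2f}s")
```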
Thank you, Phil, for the details here!
@erkolson:
Spanner latency spikes have been eliminated by dropping the BatchExpiry index, but the application is still occasionally experiencing latency spikes that break the SLA targets.
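For context, the index drop itself is a single DDL statement. A rough sketch with the google-cloud-spanner Python client follows; the instance and database IDs are placeholders, and the exact index name may differ from how it's written above.

```python
# Rough sketch of the schema change described above (placeholder IDs).
from google.cloud import spanner

client = spanner.Client()
database = client.instance("sync-instance").database("syncstorage")  # placeholders

# DROP INDEX is applied as a long-running schema update operation.
operation = database.update_ddl(["DROP INDEX BatchExpiry"])
operation.result()  # wait for the schema change to finish
```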
One example from overnight 7/18-7/19: Spanner shows normal performance while syncstorage shows outlier latency:
This incident caused 2 of the 9 running pods to max out active connections and get stuck. When pods get into this state, manual intervention (killing them) is required.
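The manual intervention is just deleting the affected pods so the Deployment replaces them. A minimal sketch with the Kubernetes Python client, assuming a placeholder namespace and pod names passed on the command line:

```python
# Rough sketch of the manual intervention: delete the "stuck" pods so the
# Deployment schedules replacements. NAMESPACE is a placeholder.
import sys
from kubernetes import client, config

config.load_kube_config()  # assumes a local kubeconfig
v1 = client.CoreV1Api()

NAMESPACE = "sync"  # placeholder

for pod_name in sys.argv[1:]:
    v1.delete_namespaced_pod(name=pod_name, namespace=NAMESPACE)
    print(f"deleted {pod_name}")
```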
The result is increased latency for request handling until the "stuck" pods are deleted:
I'm seeing a number of errors like these around that time but cannot tell if they are the cause of the problem:
(I removed the identifying information)