Open erkolson opened 4 years ago
Considering #64 a duplicate of this: these 50x spikes on 0.5 are due to the timeout issue described there
https://github.com/mozilla-services/syncstorage-rs/issues/794 seems to have solved this
To elaborate, we were seeing nodes get into these "stuck states" of either not responding at all or taking very long to respond. As described in #64 (and https://github.com/mozilla-services/services-engineering/issues/61#issuecomment-669636997), we even saw timeouts on endpoints that did not check out a db connection.
bb8 has potential connection leaks, and worse, its Drop impl was potentially blocking our event loop, which would explain the timeouts even when no db was involved.
Switching from bb8 to deadpool has fixed the timeouts and the "stuck state".
This was a significantly different issue from the "stuck state" we're seeing on prod under 0.4.x.
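To illustrate why a blocking Drop impl can cause timeouts on endpoints that never touch the db, here is a minimal, std-only sketch. It is not bb8's or deadpool's actual code: `BlockingConn`, `DeferredConn`, `run_loop`, and the two `health_check_wait_*` helpers are hypothetical names, the "event loop" is just a sequential task queue on one thread, and the sleep stands in for whatever slow cleanup a real connection might do. Handing cleanup to a reaper thread is only one general way to keep it off the loop; it is not a claim about how deadpool works.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a pooled connection with slow, synchronous cleanup.
struct BlockingConn {
    cleanup: Duration,
}

impl Drop for BlockingConn {
    fn drop(&mut self) {
        // Runs on whichever thread drops the value, here the loop thread.
        thread::sleep(self.cleanup);
    }
}

/// Hypothetical variant that hands its slow cleanup to another thread.
struct DeferredConn {
    cleanup: Duration,
    reaper: Sender<Duration>,
}

impl Drop for DeferredConn {
    fn drop(&mut self) {
        // Only a channel send happens on the loop thread.
        let _ = self.reaper.send(self.cleanup);
    }
}

/// Runs tasks sequentially on the current thread, like a single-threaded
/// event loop, and returns how long each task waited before it started.
fn run_loop(tasks: Vec<Box<dyn FnOnce()>>) -> Vec<Duration> {
    let start = Instant::now();
    let mut waits = Vec::new();
    for task in tasks {
        waits.push(start.elapsed());
        task();
    }
    waits
}

/// Wait seen by a no-op "health check" queued behind a blocking drop.
fn health_check_wait_blocking(cleanup: Duration) -> Duration {
    let conn = BlockingConn { cleanup };
    let mut tasks: Vec<Box<dyn FnOnce()>> = Vec::new();
    tasks.push(Box::new(move || drop(conn))); // request that releases a connection
    tasks.push(Box::new(|| {})); // request that uses no db at all
    run_loop(tasks)[1]
}

/// Same scenario, but the cleanup is performed on a separate reaper thread.
fn health_check_wait_deferred(cleanup: Duration) -> Duration {
    let (tx, rx) = channel::<Duration>();
    let reaper = thread::spawn(move || {
        // Ends once every Sender has been dropped.
        for d in rx {
            thread::sleep(d);
        }
    });
    let conn = DeferredConn { cleanup, reaper: tx };
    let mut tasks: Vec<Box<dyn FnOnce()>> = Vec::new();
    tasks.push(Box::new(move || drop(conn)));
    tasks.push(Box::new(|| {}));
    let wait = run_loop(tasks)[1];
    reaper.join().unwrap();
    wait
}
```

Calling `health_check_wait_blocking` with some cleanup duration returns a wait at least that long for the no-db task, while `health_check_wait_deferred` with the same duration should return a much shorter wait, since the loop thread only performs a channel send.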
Reopening this, we're seeing similar spikes of 503s due to upstream timeouts on 0.5.8 on production.
Something similar to the "stuck connections" issue we see in production occurred during the 0.5.0 load test. However, because bb8 handles connections differently, it was not readily apparent which pod was "stuck".
Connection pools looked like this:
It appears that one pod (...-nr48p) was unable to use all of the idle connections, was very slow to handle requests, and was returning 503s to clients.
Request handling durations:
5xx rate:
After deleting that one pod, performance returned to normal.