0.5.0 load test anomaly

mozilla-services / services-engineering

Services engineering core repo - Used for issues/docs/etc that don't obviously belong in another repo.

2 stars 1 forks source link

0.5.0 load test anomaly #63

Open erkolson opened 4 years ago

erkolson commented 4 years ago

Something similar to the "stuck connections" issue we see in production occurred during the 0.5.0 load test. Though, due to the different connection handling in bb8, it was not readily apparent which pod was "stuck"

Connection pools looked like this:

It appears that one pod (...-nr48p) was unable to use all of the idle connections, was very slow to handle requests, and was returning 503s to clients.

Request handling durations:

5xx rate:

After deleting that one pod, performance returned to normal.

pjenvey commented 4 years ago

Considering #64 a duplicate of this: these 50x spikes on 0.5 are due to the timeout issue described there

pjenvey commented 4 years ago

https://github.com/mozilla-services/syncstorage-rs/issues/794 seems to have solved this

pjenvey commented 4 years ago

To elaborate, we were seeing nodes get into these "stuck states" of either not responding entirely, or taking very long to respond. As described in #64 (and https://github.com/mozilla-services/services-engineering/issues/61#issuecomment-669636997), we even saw time outs on endpoints that did not checkout a db connection.

bb8 has potential connection leaks, and worse is its Drop impl. was potentially blocking our event loop, explaining time outs even when no db was involved.

Switching to deadpool from bb8 has fixed the timeouts or "stuck state".

This was a significantly different issue from the "stuck state" we're seeing on prod under 0.4.x.

pjenvey commented 4 years ago

Reopening this, we're seeing similar spikes of 503s due to upstream timeouts on 0.5.8 on production.