snarfed opened 1 week ago
The explanation here may be simpler: it may just be that the CPU gets pegged and we're suddenly CPU-bound. Here's the last day or so; note the correlation:
Ugh. Well, the silver lining is that at least that's very understandable and manageable: adding cores and/or optimizing should fix it.
More CPU vs latency correlation. At 1:45p, we bumped the router up from 2 cores to 4, with a single WSGI worker running 200 threads. That didn't seem to work well, maybe because of the GIL and context switching? ...so at 2:15p (I think) I switched it to 4 WSGI workers, one per core, with 50 threads each.
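For reference, the split above (one worker per core, dividing the same overall thread budget) can be sketched like this. This is illustrative, not our actual config; the 200-thread total and the 4-worker × 50-thread split are from the notes above, the helper name is made up:

```python
def worker_threads(cores, total_threads=200):
    """One WSGI worker process per core, splitting a fixed total
    thread budget evenly across them (hypothetical helper)."""
    return cores, max(1, total_threads // cores)

# 4 cores -> 4 workers x 50 threads each, matching the 2:15p change
print(worker_threads(4))  # (4, 50)
```

The idea being that with CPython's GIL, one process can only run one thread's Python code at a time, so multiple worker processes are what actually use the extra cores.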
The router sometimes gets into a bad state where it takes forever to handle /queue/receive requests, e.g. 30-75s when they should average .5-2s or so. I don't understand what's going on here yet, or why it only happens sometimes. It seems maybe loosely related to the number of WSGI workers and threads per worker: it seems worse with one worker with 100 threads, better with five workers with 10 threads each, but only somewhat, and I'm not 100% sure of the correlation.
Maybe it's context-switching overhead between threads? But the slowdown seems way too drastic to be caused by that alone. Another theory is that the thread pool gets stuck on tasks that need HTTP requests to external servers that are down or very slow, and either our per-request timeout is too long, or it's fine but we attempt a lot of different outbound requests per task, so these tasks starve the other tasks. That theory feels unsatisfying too, but I don't have any other theories yet. Hrmph.
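To put rough numbers on that second theory: if a stuck task makes its outbound requests serially, it holds its thread for up to requests × timeout, even when each individual timeout looks reasonable. A back-of-the-envelope sketch, with entirely hypothetical numbers (neither value is measured from the router):

```python
# Hypothetical numbers -- not measured from the router.
PER_REQUEST_TIMEOUT_S = 30       # per-outbound-request timeout
OUTBOUND_REQUESTS_PER_TASK = 5   # serial external fetches in one task

# Worst case: a task talking to a dead/slow server holds its thread
# for the full timeout on every fetch before giving up.
worst_case_s = PER_REQUEST_TIMEOUT_S * OUTBOUND_REQUESTS_PER_TASK
print(worst_case_s)  # 150
```

At 150s per bad task, a handful of them could tie up a big chunk of a 50-thread pool for minutes at a time, which is at least in the right ballpark for the 30-75s latencies above.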