rackerlabs / otter

Rackspace Auto Scale
http://www.rackspace.com/cloud/auto-scale/

High convergence gathering time #1960

Open manishtomar opened 7 years ago

manishtomar commented 7 years ago

I am noticing very high gathering times in prod for many groups. In the affected convergence cycles, the individual requests to upstream services and the calls to CASS are very fast (<1ms), but the total time taken is sometimes >300s. The only cause I can think of is throttling, but that should only come into play (for CLB, say) when there are many CLBs in one tenant, and the logs show that is not the case here. Need to investigate more.

manishtomar commented 7 years ago

This seems to be caused by having many groups in one tenant. When all of those groups try to converge at the same time, they issue too many concurrent GET requests to CLB, and those requests get throttled. One fix is to remove GET throttling, but I am sure we added it at the CLB team's request. The other solution is to make only one request and return its result to all the callers, similar to what the wait decorator does when authenticating a tenant.
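
A rough sketch of that second option (one upstream request shared by all concurrent callers). This is not otter's code: it uses plain `asyncio` rather than Twisted so it runs standalone, and `Coalescer`, `fake_list_clbs` and the tenant id are hypothetical names invented for illustration.

```python
import asyncio


class Coalescer:
    """Share one in-flight call per key among all concurrent callers."""

    def __init__(self, fetch):
        self._fetch = fetch    # async function: key -> result
        self._in_flight = {}   # key -> asyncio.Task for the pending call

    async def get(self, key):
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key: start the real request.
            task = asyncio.create_task(self._fetch(key))
            self._in_flight[key] = task
            # Forget the task once it finishes so later calls fetch fresh data.
            task.add_done_callback(lambda _t: self._in_flight.pop(key, None))
        # shield() keeps one cancelled caller from cancelling the shared request.
        return await asyncio.shield(task)


async def demo():
    calls = []

    async def fake_list_clbs(tenant_id):
        # Hypothetical stand-in for the throttled GET to CLB.
        calls.append(tenant_id)
        await asyncio.sleep(0.01)
        return ["clb-1", "clb-2"]

    coalescer = Coalescer(fake_list_clbs)
    # Five groups in the same tenant converge at the same time.
    results = await asyncio.gather(
        *(coalescer.get("tenant-a") for _ in range(5))
    )
    return calls, results
```

Running `asyncio.run(demo())` should show a single upstream call for the five callers. Results are not cached after the request completes, so the next convergence cycle would still see fresh data; only requests that overlap in time are merged.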