plotly / orca

Command line application for generating static images of interactive plotly charts
MIT License

Autoscaling for imageservers #42

Open · scjody opened this issue 6 years ago

scjody commented 6 years ago

When image-exporter is run via Kubernetes as an imageserver, we attempted to autoscale it using the CPU usage metric with a target of 80%. This didn't work; CPU usage seems too bursty to be a useful metric. (Sometimes an image-exporter pod uses very little CPU even though it's busy processing a request.)
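For concreteness, here's a minimal sketch of that kind of autoscaler, written with the official Kubernetes Python client. The deployment name, namespace, and replica bounds are assumptions for illustration (the actual image-exporter config isn't shown in this issue), but the 80% CPU target matches what we tried:

```python
# Minimal sketch of an HPA with an 80% CPU utilization target, via the official
# Kubernetes Python client. Deployment name, namespace, and replica bounds are
# assumed for illustration only.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="image-exporter"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            kind="Deployment", name="image-exporter"
        ),
        min_replicas=2,
        max_replicas=12,
        # autoscaling/v1 only supports a CPU utilization target; this is the
        # "CPU usage metric set to 80%" described above.
        target_cpu_utilization_percentage=80,
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)
```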

As an example, here's the CPU usage of one pod over 24h. The pod was in service on Plotly Cloud production during the first 8 or so hours.

[Screenshot: CPU usage graph for one image-exporter pod over 24 hours]

I'm going to look into other metrics we could use. It might also be OK to use CPU usage but with a lower threshold, like 50%. As a workaround, we could also disable autoscaling temporarily and just run a large number of pods; this wouldn't be any worse than what we currently do for Plotly Cloud prod where we have 12 imageservers running at all times.

@etpinard @monfera @bpostlethwaite FYI

scjody commented 6 years ago

Note: this issue was first noted when we tried to use the new imageservers for Plotly Cloud prod. The problems encountered are discussed starting here: https://github.com/plotly/streambed/issues/9865#issuecomment-349995119

scjody commented 6 years ago

I logged a GCP support case for help figuring out the units in CPU graphs such as this one (where the value shown is just above 1.0 most of the time):

[Screenshot: GCP Workloads CPU chart from 2017-12-11, hovering just above 1.0 most of the time]

Another thing I could try is increasing or decreasing the pod count to see how the graph is affected, since I'm unsure whether it's an aggregate or an average value.

scjody commented 6 years ago

From GCP support:

The CPU chart on the Workloads page is an aggregate of CPU usage for managed pods. The values are taken from the Stackdriver Monitoring metric container/cpu/usage_time. That metric represents ‘Cumulative CPU usage on all cores in seconds. This number divided by the elapsed time represents usage as a number of cores, regardless of any core limit that might be set’

So for the graph immediately above, the numbers need to be divided by 3 (presumably the number of pods aggregated into that chart) to get each pod's usage in cores (and multiplied by 100 if you're into percentages). So CPU usage for each pod in that case was 30% to 40% of a core.
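To make the conversion concrete, here's a small worked example of that arithmetic; the sample numbers are made up, chosen to land in the same range as the chart above:

```python
# Worked example of the conversion GCP support describes:
# container/cpu/usage_time is cumulative CPU-seconds, so the delta between two
# samples divided by elapsed wall-clock time gives usage in cores; dividing by
# the number of pods in the chart gives per-pod usage. Sample values are made up.

usage_t0 = 5_000.0    # cumulative CPU-seconds at first sample
usage_t1 = 5_198.0    # cumulative CPU-seconds at second sample
elapsed = 180.0       # wall-clock seconds between samples
num_pods = 3          # pods aggregated into the Workloads chart

aggregate_cores = (usage_t1 - usage_t0) / elapsed  # ~1.1 cores, like the chart
per_pod_cores = aggregate_cores / num_pods         # ~0.37 cores per pod
print(f"aggregate: {aggregate_cores:.2f} cores, per pod: {per_pod_cores:.0%}")
```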

The next step here is to increase the request rate and see if there's a consistent CPU usage value beyond which pods become unresponsive.
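Something like the following could drive that experiment. The service URL, port, and request payload here are assumptions rather than anything from this issue, so treat it as a sketch of the shape of the test; the idea is to ramp the request rate while watching the per-pod CPU metric:

```python
# Rough load-ramp sketch for the "increase request rate" experiment.
# URL, port, and payload shape are assumed; adjust to however the imageserver
# is actually exposed.
import time
import requests

URL = "http://image-exporter:9091"                   # assumed service name and port
FIGURE = {"data": [{"y": [1, 3, 2]}], "layout": {}}  # trivial plotly figure

for rate in (1, 2, 5, 10, 20):                       # attempted requests per second
    ok = failed = 0
    latencies = []
    start = time.time()
    while time.time() - start < 30:                  # hold each rate for 30 s
        t0 = time.time()
        try:
            r = requests.post(URL, json={"figure": FIGURE, "format": "png"}, timeout=10)
            ok += r.status_code == 200
            failed += r.status_code != 200
        except requests.RequestException:
            failed += 1
        latencies.append(time.time() - t0)
        time.sleep(max(0.0, 1.0 / rate - latencies[-1]))
    print(f"{rate} req/s: ok={ok} failed={failed} "
          f"mean latency={sum(latencies) / len(latencies):.2f}s")
```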

However, if we can get reliable results with 12 imageservers, I'm inclined to go with that for now and deal with this issue after the other imageserver issues. That will let us get it onto prod more quickly, increasing the chances that we can confidently ship it with On-Prem 2.3.0.