knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

Autoscaler makes *really* bad decision when exiting "panic" mode #3153

Closed · mattmoor closed 5 years ago

mattmoor commented 5 years ago

In what area(s)?

/area autoscale

What version of Knative?

HEAD

Expected Behavior

I would expect the calm(?) autoscaler to better respect the plans laid by the panic autoscaler.

Actual Behavior

This screenshot paints a grim picture: [screenshot]

We're trying to hit 1000 pods, but with only ~40 ready, we stop panicking after 60 seconds and the calm autoscaler decides to almost wholly undo the scale-up.
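
For context, here is a rough Go sketch (simplified and hypothetical, not the actual autoscaler code; the function and parameter names are made up for illustration) of the two-mode decision being described: the stable ("calm") autoscaler works off a long averaging window, the panic autoscaler off a short one, and while panicking the autoscaler holds its high-water mark instead of scaling down.

package sketch

import "math"

// desiredScale illustrates a stable/panic split: stableConcurrency and
// panicConcurrency stand for concurrency averaged over a long (e.g. 60s)
// and a short window respectively, and target is the per-pod concurrency target.
func desiredScale(stableConcurrency, panicConcurrency, target float64, current int, panicking bool) int {
    stableWant := int(math.Ceil(stableConcurrency / target))
    panicWant := int(math.Ceil(panicConcurrency / target))
    if panicking {
        // While panicking, never drop below the scale already reached.
        if panicWant > current {
            return panicWant
        }
        return current
    }
    // The moment panic ends, the stable calculation takes over, which is
    // exactly where a bad stable signal can undo the panic scale-up.
    return stableWant
}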

Steps to Reproduce the Problem

  1. Change the istio.yaml gateway's HPA to maxReplicas: 10
  2. Change the scale-up rate in config-autoscaler to 1000
  3. Set up config-domain with mattmoor.io
  4. Scale the Activator to 5 replicas (optional; I saw this with and without it)

Deploy the Knative Service:

apiVersion: serving.knative.dev/v1alpha1
kind: Service
metadata:
  name: autoscale-go
  namespace: default
spec:
  runLatest:
    configuration:
      revisionTemplate:
        spec:
          container:
            image: gcr.io/knative-samples/autoscale-go:0.1
            # This is set so low to pack as many pods on the cluster as possible
            resources:
              requests:
                cpu: 10m
                memory: 20Mi
              limits:
                cpu: 10m
                memory: 50Mi
          # This allows us to really push the pod counts up.
          containerConcurrency: 1

Wait for the above to hit zero pods, and then run:

docker run --rm williamyeh/wrk -t1000 -c1000 -d20m --timeout 200s http://autoscale-go.default.mattmoor.io/?sleep=1000

I generally see things go over this cliff pretty quickly, so the full 20m duration is likely unnecessary.

/assign @josephburnett @k4leung4

mattmoor commented 5 years ago

/milestone Serving 0.6

Talked to @markusthoemmes and we think this has been addressed in 0.5. Discussions continue on how to improve the signal (e.g. #3301), but the autoscaler should now bias towards overprovisioning, avoiding the traffic loss.

I'm leaving this open and moving to 0.6 to track adding testing for this scenario.

mattmoor commented 5 years ago

/unassign @josephburnett @k4leung4
/assign @hohaichi

mattmoor commented 5 years ago

cc @markusthoemmes

I'm seeing this at HEAD again :(

mattmoor commented 5 years ago

I deployed the autoscale-go sample with containerConcurrency: 80.

After it scales to zero, I launch the load test:

docker run --rm williamyeh/wrk -t1000 -c1000 -d4m --timeout 30s \
   http://autoscale-go.default.35.230.47.31.xip.io?sleep=10

While that ran, I had kubectl get deploy --watch running in a separate window; here's the output (annotated):

autoscale-go-mph9z-deployment   0/0     0            0           8m6s
autoscale-go-mph9z-deployment   0/1     0            0           39m
autoscale-go-mph9z-deployment   0/1     0            0           39m
autoscale-go-mph9z-deployment   0/1     0            0           39m
autoscale-go-mph9z-deployment   0/1     1            0           39m
autoscale-go-mph9z-deployment   0/4     1            0           39m
autoscale-go-mph9z-deployment   0/4     1            0           39m
autoscale-go-mph9z-deployment   0/4     1            0           39m
autoscale-go-mph9z-deployment   0/4     4            0           39m
autoscale-go-mph9z-deployment   0/5     4            0           39m
autoscale-go-mph9z-deployment   0/5     4            0           39m
autoscale-go-mph9z-deployment   0/5     4            0           39m
autoscale-go-mph9z-deployment   0/5     5            0           39m
autoscale-go-mph9z-deployment   1/5     5            1           39m
autoscale-go-mph9z-deployment   2/5     5            2           39m
autoscale-go-mph9z-deployment   3/5     5            3           39m
autoscale-go-mph9z-deployment   4/5     5            4           39m
autoscale-go-mph9z-deployment   5/5     5            5           39m
autoscale-go-mph9z-deployment   5/6     5            5           40m
autoscale-go-mph9z-deployment   5/6     5            5           40m
autoscale-go-mph9z-deployment   5/6     5            5           40m
autoscale-go-mph9z-deployment   5/6     6            5           40m
autoscale-go-mph9z-deployment   6/6     6            6           40m
autoscale-go-mph9z-deployment   6/7     6            6           40m
autoscale-go-mph9z-deployment   6/7     6            6           40m
autoscale-go-mph9z-deployment   6/7     6            6           40m
autoscale-go-mph9z-deployment   6/7     7            6           40m
autoscale-go-mph9z-deployment   7/7     7            7           40m
autoscale-go-mph9z-deployment   7/9     7            7           40m
autoscale-go-mph9z-deployment   7/9     7            7           40m
autoscale-go-mph9z-deployment   7/9     7            7           40m
autoscale-go-mph9z-deployment   7/9     9            7           40m
autoscale-go-mph9z-deployment   8/9     9            8           40m
autoscale-go-mph9z-deployment   9/9     9            9           40m
autoscale-go-mph9z-deployment   9/1     9            9           41m     <-- WTF JUST HAPPENED
autoscale-go-mph9z-deployment   9/1     9            9           41m
autoscale-go-mph9z-deployment   4/1     4            4           41m
autoscale-go-mph9z-deployment   1/1     1            1           41m
autoscale-go-mph9z-deployment   1/3     1            1           41m
autoscale-go-mph9z-deployment   1/3     1            1           41m
autoscale-go-mph9z-deployment   1/3     1            1           41m
autoscale-go-mph9z-deployment   1/3     3            1           41m
autoscale-go-mph9z-deployment   1/4     3            1           41m
autoscale-go-mph9z-deployment   1/4     3            1           41m
autoscale-go-mph9z-deployment   1/4     3            1           41m
autoscale-go-mph9z-deployment   1/4     4            1           41m
autoscale-go-mph9z-deployment   2/4     4            2           41m
autoscale-go-mph9z-deployment   3/4     4            3           41m
autoscale-go-mph9z-deployment   4/4     4            4           41m
autoscale-go-mph9z-deployment   4/7     4            4           41m
autoscale-go-mph9z-deployment   4/7     4            4           41m
autoscale-go-mph9z-deployment   4/7     4            4           41m
autoscale-go-mph9z-deployment   4/7     7            4           41m
autoscale-go-mph9z-deployment   5/7     7            5           41m
autoscale-go-mph9z-deployment   6/7     7            6           41m
autoscale-go-mph9z-deployment   6/8     7            6           41m
autoscale-go-mph9z-deployment   6/8     7            6           41m
autoscale-go-mph9z-deployment   6/8     7            6           41m
autoscale-go-mph9z-deployment   7/8     7            7           41m
autoscale-go-mph9z-deployment   7/8     8            7           41m
autoscale-go-mph9z-deployment   8/8     8            8           41m
autoscale-go-mph9z-deployment   8/18    8            8           42m
autoscale-go-mph9z-deployment   8/18    8            8           42m
autoscale-go-mph9z-deployment   8/18    8            8           42m
autoscale-go-mph9z-deployment   8/18    18           8           42m
autoscale-go-mph9z-deployment   9/18    18           9           42m
autoscale-go-mph9z-deployment   10/18   18           10          42m
autoscale-go-mph9z-deployment   11/18   18           11          42m
autoscale-go-mph9z-deployment   12/18   18           12          42m
autoscale-go-mph9z-deployment   13/18   18           13          42m
autoscale-go-mph9z-deployment   14/18   18           14          42m
autoscale-go-mph9z-deployment   15/18   18           15          42m
autoscale-go-mph9z-deployment   16/18   18           16          42m
autoscale-go-mph9z-deployment   17/18   18           17          42m
autoscale-go-mph9z-deployment   18/18   18           18          42m
autoscale-go-mph9z-deployment   18/1    18           18          43m   <-- Load test ends
autoscale-go-mph9z-deployment   18/1    18           18          43m
autoscale-go-mph9z-deployment   1/1     1            1           43m

mattmoor commented 5 years ago

Here's a theory: are we sure that the activator Pod(s) are represented in each time bucket?

https://github.com/knative/serving/blob/95e6390a1153ebb8219b1c9b145c9b1239116d10/pkg/autoscaler/autoscaler.go#L80

mattmoor commented 5 years ago

It looks like the activator reports every second in theory, and that we truncate buckets at a 2s window, so we SHOULD always see the activator pods.
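
As a quick sanity check on that arithmetic (the 1s report period and 2s bucket window are taken from the comment above; the names below are made up for illustration):

package sketch

import "time"

// With a 1s report period and 2s buckets, each bucket spans two report
// ticks, so every activator pod should land in every bucket it overlaps.
const (
    activatorReportPeriod = 1 * time.Second
    bucketWindow          = 2 * time.Second
)

func reportsPerBucket() int {
    return int(bucketWindow / activatorReportPeriod) // 2
}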

mattmoor commented 5 years ago

Ok, I instrumented this a bit and it looks like the problem may actually be the way we aggregate stats from the scraper. The stats from the activator pods look perfectly fine, but the ones from the scraper all have the same PodName. This means that all of the stats coming from the Pods get lumped together and they are all averaged to determine concurrency instead of being summed 😲 .

This means the oscillation we see happens because the only times we get an accurate concurrency read are when the stats come from the activator or from a single pod. Those readings prompt a scale-up to the appropriate size (and a panic); once at that size, each pod sits at the desired concurrency, which, given the bug above, means that when we exit panic mode we always think we need exactly one pod.
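
To make the failure mode concrete, here's a small, self-contained Go sketch (hypothetical types and names, not the autoscaler's actual code) of how per-PodName aggregation that averages within a key and sums across keys collapses into a plain average when every scraped stat reuses the same PodName:

package main

import "fmt"

// Stat is a simplified stand-in for a per-pod metric report.
type Stat struct {
    PodName            string
    AverageConcurrency float64
}

// totalConcurrency averages the reports seen for each PodName, then sums
// across pods. With distinct pod names this yields the system total; if
// every scraped stat carries the same PodName, the whole fleet collapses
// into one key and the "total" is just one pod's worth of load.
func totalConcurrency(stats []Stat) float64 {
    sums := map[string]float64{}
    counts := map[string]int{}
    for _, s := range stats {
        sums[s.PodName] += s.AverageConcurrency
        counts[s.PodName]++
    }
    total := 0.0
    for name, sum := range sums {
        total += sum / float64(counts[name])
    }
    return total
}

func main() {
    var distinct, collapsed []Stat
    for i := 0; i < 10; i++ {
        // Ten pods, each handling one request.
        distinct = append(distinct, Stat{PodName: fmt.Sprintf("pod-%d", i), AverageConcurrency: 1})
        collapsed = append(collapsed, Stat{PodName: "scraper", AverageConcurrency: 1})
    }
    fmt.Println(totalConcurrency(distinct))  // 10: the real system concurrency
    fmt.Println(totalConcurrency(collapsed)) // 1: looks like a single, barely loaded pod
}

Run against ten pods each handling one request, the distinct-name case reports 10 while the collapsed-name case reports 1, which is exactly the "we always think we need one pod" behavior described above.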

mattmoor commented 5 years ago

Here's what effectively turns our sum into an average: https://github.com/knative/serving/blob/95e6390a1153ebb8219b1c9b145c9b1239116d10/pkg/autoscaler/autoscaler.go#L90

mattmoor commented 5 years ago

I see that we try to compensate for this here: https://github.com/knative/serving/blob/95e6390a1153ebb8219b1c9b145c9b1239116d10/pkg/autoscaler/stats_scraper.go#L147-L150 (not sure why it isn't working)

mattmoor commented 5 years ago

Ok, so at least part of this is that wrk is ignoring my querystring. Going to try with hey now. 🤦‍♂

mattmoor commented 5 years ago

With hey things look much better. At 100ms we jump to 7 replicas and stay there for the duration. At 10ms we jump to 7 replicas and after a while drop to 3-4 replicas (with some jitter).

I think we still have accounting problems when we're dealing with extraordinarily fast user-container requests, but this isn't as pervasive a problem as I had first imagined. Too bad I wasted so much time with wrk :(

markusthoemmes commented 5 years ago

@mattmoor good observations, thanks.

The autoscaler currently works off of "system total" values for concurrency, meaning it treats the values it receives as the total for the whole system.

The scraper doesn't care about which pod is reporting metrics, so it extrapolates the values it reads to be representative of the whole system. As we're doing sampling, some form of extrapolation will need to happen at some point.
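
As a rough sketch of that extrapolation (hypothetical helper, not the scraper's actual code): because only a sample of pods is scraped, the scraper can report a mean concurrency for the sampled pods and scale it by the number of ready pods to estimate the system total, which is why individual PodNames stop mattering at that layer.

package sketch

// estimateSystemConcurrency extrapolates from a sampled mean to a system
// total: e.g. a sampled mean of 1.0 across 200 ready pods yields an
// estimated system concurrency of 200.
func estimateSystemConcurrency(sampledMeanConcurrency float64, readyPods int) float64 {
    return sampledMeanConcurrency * float64(readyPods)
}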

mattmoor commented 5 years ago

@markusthoemmes I just wonder if we'd have things this way if we were starting from scratch? As a name, PodName is very misleading in this context.

hohaichi commented 5 years ago

I think @yanweiguo is making scraping more accurate: https://github.com/knative/serving/pull/3831.

I also think that scaling up and scaling down should not be treated the same way (as they are now). Scaling up addresses reliability, so it should be instantaneous. Scaling down saves costs, so it could be delayed or done gradually. What do you think?
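
To make the suggestion concrete, here's a hypothetical Go sketch (made-up names, not Knative code) of treating the two directions asymmetrically: scale-ups apply immediately, scale-downs only after the lower recommendation has persisted for a grace period.

package sketch

import "time"

// asymmetricScaler reacts to scale-ups instantly (reliability) but delays
// scale-downs until the lower target has been stable for scaleDownDelay (cost).
type asymmetricScaler struct {
    current        int
    pendingTarget  int
    pendingSince   time.Time
    scaleDownDelay time.Duration
}

func (s *asymmetricScaler) desired(target int, now time.Time) int {
    if target >= s.current {
        // Scaling up: act immediately.
        s.current = target
        s.pendingTarget = target
        return s.current
    }
    // Scaling down: restart the grace period whenever the recommendation changes.
    if target != s.pendingTarget {
        s.pendingTarget = target
        s.pendingSince = now
    }
    if now.Sub(s.pendingSince) >= s.scaleDownDelay {
        s.current = target
    }
    return s.current
}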

mdemirhan commented 5 years ago

The remaining work here is to add tests to make sure that this won't regress in the future, right?

hohaichi commented 5 years ago

Yes, adding tests is my plan for this issue. I think the tests may be flaky, though, given the nature of sampling. The tests will help assess how flaky they are and whether or not we need to adjust the scaling algorithm.