Closed: mattmoor closed this issue 5 years ago.
/milestone Serving 0.6
Talked to @markusthoemmes and we think this has been addressed in 0.5. Discussions continue on how to improve the signal (e.g. #3301), but the autoscaler should now bias towards overprovisioning, avoiding the traffic loss.
I'm leaving this open and moving to 0.6 to track adding testing for this scenario.
/unassign @josephburnett @k4leung4
/assign @hohaichi
cc @markusthoemmes
I'm seeing this at HEAD again :(
I deployed the autoscale-go sample with containerConcurrency: 80.
After it scales to zero, I launch the load test:
docker run --rm williamyeh/wrk -t1000 -c1000 -d4m --timeout 30s \
http://autoscale-go.default.35.230.47.31.xip.io?sleep=10
While that ran, I had kubectl get deploy --watch going in a separate window; here's the output (annotated):
NAME READY UP-TO-DATE AVAILABLE AGE
autoscale-go-mph9z-deployment 0/0 0 0 8m6s
autoscale-go-mph9z-deployment 0/1 0 0 39m
autoscale-go-mph9z-deployment 0/1 0 0 39m
autoscale-go-mph9z-deployment 0/1 0 0 39m
autoscale-go-mph9z-deployment 0/1 1 0 39m
autoscale-go-mph9z-deployment 0/4 1 0 39m
autoscale-go-mph9z-deployment 0/4 1 0 39m
autoscale-go-mph9z-deployment 0/4 1 0 39m
autoscale-go-mph9z-deployment 0/4 4 0 39m
autoscale-go-mph9z-deployment 0/5 4 0 39m
autoscale-go-mph9z-deployment 0/5 4 0 39m
autoscale-go-mph9z-deployment 0/5 4 0 39m
autoscale-go-mph9z-deployment 0/5 5 0 39m
autoscale-go-mph9z-deployment 1/5 5 1 39m
autoscale-go-mph9z-deployment 2/5 5 2 39m
autoscale-go-mph9z-deployment 3/5 5 3 39m
autoscale-go-mph9z-deployment 4/5 5 4 39m
autoscale-go-mph9z-deployment 5/5 5 5 39m
autoscale-go-mph9z-deployment 5/6 5 5 40m
autoscale-go-mph9z-deployment 5/6 5 5 40m
autoscale-go-mph9z-deployment 5/6 5 5 40m
autoscale-go-mph9z-deployment 5/6 6 5 40m
autoscale-go-mph9z-deployment 6/6 6 6 40m
autoscale-go-mph9z-deployment 6/7 6 6 40m
autoscale-go-mph9z-deployment 6/7 6 6 40m
autoscale-go-mph9z-deployment 6/7 6 6 40m
autoscale-go-mph9z-deployment 6/7 7 6 40m
autoscale-go-mph9z-deployment 7/7 7 7 40m
autoscale-go-mph9z-deployment 7/9 7 7 40m
autoscale-go-mph9z-deployment 7/9 7 7 40m
autoscale-go-mph9z-deployment 7/9 7 7 40m
autoscale-go-mph9z-deployment 7/9 9 7 40m
autoscale-go-mph9z-deployment 8/9 9 8 40m
autoscale-go-mph9z-deployment 9/9 9 9 40m
autoscale-go-mph9z-deployment 9/1 9 9 41m <-- WTF JUST HAPPENED
autoscale-go-mph9z-deployment 9/1 9 9 41m
autoscale-go-mph9z-deployment 4/1 4 4 41m
autoscale-go-mph9z-deployment 1/1 1 1 41m
autoscale-go-mph9z-deployment 1/3 1 1 41m
autoscale-go-mph9z-deployment 1/3 1 1 41m
autoscale-go-mph9z-deployment 1/3 1 1 41m
autoscale-go-mph9z-deployment 1/3 3 1 41m
autoscale-go-mph9z-deployment 1/4 3 1 41m
autoscale-go-mph9z-deployment 1/4 3 1 41m
autoscale-go-mph9z-deployment 1/4 3 1 41m
autoscale-go-mph9z-deployment 1/4 4 1 41m
autoscale-go-mph9z-deployment 2/4 4 2 41m
autoscale-go-mph9z-deployment 3/4 4 3 41m
autoscale-go-mph9z-deployment 4/4 4 4 41m
autoscale-go-mph9z-deployment 4/7 4 4 41m
autoscale-go-mph9z-deployment 4/7 4 4 41m
autoscale-go-mph9z-deployment 4/7 4 4 41m
autoscale-go-mph9z-deployment 4/7 7 4 41m
autoscale-go-mph9z-deployment 5/7 7 5 41m
autoscale-go-mph9z-deployment 6/7 7 6 41m
autoscale-go-mph9z-deployment 6/8 7 6 41m
autoscale-go-mph9z-deployment 6/8 7 6 41m
autoscale-go-mph9z-deployment 6/8 7 6 41m
autoscale-go-mph9z-deployment 7/8 7 7 41m
autoscale-go-mph9z-deployment 7/8 8 7 41m
autoscale-go-mph9z-deployment 8/8 8 8 41m
autoscale-go-mph9z-deployment 8/18 8 8 42m
autoscale-go-mph9z-deployment 8/18 8 8 42m
autoscale-go-mph9z-deployment 8/18 8 8 42m
autoscale-go-mph9z-deployment 8/18 18 8 42m
autoscale-go-mph9z-deployment 9/18 18 9 42m
autoscale-go-mph9z-deployment 10/18 18 10 42m
autoscale-go-mph9z-deployment 11/18 18 11 42m
autoscale-go-mph9z-deployment 12/18 18 12 42m
autoscale-go-mph9z-deployment 13/18 18 13 42m
autoscale-go-mph9z-deployment 14/18 18 14 42m
autoscale-go-mph9z-deployment 15/18 18 15 42m
autoscale-go-mph9z-deployment 16/18 18 16 42m
autoscale-go-mph9z-deployment 17/18 18 17 42m
autoscale-go-mph9z-deployment 18/18 18 18 42m
autoscale-go-mph9z-deployment 18/1 18 18 43m <-- Load test ends
autoscale-go-mph9z-deployment 18/1 18 18 43m
autoscale-go-mph9z-deployment 1/1 1 1 43m
Here's a theory: are we sure that the activator Pod(s) are represented in each time bucket?
It looks like the activator reports every second, and we truncate buckets at a 2s window, so in theory we SHOULD always see the activator pods.
Ok, I instrumented this a bit and it looks like the problem may actually be the way we aggregate stats from the scraper. The stats from the activator pods look perfectly fine, but the ones from the scraper all have the same PodName. This means that all of the stats coming from the Pods get lumped together and averaged to determine concurrency instead of being summed 😲.
So the oscillation we see happens because the only times we get an accurate concurrency read are when the stats come from the activator or from a single pod. Those reads prompt a scale-up to the appropriate size (and a panic); once we are at that size, each pod carries roughly the desired concurrency, which, given the bug above, means that when we exit panic mode we always think we need exactly one pod.
Here's what effectively turns our sum into an average: https://github.com/knative/serving/blob/95e6390a1153ebb8219b1c9b145c9b1239116d10/pkg/autoscaler/autoscaler.go#L90
I see that we try to compensate for this here: https://github.com/knative/serving/blob/95e6390a1153ebb8219b1c9b145c9b1239116d10/pkg/autoscaler/stats_scraper.go#L147-L150 (not sure why it isn't working)
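To make the failure mode concrete, here is a minimal, self-contained sketch (hypothetical types and names, not the actual autoscaler code) of how averaging within PodName buckets collapses nine pods' worth of load into one pod's worth once every scraped stat carries the same PodName:

```go
package main

import "fmt"

// Stat is a simplified stand-in for a per-pod metric report.
type Stat struct {
	PodName                   string
	AverageConcurrentRequests float64
}

// aggregateConcurrency buckets stats by PodName, averages within each bucket,
// and sums the per-pod averages: the intended "sum over pods" behavior.
func aggregateConcurrency(stats []Stat) float64 {
	sums := map[string]float64{}
	counts := map[string]int{}
	for _, s := range stats {
		sums[s.PodName] += s.AverageConcurrentRequests
		counts[s.PodName]++
	}
	total := 0.0
	for pod, sum := range sums {
		total += sum / float64(counts[pod]) // average per PodName
	}
	return total
}

func main() {
	// Nine pods each serving ~10 concurrent requests, but every scraped stat
	// carries the same PodName, so they all collapse into a single bucket.
	var scraped []Stat
	for i := 0; i < 9; i++ {
		scraped = append(scraped, Stat{PodName: "scraper", AverageConcurrentRequests: 10})
	}
	fmt.Println(aggregateConcurrency(scraped)) // prints 10, not the ~90 actually being served
}
```

With distinct PodNames the same aggregation would return ~90; with one shared name the nine reports land in a single bucket and the autoscaler believes a single pod's worth of load exists, which matches the "we always think we need exactly one pod" behavior above.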
Ok, so at least part of this is that wrk is ignoring my querystring. Going to try with hey now. 🤦‍♂️
With hey things look much better. At 100ms we jump to 7 replicas and stay there for the duration. At 10ms we jump to 7 replicas and after a while drop to 3-4 replicas (with some jitter).
I think we still have accounting problems when we're dealing with extraordinarily fast user-container requests, but this isn't as pervasive a problem as I had first imagined. Too bad I wasted so much time with wrk :(
@mattmoor good observations, thanks.
The autoscaler currently works off of "system total" values for concurrency, meaning it treats the values it receives as totals for the whole system.
The scraper doesn't care about which pod is reporting metrics, so it extrapolates the values it reads to be representative of the whole system. Since we're sampling, some form of extrapolation will need to happen at some point.
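For readers following along, here is a rough sketch of the kind of extrapolation being described (illustrative only; the function name and shape are assumptions, not the scraper's actual code):

```go
package main

import "fmt"

// estimateTotalConcurrency scales the average concurrency observed on a small
// sample of pods up to an estimate for the whole deployment.
func estimateTotalConcurrency(sampled []float64, readyPods int) float64 {
	if len(sampled) == 0 || readyPods == 0 {
		return 0
	}
	sum := 0.0
	for _, c := range sampled {
		sum += c
	}
	return sum / float64(len(sampled)) * float64(readyPods) // sample average x ready pod count
}

func main() {
	// Scrape 3 of 9 ready pods and extrapolate their average to the system.
	fmt.Println(estimateTotalConcurrency([]float64{9, 11, 10}, 9)) // 90
}
```

The extrapolation itself is a reasonable consequence of sampling; the surprise upthread was how that system-total-style value interacted with the per-PodName aggregation.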
@markusthoemmes I just wonder whether we'd have built things this way if we were starting from scratch. PodName is a very misleading name in this context.
I think @yanweiguo is making scraping more accurate: https://github.com/knative/serving/pull/3831.
I also think that scaling up and scaling down should not be treated the same (as we do now). Scaling up is about reliability and should be instantaneous; scaling down is about saving cost and could be delayed or done gradually. What do you think?
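To illustrate the asymmetry being proposed, a small sketch (assumed names and rate, not an implementation proposal):

```go
package main

import "fmt"

// nextScale treats scale-up and scale-down asymmetrically: increases take
// effect immediately, while decreases are capped to 25% of the current size
// per decision. The 25% rate is an arbitrary assumption for illustration.
func nextScale(current, desired int) int {
	if desired >= current {
		return desired // scale up instantly for reliability
	}
	maxStepDown := current / 4
	if maxStepDown < 1 {
		maxStepDown = 1
	}
	if current-desired > maxStepDown {
		return current - maxStepDown // scale down gradually to save cost
	}
	return desired
}

func main() {
	fmt.Println(nextScale(4, 18)) // 18: jump straight up
	fmt.Println(nextScale(18, 1)) // 14: step down instead of collapsing to 1
}
```

With something like this, the 9 -> 1 and 18 -> 1 drops in the watch output above would instead step down over several evaluation periods.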
The remaining work here is to add tests to make sure that this won't regress in the future, right?
Yes, adding tests is my plan for this issue. I think the tests may be flaky, though, given the sampling involved. The tests will help us assess how flaky it is and whether we need to adjust the scaling algorithm.
In what area(s)?
/area autoscale
What version of Knative?
HEAD
Expected Behavior
I would expect the calm(?) autoscaler to better respect the plans laid by the panic autoscaler.
Actual Behavior
This screenshot paints a grim picture:
We're trying to hit 1000 pods with only ~40 ready, and after 60 seconds we stop panicking and the calm autoscaler decides to almost wholly undo things.
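For illustration, a sketch of what "respecting the plans laid by the panic autoscaler" might look like (hypothetical names, not the actual Knative code):

```go
package main

import "fmt"

// desiredPods holds the target at the peak reached during the panic window
// while panicking, and only hands control back to the stable window once the
// panic has ended. Purely illustrative of the expected behavior.
func desiredPods(stableDesired, panicDesired int, panicking bool, panicPeak *int) int {
	if panicking {
		if panicDesired > *panicPeak {
			*panicPeak = panicDesired // remember the highest target seen while panicking
		}
		return *panicPeak // never scale down mid-panic
	}
	*panicPeak = 0 // panic over; the stable window takes back over
	return stableDesired
}

func main() {
	peak := 0
	fmt.Println(desiredPods(40, 1000, true, &peak)) // 1000 while panicking
	fmt.Println(desiredPods(40, 200, true, &peak))  // still 1000: don't undo the panic's work
	fmt.Println(desiredPods(40, 0, false, &peak))   // 40 once the panic window has passed
}
```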
Steps to Reproduce the Problem
1. Bump the istio.yaml gateway's HPA to maxReplicas: 10.
2. Bump config-autoscaler to 1000.
3. Configure config-domain with mattmoor.io.
4. Deploy the Knative Service:
5. Wait for the above to hit zero pods, and then run:
I generally see things go over this cliff pretty quickly, so the full 20m duration is likely unnecessary.
/assign @josephburnett @k4leung4