knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

Distinguish zero concurrency from slow/failed scraping when bucketing #8610

Open julz opened 3 years ago

julz commented 3 years ago

Describe the feature

Currently we do not differentiate between a scrape that actually reports zero concurrency from a replica and simply not having data for a particular bucket. This is fine if the network is fast and the autoscaler is not overloaded, because we will have data roughly every second, but on a slow or overloaded network (or, e.g., with a resource-constrained host leading to slow queue-proxy responses to scrapes) it could cause issues: when we average over the bucket we could think we have lower load than we actually do, and scale down (or fail to scale up) replicas incorrectly.

(This is somewhat related to https://github.com/knative/serving/issues/8377 in that if we introduce a work pool there's a greater danger of things backed up in the queue not getting stats every second).
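For illustration, a minimal sketch (hypothetical Go, not the actual autoscaler code) of how averaging over fixed buckets makes a missed scrape indistinguishable from a genuine zero-concurrency reading:

package main

import "fmt"

func main() {
    const windowSize = 6 // seconds in the averaging window

    // Hypothetical scrape results for one window; -1 marks a second where
    // the scrape was slow or failed, so no data exists for that bucket.
    scrapes := []float64{10, 10, -1, -1, 10, 10}

    sum, observed := 0.0, 0
    for _, c := range scrapes {
        if c < 0 {
            continue // missed scrape: no bucket data recorded
        }
        sum += c
        observed++
    }

    // Dividing by every bucket means the gaps count as zero concurrency.
    avgGapsAsZero := sum / windowSize
    // Averaging only over the buckets we actually scraped reflects real load.
    avgScrapedOnly := sum / float64(observed)

    fmt.Printf("average treating gaps as zero: %.2f\n", avgGapsAsZero)  // 6.67
    fmt.Printf("average over scraped buckets:  %.2f\n", avgScrapedOnly) // 10.00
}

With four real readings of 10 and two missed scrapes, the zero-filled average reports ~6.67 even though every second we actually observed saw a concurrency of 10.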

mattmoor commented 3 years ago

cc @markusthoemmes @vagababov for thoughts

vagababov commented 3 years ago

Actually it's the same as the @duglin issue we revisited earlier :)

mattmoor commented 3 years ago

Want to find it and dupe?

vagababov commented 3 years ago

Done.

julz commented 3 years ago

FWIW I think this isn't totally the same as https://github.com/knative/serving/issues/8390. In https://github.com/knative/serving/issues/8390 @duglin is scraping at the correct rate, but the pockets of zero concurrency lead us to end up with fewer replicas than we actually need for peak load (because the simulated workload is a GitHub trigger firing multiple parallel events every 10 seconds or so, and we average over the full window). That one can potentially be fixed by the max-vs-average flag we've been informally chatting about: I'll pull out a top-level issue for that now.

This one, I think, is slightly different. When the network is slow, or blips, we can get zero scaling data for a few seconds (or longer), and our current behaviour is to treat any gap in data as if we'd actually seen a concurrency of zero. This means that if the scraper loses connectivity to the pods, or the network is temporarily congested, we can start to rapidly scale down the workload as fast as max-scale-down-rate will let us. The 'max' switch described above, which would potentially help with bursty loads, would cope with this slightly better, but I think it's a cross-cutting problem we should solve in both cases: for example, by assuming the rolling average rather than 0 when we miss a scrape.

Edit: Spun out https://github.com/knative/serving/issues/9092.
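A rough sketch of that "assume the rolling average" idea (hypothetical Go, not the real autoscaler code): when a scrape misses, record the window's current rolling average for that bucket instead of implicitly recording zero.

package main

import "fmt"

// window holds per-second concurrency buckets.
type window struct {
    buckets []float64
}

func (w *window) average() float64 {
    if len(w.buckets) == 0 {
        return 0
    }
    sum := 0.0
    for _, b := range w.buckets {
        sum += b
    }
    return sum / float64(len(w.buckets))
}

// record adds one bucket; ok=false means the scrape failed or timed out.
func (w *window) record(concurrency float64, ok bool) {
    if !ok {
        // Assume load is unchanged rather than assuming it vanished.
        concurrency = w.average()
    }
    w.buckets = append(w.buckets, concurrency)
}

func main() {
    w := &window{}
    w.record(10, true)
    w.record(10, true)
    w.record(0, false) // network blip: scrape never arrived
    w.record(0, false) // another miss
    w.record(10, true)

    // Prints 10.00; counting the two misses as 0 would have given 6.00
    // and nudged the autoscaler toward scaling down.
    fmt.Printf("window average: %.2f\n", w.average())
}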

duglin commented 3 years ago

Since #8390 was closed, I want to add my test case to this one because I'm still seeing odd behavior even after #9092 was merged.

Script:

#!/bin/bash
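# Repro: create a CC=1 echo service, then repeatedly fire 50 parallel
# ?sleep=10 requests, timing each batch and counting Running echo pods.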

set -e
PAUSE=${1:-10}
echo "Erasing old service"
kn service delete echo > /dev/null 2>&1 && sleep 5
URL=`kn service create echo --image duglin/echo --concurrency-limit=1 | tail -1`
echo "Service: $URL"
echo "Pause: $PAUSE"
for i in `seq 1 20` ; do
  echo -n "$i : Running 50... "
  (time (
    for i in `seq 1 50` ; do
      curl -s ${URL}?sleep=10 > /dev/null &
    done
    wait )
  ) 2>&1 | grep real | tr '\n' '\0'
  echo -n " # pods: "
  kubectl get pods | grep echo.*Running | wc -l
  sleep $PAUSE
done
kn service delete echo

And the output I see today:

$ ./bug 
Erasing old service
Service: http://echo-default.kndev.us-south.containers.appdomain.cloud
Pause: 10
1 : Running 50... real  0m19.686s # pods: 72
2 : Running 50... real  0m10.168s # pods: 72
3 : Running 50... real  0m10.177s # pods: 72
4 : Running 50... real  0m10.181s # pods: 48
5 : Running 50... real  0m10.186s # pods: 72
6 : Running 50... real  0m10.162s # pods: 72
7 : Running 50... real  0m10.183s # pods: 36
8 : Running 50... real  0m10.190s # pods: 72
9 : Running 50... real  0m10.174s # pods: 72
10 : Running 50... real 0m10.192s # pods: 36
11 : Running 50... real 0m10.204s # pods: 72
12 : Running 50... real 0m10.171s # pods: 72
13 : Running 50... real 0m10.191s # pods: 36
14 : Running 50... real 0m20.176s # pods: 72
15 : Running 50... real 0m10.192s # pods: 72
16 : Running 50... real 0m10.189s # pods: 72
17 : Running 50... real 0m20.139s # pods: 72
18 : Running 50... real 0m10.186s # pods: 72
19 : Running 50... real 0m10.213s # pods: 72
20 : Running 50... real 0m10.188s # pods: 39
Service 'echo' successfully deleted in namespace 'default'.

Notice how the number of pods isn't consistent, and its dropping below 50 doesn't seem right. But the occasional 2x latency is obviously the biggest concern.

duglin commented 3 years ago

Using:

URL=`kn service create echo --image duglin/echo --concurrency-limit=1 \
  --annotation-revision autoscaling.knative.dev/scaleDownDelay=120s \
  --annotation-revision autoscaling.knative.dev/window=6s \
  | tail -1`

helped w.r.t. latency - it was around 10 seconds consistently. However, I had 72 pods the entire time, which just doesn't seem right when I only have 50 requests. Yes, I know that TU (70%) is probably why I get an extra 22 pods, but from a user's POV it's hard to explain. I wonder if we need to make it clearer that this "utilization" isn't just per pod but across all pods, and that it really should be looked at as some kind of "over-provisioning" flag. Then it's clear that anything other than 100% means they're asking for "extra" unused space. And this space is calculated across all pods, not just within one.

duglin commented 3 years ago

Meaning, (# of requests) * (CC/TU%) == # of pods they should see
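(For reference: assuming the usual KPA sizing of desired pods = ceil(concurrency / (CC × TU)), 50 concurrent requests with CC=1 and TU=70% give ceil(50 / 0.7) = 72 pods, which matches the counts above; with TU=100% that drops to 50.)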

vagababov commented 3 years ago

Run with TU=95%? :)

On Sat, Oct 17, 2020 at 9:39 AM Doug Davis notifications@github.com wrote:

Meaning, (# of requests) * (CC/TU%) == # of pods they should see


duglin commented 3 years ago

Just FYI:

$ cat bug
#!/bin/bash

set -e
PAUSE=${PAUSE:-20}
COUNT=${COUNT:-20}
SDD=${SDD:-6s}
TU=${TU:-100}
WIN=${WIN:-6s}
CC=${CC:-1}
echo "Erasing old service"
kn service delete echo > /dev/null 2>&1 && sleep 5
URL=`kn service create echo --image duglin/echo --concurrency-limit=$CC \
  --annotation-revision autoscaling.knative.dev/scaleDownDelay=$SDD \
  --annotation-revision autoscaling.knative.dev/targetUtilizationPercentage=$TU \
  --annotation-revision autoscaling.knative.dev/window=$WIN \
  | tail -1`
echo "Service: $URL"
echo "PAUSE: $PAUSE"
echo "CC: $CC"
echo "SDD: $SDD"
echo "TU: $TU"
echo "WIN: $WIN"

for i in `seq 1 $COUNT` ; do
  echo -n "$i : Running 50... "
  (time (
    for i in `seq 1 50` ; do
      curl -s ${URL}?sleep=10 > /dev/null &
    done
    wait )
  ) 2>&1 | grep real | tr '\n' '\0'
  echo -n " # pods: "
  kubectl get pods | grep echo.*Running | wc -l
  sleep $PAUSE
done
kn service delete echo
$ PAUSE=20 ./bug
Erasing old service
Service: http://echo-default.kndev.us-south.containers.appdomain.cloud
PAUSE: 20
CC: 1
SDD: 6s
TU: 100
WIN: 6s
1 : Running 50... real  0m18.245s # pods: 50
2 : Running 50... real  0m20.091s # pods: 49
3 : Running 50... real  0m20.108s # pods: 50
4 : Running 50... real  0m20.143s # pods: 50
5 : Running 50... real  0m19.347s # pods: 50
6 : Running 50... real  0m20.105s # pods: 49
7 : Running 50... real  0m20.155s # pods: 50
8 : Running 50... real  0m20.155s # pods: 50
9 : Running 50... real  0m20.163s # pods: 50
10 : Running 50... real 0m19.596s # pods: 50
11 : Running 50... real 0m20.126s # pods: 50
12 : Running 50... real 0m19.307s # pods: 50
13 : Running 50... real 0m20.131s # pods: 50
14 : Running 50... real 0m20.097s # pods: 50
15 : Running 50... real 0m19.604s # pods: 50
16 : Running 50... real 0m20.141s # pods: 50
17 : Running 50... real 0m20.132s # pods: 50
18 : Running 50... real 0m20.116s # pods: 49
19 : Running 50... real 0m20.181s # pods: 50
20 : Running 50... real 0m20.179s # pods: 49
$ PAUSE=10 ./bug
Erasing old service
Service: http://echo-default.kndev.us-south.containers.appdomain.cloud
PAUSE: 10
CC: 1
SDD: 6s
TU: 100
WIN: 6s
1 : Running 50... real  0m19.904s # pods: 50
2 : Running 50... real  0m10.185s # pods: 29
3 : Running 50... real  0m10.182s # pods: 50
4 : Running 50... real  0m20.176s # pods: 50
5 : Running 50... real  0m10.180s # pods: 50
6 : Running 50... real  0m10.168s # pods: 25
7 : Running 50... real  0m20.164s # pods: 34
8 : Running 50... real  0m19.300s # pods: 50
9 : Running 50... real  0m10.193s # pods: 50
10 : Running 50... real 0m10.180s # pods: 50
11 : Running 50... real 0m20.177s # pods: 50
12 : Running 50... real 0m10.174s # pods: 49
13 : Running 50... real 0m10.187s # pods: 26
14 : Running 50... real 0m10.167s # pods: 50
15 : Running 50... real 0m20.191s # pods: 32
16 : Running 50... real 0m18.220s # pods: 50
17 : Running 50... real 0m10.186s # pods: 50
18 : Running 50... real 0m10.174s # pods: 50
19 : Running 50... real 0m20.141s # pods: 50
20 : Running 50... real 0m10.190s # pods: 35
Service 'echo' successfully deleted in namespace 'default'.

github-actions[bot] commented 3 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

duglin commented 3 years ago

/reopen
/remove-lifecycle stale

evankanderson commented 3 years ago

@vagababov @julz

Is this still an issue? Would this be a "good first issue" in the autoscaling area?

/triage needs-user-input

(I'll also point out that this bug timed out, so if it's a major issue, we may need to reconsider our priorities. If it's not a major issue, we may want to consider allowing it to time out again.)

julz commented 3 years ago

Is this still an issue? Would this be a "good first issue" in the autoscaling area?

Unfortunately not. What sounds easy is actually a bit tricky because of how we do metric aggregation. Having said that, it's possible the new pluggable aggregation stuff @vagababov has added may make this more tractable 🤔.

evankanderson commented 3 years ago

/remove-triage needs-user-input
/triage accepted

I'm not sure that Victor is going to land anything here; is this still an issue, and what priority?

julz commented 3 years ago

I do think it's a legit issue ("we do not distinguish failed/slow scrapes from zero concurrency, and we should, or we'll potentially scale down due to network blips") that needs more work to progress than a "good first issue" should. If we had a 'this is something someone who wants something meaty could work on' tag, I'd add that to this.

evankanderson commented 3 years ago

/help

knative-prow-robot commented 3 years ago

@evankanderson: This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/knative/serving/issues/8610):

> /help

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

nader-ziada commented 2 years ago

/assign