kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

cluster-autoscaler with cloudprovider hetzner utilizes all available api.hetzner.cloud requests limits #4133

Closed · kalamba closed this issue 2 years ago

kalamba commented 3 years ago

Which component are you using?:

cluster-autoscaler (cloud provider hetzner)

What version of the component are you using?:

Component version: 1.21

What k8s version are you using (kubectl version)?: v1.20.5

kubectl version output:
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0", GitCommit:"af46c47ce925f4c4ad5cc8d1fca46c7b77d13b38", GitTreeState:"clean", BuildDate:"2020-12-08T17:59:43Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:02:01Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?: Hetzner Cloud

What did you expect to happen?:

What happened instead?: When I ran cluster-autoscaler in my k8s cluster in Hetzner Cloud, I saw that the API request limit was running out. The Hetzner Cloud API is rate-limited to 3600 requests per hour:
https://docs.hetzner.cloud/#rate-limiting

Cluster-autoscaler with default settings uses up the entire available API request limit (ratelimit-remaining: 0).

curl -I -H "Authorization: Bearer $HCLOUD_TOKEN" https://api.hetzner.cloud/v1/servers output (the 405 below is expected: -I sends a HEAD request, which this endpoint does not allow per its allow header, but the rate-limit headers are still returned):
The RateLimit-Limit header contains the total number of requests you can perform per hour.
The RateLimit-Remaining header contains the number of requests remaining in the current rate-limit time frame.

HTTP/2 405 
date: Thu, 10 Jun 2021 14:18:07 GMT
content-type: application/json
ratelimit-limit: 3600
ratelimit-remaining: 0
ratelimit-reset: 1623338287
allow: GET, POST, OPTIONS
x-correlation-id: 30642f1a-0fc2-4ff7-b2d2-33c9aba3142b
strict-transport-security: max-age=15724800; includeSubDomains
access-control-allow-origin: *
access-control-allow-credentials: true
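
The quota drain can be watched in real time with a small loop (a minimal sketch; it uses a plain GET instead of -I, since the API answers HEAD with 405 as shown above, and note the probe itself costs one request per iteration):

    # Print the rate-limit headers once a minute. -D - dumps response
    # headers to stdout while -o /dev/null discards the response body.
    while true; do
      curl -s -o /dev/null -D - \
        -H "Authorization: Bearer $HCLOUD_TOKEN" \
        https://api.hetzner.cloud/v1/servers |
        grep -i '^ratelimit'
      sleep 60
    done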

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

This happens even when I run cluster-autoscaler without any node group configured (i.e. without a flag like --nodes=1:10:CPX11:FSN1:pool1):

cluster-autoscaler.yaml:
...
      containers:
        - image: k8s.gcr.io/autoscaling/cluster-autoscaler:latest  # or your custom image
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./cluster-autoscaler
            - --cloud-provider=hetzner
            - --stderrthreshold=info
...
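
One partial mitigation (a sketch, untested here; it only stretches the hourly budget rather than reducing the number of API calls per loop) is to make the autoscaler poll less often with the --scan-interval flag, at the cost of slower reaction to pending pods:

          command:
            - ./cluster-autoscaler
            - --cloud-provider=hetzner
            - --scan-interval=60s  # default is 10s; fewer loops per hour means fewer API calls
            - --stderrthreshold=info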

tcpdump in the cluster-autoscaler container shows about 5 packets per second to the api.hetzner.cloud IP (213.239.246.1):

tcpdump from cluster-autoscaler to api.hetzner.cloud (about 5 pps):
16:03:55.031406 IP 10.0.2.18.42262 > 213.239.246.1.https: Flags [.], ack 366267, win 1891, options [nop,nop,TS val 1875645729 ecr 2797296459], length 0
16:03:55.031561 IP 10.0.2.18.42262 > 213.239.246.1.https: Flags [.], ack 367055, win 1894, options [nop,nop,TS val 1875645729 ecr 2797296459], length 0
16:03:55.032190 IP 10.0.2.18.42262 > 213.239.246.1.https: Flags [P.], seq 7355:7390, ack 367055, win 1894, options [nop,nop,TS val 1875645730 ecr 2797296459], length 35
16:03:55.032982 IP 10.0.2.18.42262 > 213.239.246.1.https: Flags [P.], seq 7390:7428, ack 367055, win 1894, options [nop,nop,TS val 1875645730 ecr 2797296459], length 38
16:03:55.265521 IP 10.0.2.18.42262 > 213.239.246.1.https: Flags [.], ack 371932, win 1903, options [nop,nop,TS val 1875645963 ecr 2797296693], length 0
16:03:55.269839 IP 10.0.2.18.42262 > 213.239.246.1.https: Flags [P.], seq 7428:7463, ack 371963, win 1903, options [nop,nop,TS val 1875645967 ecr 2797296693], length 35
16:04:05.317037 IP 10.0.2.18.42262 > 213.239.246.1.https: Flags [P.], seq 7463:7501, ack 371963, win 1903, options [nop,nop,TS val 1875656014 ecr 2797296701], length 38
16:04:05.485699 IP 10.0.2.18.42262 > 213.239.246.1.https: Flags [.], ack 372570, win 1906, options [nop,nop,TS val 1875656183 ecr 2797306913], length 0
16:04:05.487676 IP 10.0.2.18.42262 > 213.239.246.1.https: Flags [P.], seq 7501:7539, ack 372570, win 1906, options [nop,nop,TS val 1875656185 ecr 2797306913], length 38
16:04:05.884046 IP 10.0.2.18.42262 > 213.239.246.1.https: Flags [.], ack 376629, win 1914, options [nop,nop,TS val 1875656581 ecr 2797307312], length 0
16:04:05.884296 IP 10.0.2.18.42262 > 213.239.246.1.https: Flags [.], ack 378139, win 1917, options [nop,nop,TS val 1875656582 ecr 2797307312], length 0
16:04:05.884418 IP 10.0.2.18.42262 > 213.239.246.1.https: Flags [P.], seq 7539:7574, ack 378170, win 1917, options [nop,nop,TS val 1875656582 ecr 2797307312], length 35
16:04:05.891516 IP 10.0.2.18.42262 > 213.239.246.1.https: Flags [P.], seq 7574:7612, ack 378170, win 1917, options [nop,nop,TS val 1875656589 ecr 2797307315], length 38
16:04:06.581252 IP 10.0.2.18.42262 > 213.239.246.1.https: Flags [.], ack 381242, win 1923, options [nop,nop,TS val 1875657279 ecr 2797308009], length 0
16:04:06.581288 IP 10.0.2.18.42262 > 213.239.246.1.https: Flags [.], ack 383024, win 1926, options [nop,nop,TS val 1875657279 ecr 2797308009], length 0
16:04:06.581672 IP 10.0.2.18.42262 > 213.239.246.1.https: Flags [P.], seq 7612:7647, ack 383055, win 1926, options [nop,nop,TS val 1875657279 ecr 2797308009], length 35
16:04:06.582749 IP 10.0.2.18.42262 > 213.239.246.1.https: Flags [P.], seq 7647:7685, ack 383055, win 1926, options [nop,nop,TS val 1875657280 ecr 2797308009], length 38

Hetzner Cloud API IP (resolved with dig api.hetzner.cloud): 213.239.246.1
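
Back-of-the-envelope: a budget of 3600 requests per hour works out to exactly 1 request per second on average. So even if only a fraction of the ~5 packets per second above carry actual API requests, cluster-autoscaler alone can consume the entire hourly budget, leaving no headroom for other API clients such as Terraform.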

sergeyshevch commented 3 years ago

@kalamba Reproduced on my Hetzner Cloud deployment. That looks really bad, because it can prevent the cluster from scaling in some cases.

curl -I -H "Authorization: Bearer $HCLOUD_TOKEN" https://api.hetzner.cloud/v1/servers
HTTP/2 429 
date: Thu, 08 Jul 2021 18:27:17 GMT
content-type: application/json
ratelimit-limit: 3600
ratelimit-remaining: 0
ratelimit-reset: 1625772437
x-correlation-id: bbbc20a1-ac8a-4ef3-8b78-2158d7bbba44
strict-transport-security: max-age=15724800; includeSubDomains
access-control-allow-origin: *
access-control-allow-credentials: true

jawabuu commented 3 years ago

Ping @LKaemmerling @kalamba Is there a workaround for this?

sergeyshevch commented 3 years ago

@jawabuu I don't know, but in my current setup it doesn't prevent CA from working, so I still use it. That might just be my case, though, because my cluster doesn't scale very frequently.

jawabuu commented 3 years ago

@sergeyshevch CA does work, but it exhausts your available limits. So if you need to make other API calls, e.g. with Terraform, they will fail. Also, depending on the size of your cluster/nodes and how soon you need scale-up events to occur, autoscaling itself will be affected.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue or PR with `/reopen`
- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/4133#issuecomment-1037188248):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues and PRs according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue or PR with `/reopen`
> - Mark this issue or PR as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
k8s-ci-robot commented 2 years ago

@teksuo: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubernetes/autoscaler/issues/4133#issuecomment-1231384050):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.