hetznercloud / csi-driver

Kubernetes Container Storage Interface driver for Hetzner Cloud Volumes

Rate limit exceeded creating 6 or more volumes #346

Closed diegoparrilla closed 1 year ago

diegoparrilla commented 1 year ago

Hi all,

I have found this error twice, when creating a Kafka and a Redis cluster with three pods and two volumes each. The error causes the deployment to fail after several minutes:

Describing the failing pod (the other 5 pods worked):

kubectl describe pod redis-cluster-follower-1 --namespace redis-operator

returns:

Events:
  Type     Reason              Age                   From                     Message
  ----     ------              ----                  ----                     -------
  Normal   Scheduled           9m28s                 default-scheduler        Successfully assigned redis-operator/redis-cluster-follower-1 to node-2-default
  Warning  FailedAttachVolume  67s (x12 over 9m28s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-f6490646-0aee-413d-af34-fc52137270f1" : rpc error: code = Internal desc = failed to publish volume: limit of 3600 requests per hour reached (rate_limit_exceeded)
  Warning  FailedMount         38s (x4 over 7m26s)   kubelet                  Unable to attach or mount volumes: unmounted volumes=[redis-cluster-follower], unattached volumes=[redis-cluster-follower kube-api-access-4tfh2]: timed out waiting for the condition

As I said, I found the same problem creating a Kafka cluster with six pods and six volumes. It seems that the rate limit of the Hetzner API is too low for the number of requests made by the csi-driver. So my questions are:

  1. Is there a way to reduce the number of requests?
  2. Any way to increase the rate limit of the API?
  3. Any other suggested workaround would be welcome!

Thanks!

apricote commented 1 year ago

Hey @diegoparrilla,

the API rate limit is calculated on a per-project basis. In general I do not expect the csi-driver to reach the rate limit. Did you make a lot of other requests in the same project?

To monitor the number of API requests made by the csi-driver, you can configure Prometheus (or another metrics tool) to scrape the metrics from the endpoint :9189 on all csi-driver pods. These include metrics with the prefix hcloud_api_, which cover all requests made to the API. Using them, you can estimate how many requests are made per time frame and whether that is too many.
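For a quick one-off check without setting up Prometheus, something along these lines also works (a minimal sketch; it assumes you have forwarded the metrics port locally first, for example with kubectl port-forward, and it simply prints every series under that prefix):

package main

import (
    "bufio"
    "fmt"
    "log"
    "net/http"
    "strings"
)

func main() {
    // Assumes the metrics port of a csi-driver pod was forwarded locally first,
    // e.g.: kubectl port-forward -n kube-system <csi-driver-pod> 9189:9189
    resp, err := http.Get("http://localhost:9189/metrics")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // Print only the hcloud_api_* series; everything else on the endpoint is skipped.
    scanner := bufio.NewScanner(resp.Body)
    for scanner.Scan() {
        if line := scanner.Text(); strings.HasPrefix(line, "hcloud_api_") {
            fmt.Println(line)
        }
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
}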

To answer your questions:

  1. Is there a way to reduce the number of requests?

No, but the scenario you described should not exceed the rate limit.

  2. Any way to increase the rate limit of the API?

You can contact support about this.

  3. Any other suggested workaround would be welcome!

The rate limit "bucket" refills with 1 request per second, so just waiting a few minutes should be fine if you are not doing anything else in the project.
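As a rough back-of-the-envelope illustration of that refill (my own sketch, using only the 3600-requests-per-hour figure from the error message):

package main

import (
    "fmt"
    "time"
)

func main() {
    // The bucket refills at 1 request per second (3600 per hour).
    const refillPerSecond = 1.0

    // Hypothetical example: how long until 300 requests of headroom are back.
    needed := 300.0
    wait := time.Duration(needed/refillPerSecond) * time.Second
    fmt.Printf("waiting ~%s restores %.0f requests of headroom\n", wait, needed)
}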

diegoparrilla commented 1 year ago

Thank you for your prompt response, @apricote.

I would consider the number of requests made by the Kubernetes cluster to be very low: it was just the creation of Kafka and Redis clusters, with the Strimzi Kafka Operator first and the OT Redis Operator later. Each cluster had six pods, and each pod had one or two hcloud volumes depending on its role. So it was an ordinary action on my side; I didn't create and destroy repeatedly or run any stress tests.

I guess the operators request the storage volumes when each cluster is created. The pods then wait for the volumes to be ready and poll for the status change too fast (or Hetzner creates the volumes too slowly), triggering the rate-limit countermeasure. I have also verified that after waiting more than 10 minutes the rate-limit restriction disappears, and sometimes the cluster completes the setup, but only if it does not time out first.

I will write to support to explain the situation. Still, at least I now understand the limitations of the setup and can act accordingly.

apricote commented 1 year ago

In addition to my previous recommendation, you can also set the environment variable HCLOUD_DEBUG to see the full HTTP requests made to our API, as well as the responses. The responses include RateLimit-* headers with additional information about your current rate-limiting status, such as the number of remaining requests.
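If you only want the current rate-limit status without reading through the full debug output, a minimal sketch like the following (it assumes a valid API token in HCLOUD_TOKEN) issues a single request and prints those headers:

package main

import (
    "fmt"
    "log"
    "net/http"
    "os"
)

func main() {
    // Assumes a valid API token for the project in HCLOUD_TOKEN.
    token := os.Getenv("HCLOUD_TOKEN")

    req, err := http.NewRequest("GET", "https://api.hetzner.cloud/v1/volumes", nil)
    if err != nil {
        log.Fatal(err)
    }
    req.Header.Set("Authorization", "Bearer "+token)

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // The response carries the current rate-limiting status in these headers.
    for _, h := range []string{"RateLimit-Limit", "RateLimit-Remaining", "RateLimit-Reset"} {
        fmt.Printf("%s: %s\n", h, resp.Header.Get(h))
    }
}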

I will close the issue. If you find that the csi-driver still uses too many requests, please feel free to reopen it or create a new one.

jonasbadstuebner commented 1 year ago

I think it makes sense that the rate limit is hit here. If the action takes some time, https://github.com/hetznercloud/hcloud-go/blob/main/hcloud/action.go#L245 combined with the default pollInterval of 500ms could hit the rate limit after 30 minutes for one volume, or after 5 minutes for 6 volumes. Right?
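A quick back-of-the-envelope check of those numbers (just my own sketch; it ignores the 1 request/second refill for simplicity):

package main

import (
    "fmt"
    "time"
)

func main() {
    const (
        bucket       = 3600.0                 // requests available in the bucket (per hour)
        pollInterval = 500 * time.Millisecond // default hcloud-go poll interval
    )

    // One watched action costs one request per poll interval.
    perActionPerSecond := 1.0 / pollInterval.Seconds()

    for _, volumes := range []float64{1, 6} {
        rate := volumes * perActionPerSecond
        exhausted := time.Duration(bucket/rate) * time.Second
        fmt.Printf("%.0f volume(s): ~%.0f requests/s, bucket empty after ~%s\n", volumes, rate, exhausted)
    }
}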

diegoparrilla commented 1 year ago

I think it makes sense that the rate limit is hit here. If the action takes some time, https://github.com/hetznercloud/hcloud-go/blob/main/hcloud/action.go#L245 combined with the default pollInterval of 500ms could hit the rate limit after 30 minutes for one volume, or after 5 minutes for 6 volumes. Right?

Sounds like the issue I had creating the volumes, yes.

jonasbadstuebner commented 1 year ago

WatchProgress triggers a lot of calls to the API, and it is used on 4 actions.

I ran into a similar issue and can confirm that the limit gets hit. I think it is because of this.
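To illustrate why that adds up, here is the general shape of such a fixed-interval watch (only a sketch, not the actual hcloud-go code; getActionStatus stands in for the real per-action API call). Every loop iteration costs one request per watched action:

package main

import (
    "context"
    "errors"
    "fmt"
    "time"
)

var polls int

// getActionStatus stands in for the real "get action by ID" API call.
// In the driver, every invocation is one request against the project limit.
func getActionStatus(ctx context.Context, actionID int64) (string, error) {
    polls++
    if polls < 5 {
        return "running", nil
    }
    return "success", nil
}

// watchAction polls the action at a fixed interval until it finishes.
// At 500ms this is ~2 requests per second for every action being watched.
func watchAction(ctx context.Context, actionID int64, pollInterval time.Duration) error {
    ticker := time.NewTicker(pollInterval)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
            status, err := getActionStatus(ctx, actionID)
            if err != nil {
                return err
            }
            switch status {
            case "success":
                return nil
            case "error":
                return errors.New("action failed")
            }
            // still "running": keep polling, keep spending requests
        }
    }
}

func main() {
    err := watchAction(context.Background(), 42, 500*time.Millisecond)
    fmt.Printf("finished after %d polls, err=%v\n", polls, err)
}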

jonasbadstuebner commented 1 year ago

The only way I see to avoid this issue in the future would be to use https://docs.hetzner.cloud/#actions-get-all-actions instead of fetching each action individually. This would at least require some code refactoring...
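Roughly the shape such a change could take (only a sketch; getActions is a hypothetical helper standing in for that endpoint). One request per poll interval then covers every pending action instead of one request per action:

package main

import (
    "fmt"
    "time"
)

// getActions is a hypothetical helper standing in for the "get all actions"
// endpoint: a single API request returns the status of every listed action.
func getActions(ids []int64) map[int64]string {
    statuses := make(map[int64]string, len(ids))
    for _, id := range ids {
        statuses[id] = "success"
    }
    return statuses
}

// watchActions waits for all actions with one request per poll interval,
// instead of one request per action per interval.
func watchActions(ids []int64, pollInterval time.Duration) {
    pending := map[int64]bool{}
    for _, id := range ids {
        pending[id] = true
    }

    for len(pending) > 0 {
        time.Sleep(pollInterval) // one request per interval, regardless of volume count

        remaining := make([]int64, 0, len(pending))
        for id := range pending {
            remaining = append(remaining, id)
        }
        for id, status := range getActions(remaining) {
            if status != "running" {
                delete(pending, id)
            }
        }
    }
}

func main() {
    watchActions([]int64{1, 2, 3, 4, 5, 6}, 500*time.Millisecond)
    fmt.Println("all 6 actions finished with a single request per poll")
}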

apricote commented 1 year ago

I think it makes sense that the rate limit is hit here. If the action takes some time, https://github.com/hetznercloud/hcloud-go/blob/main/hcloud/action.go#L245 combined with the default pollInterval of 500ms could hit the rate limit after 30 minutes for one volume, or after 5 minutes for 6 volumes. Right?

This only applies if the async action takes 5 minutes; usually create/attach/detach is handled within seconds, so it should not cause that many requests.

But if for some reason the server is locked in the API, attaching the volume might take longer. I think switching to exponential backoff will help with this, but we first need to implement it in hcloud-go (see hetznercloud/hcloud-go#221).
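As a rough illustration of the effect (a sketch only, not the implementation tracked in that issue), backing off between polls keeps a single long-running action from eating the whole budget:

package main

import (
    "fmt"
    "time"
)

// backoff returns the wait time before poll attempt n (0-based), doubling
// from a base interval up to a cap. The values here are illustrative only.
func backoff(n int) time.Duration {
    const (
        base = 500 * time.Millisecond
        max  = 30 * time.Second
    )
    d := base << n // base * 2^n
    if d > max || d <= 0 {
        return max
    }
    return d
}

func main() {
    // A 5-minute action costs a handful of requests with backoff,
    // instead of ~600 at a fixed 500ms interval.
    var elapsed time.Duration
    requests := 0
    for elapsed < 5*time.Minute {
        elapsed += backoff(requests)
        requests++
    }
    fmt.Printf("~%d requests over %s with backoff\n", requests, elapsed.Round(time.Second))
}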

apricote commented 1 year ago

We made some improvements around this topic.

Closing this issue, as the specific problem triggered here is not reproducible. Please open a new ticket if you encounter such issues again.