Hi all,
I have found this error twice, when creating a Kafka and a Redis cluster with three pods and two volumes each. This error makes the deployment fail after several minutes:
Describing the failing pod (all other 5 pods worked) returns:
As I said, I found the same problem creating a Kafka cluster with six pods and six volumes. It seems that the rate limit of the Hetzner API is too low for the number of requests performed by the csi-driver. So my questions are:
- Is there a way to reduce the number of requests?
- Any way to increase the rate limit of the API?
- Any other suggested workaround will be welcomed!
Thanks!
Hey @diegoparrilla,
the API rate limit is calculated on a per-project basis. In general I do not expect the csi-driver to reach the rate limit. Did you make a lot of other requests in the same project?
To monitor the number of API requests made by the csi-driver, you can configure Prometheus (or other metrics tools) to scrape the metrics from the endpoint :9189 on all csi-driver pods. This includes metrics with the prefix hcloud_api_, which provide information about all requests made to the API. Using these metrics, you can estimate how many requests are made per time frame, and whether that is too many.
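If you just want a quick look without a full Prometheus setup, a small script can dump those series directly. A minimal sketch in Go, assuming you have port-forwarded a csi-driver pod's metrics port to localhost (e.g. with `kubectl port-forward <csi-driver-pod> 9189:9189`):

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Assumes the pod's metrics endpoint is reachable on localhost:9189.
	resp, err := http.Get("http://localhost:9189/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Print only the hcloud_api_* series, which count requests to the API.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		if line := scanner.Text(); strings.HasPrefix(line, "hcloud_api_") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```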
To answer your questions:
- Is there a way to reduce the number of requests?
No, but the scenario you described should not exceed the rate limit.
- Any way to increase the rate limit of the API?
You can contact support for this.
- Any other suggested workaround will be welcomed!
The rate limit "bucket" refills with 1 request per second, so just waiting a few minutes should be fine if you are not doing anything else in the project.
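To illustrate the bucket behaviour, here is a minimal token-bucket model in Go using golang.org/x/time/rate. The refill rate of 1/s comes from the comment above; the bucket size of 3600 is an assumption for this sketch, so check the API docs for your project's actual limit:

```go
package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

func main() {
	// Refill of 1 token/second, bucket size of 3600 (assumed).
	limiter := rate.NewLimiter(rate.Limit(1), 3600)

	// Drain the bucket as fast as possible, like a tight polling loop would.
	drained := 0
	for limiter.Allow() {
		drained++
	}
	fmt.Printf("bucket empty after %d immediate requests; now limited to 1/s\n", drained)
}
```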
Thank you for your prompt response, @apricote.
I would consider the number of requests made by the Kubernetes cluster very low: it was the creation of Kafka and Redis clusters. Strimzi Kafka Operator first, and I tried with OT Redis Operator later. Each cluster had six pods, and each pod had one or two hcloud volumes depending on the role of the pod. So it was an ordinary action on my side; I didn't create and destroy repeatedly or do any stress tests.
I guess the operators ask for the storage volumes at the creation time of each cluster. Then the pods wait for the volumes to be ready, and they poll for the status change too fast (or Hetzner creates the volumes too slowly), triggering the rate-limit countermeasure. I have also verified that by waiting more than 10 minutes, the rate-limit restriction disappears, and sometimes the cluster completes the setup, but only if it does not time out first.
I will write to support explaining the situation. Still, at least I understand the limitations of the setup and can act accordingly.
In addition to my previous recommendation, you can also set the environment variable HCLOUD_DEBUG to see the full HTTP requests made to our API, as well as the responses. The responses include RateLimit-* headers (RateLimit-Limit, RateLimit-Remaining, RateLimit-Reset) with additional information about your current rate-limiting status, such as the number of remaining requests.
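For a one-off check outside the driver, you can also hit any API endpoint directly and print those headers. A minimal sketch, assuming a valid token in the HCLOUD_TOKEN environment variable:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
)

func main() {
	// Any authenticated endpoint works; /v1/volumes is used here as an example.
	req, err := http.NewRequest("GET", "https://api.hetzner.cloud/v1/volumes", nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("HCLOUD_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// The API reports the current rate-limit state on every response.
	for _, h := range []string{"RateLimit-Limit", "RateLimit-Remaining", "RateLimit-Reset"} {
		fmt.Printf("%s: %s\n", h, resp.Header.Get(h))
	}
}
```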
I will close the issue. If you find that the csi-driver still uses too many requests, please feel free to reopen it or create a new one.
I think it makes sense that the rate limit is hit here. If the action takes some time, https://github.com/hetznercloud/hcloud-go/blob/main/hcloud/action.go#L245 combined with the default pollInterval of 500ms could hit the rate limit after 30 minutes for one volume, or 5 minutes for 6 volumes. Right?
Sounds like the issue I had creating the volumes, yes.
I ran into a similar issue and can confirm that the limit gets hit. And I think it is because of this: WatchProgress triggers a lot of calls to the API on 4 actions.
The only way I see to avoid this issue in the future would be to make use of https://docs.hetzner.cloud/#actions-get-all-actions instead of polling each action individually. This would at least require some code refactoring...
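To make the idea concrete, here is a rough sketch of what batched polling against that endpoint could look like, using plain net/http rather than hcloud-go. The `id` filter follows the linked docs; the action IDs are placeholders, and this is an illustration rather than the csi-driver's actual code:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/url"
	"os"
)

// actionsResponse models just the fields needed from GET /v1/actions.
type actionsResponse struct {
	Actions []struct {
		ID     int64  `json:"id"`
		Status string `json:"status"` // "running", "success" or "error"
	} `json:"actions"`
}

func main() {
	// Placeholder IDs of the actions we are waiting on.
	ids := []string{"1", "2", "3"}

	// One request covers every watched action instead of one request each.
	q := url.Values{}
	for _, id := range ids {
		q.Add("id", id)
	}

	req, err := http.NewRequest("GET", "https://api.hetzner.cloud/v1/actions?"+q.Encode(), nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("HCLOUD_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var body actionsResponse
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		log.Fatal(err)
	}
	for _, a := range body.Actions {
		fmt.Printf("action %d: %s\n", a.ID, a.Status)
	}
}
```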
> I think it makes sense that the rate limit is hit here. If the action takes some time, https://github.com/hetznercloud/hcloud-go/blob/main/hcloud/action.go#L245 combined with the default pollInterval of 500ms could hit the rate limit after 30 minutes for one volume, or 5 minutes for 6 volumes. Right?
This only applies if the async action takes 5 minutes; usually create/attach/detach is handled within seconds, so it should not cause that many requests.
But if for some reason the server is locked in the API, attaching the volume might take longer. I think switching to exponential backoff will help with this, but we first need to implement it in hcloud-go (see hetznercloud/hcloud-go#221).
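For illustration, here is a minimal sketch of what exponential-backoff polling could look like. checkAction is a hypothetical stand-in for the single GET /actions/{id} request, and the 10s cap is an arbitrary choice for the sketch:

```go
package main

import (
	"fmt"
	"time"
)

// checkAction is a hypothetical stand-in for one GET /actions/{id} request.
func checkAction() bool {
	return false // pretend the action is still running
}

func main() {
	interval := 500 * time.Millisecond // the current default pollInterval
	const maxInterval = 10 * time.Second

	for i := 0; i < 8; i++ { // bounded only to keep the demo finite
		if checkAction() {
			fmt.Println("action finished")
			return
		}
		fmt.Printf("still running, next poll in %v\n", interval)
		time.Sleep(interval)

		// Double the interval up to the cap: a slow action now costs far
		// fewer requests than polling every 500ms for its whole duration.
		interval *= 2
		if interval > maxInterval {
			interval = maxInterval
		}
	}
}
```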
We made some improvements around this topic:
Closing this issue, as the specific problem triggered here is not reproducible. Please open a new ticket if you encounter such issues again.