helm / helm

The Kubernetes Package Manager
https://helm.sh
Apache License 2.0
26.3k stars 7k forks

Cannot change timeout on API calls #9805

Open max-allan-surevine opened 3 years ago

max-allan-surevine commented 3 years ago

My organisation's OpenShift cluster has many CRDs, which (if I understand it correctly) triggers client-side throttling. When the cluster is busy, the throttling/slow responses are often bad enough that Helm operations fail. I'd like to increase the timeout on the API calls, which looks like the "--timeout" setting. However, if I set the timeout to a value lower than the typical throttle delay, the API calls still appear to use a 32s timeout, and the command doesn't fail even though requests take longer than the value I set.

helm install --timeout 10s files -f ../files.yaml  chart
I0615 14:02:04.936726   33698 request.go:668] Waited for 1.148672262s due to client-side throttling, not priority and fairness, request: GET:https://api.server:443/apis/events.k8s.io/v1?timeout=32s
I0615 14:02:14.937525   33698 request.go:668] Waited for 11.14860773s due to client-side throttling, not priority and fairness, request: GET:https://api.server:443/apis/helm.openshift.io/v1beta1?timeout=32s
NAME: files
LAST DEPLOYED: Tue Jun 15 14:02:16 2021
....etc, notes from normal install...

A failure looks the same as above, but after the last "Waited for" line I see:

Error: release files failed, and has been uninstalled due to atomic being set: timed out waiting for the condition

(I use --atomic normally now because of this problem!)

I would like to be able to increase the timeout from 32s to a higher value. I know the API server is overloaded, and I would rather have Helm wait a few more seconds for it than have to wait until 4 AM to deploy my chart when nobody else is around.

Output of helm version:

version.BuildInfo{Version:"v3.6.0", GitCommit:"7f2df6467771a75f5646b7f12afb408590ed1755", GitTreeState:"dirty", GoVersion:"go1.16.4"}

Output of kubectl version: kubectl has been removed from my system. There was a suggestion this issue was fixed in recent versions of the OpenShift client (oc):

$ oc version
Client Version: 4.7.0-202104250659.p0-95881af
Kubernetes Version: v1.20.0+7d0a2b2

Cloud Provider/Platform (AKS, GKE, Minikube etc.): Openshift

hickeyma commented 3 years ago

@max-allan-surevine Do you mind showing the command you are running with the flags?

max-allan-surevine commented 3 years ago

Oops! Yes, how did I miss that, will edit! It was on the same line as my triple quote so got swallowed by the markdown.

hickeyma commented 3 years ago

Ok, some things I noticed. You are using a timeout of 10 seconds (--timeout 10s). Do you want this to be longer? Also, can you try passing the --wait flag?

max-allan-surevine commented 3 years ago

I set the 10s timeout so that the command should time out before the 11-second wait, to highlight that it is not respecting the value I set. I would actually want it to be higher, but setting it below the 11s in the log message shows it is using neither the 5m default nor the 10s I supplied.

[master] $ helm delete files --timeout 10s --wait
Error: unknown flag: --wait
[master] $ helm delete files --timeout 10s
I0616 10:59:21.444921   41729 request.go:668] Waited for 1.176294145s due to client-side throttling, not priority and fairness, request: GET:https://api.local:443/apis/pipelines.openshift.io/v1alpha1?timeout=32s
I0616 10:59:31.446800   41729 request.go:668] Waited for 11.177602333s due to client-side throttling, not priority and fairness, request: GET:https://api.local:443/apis/monitoring.coreos.com/v1?timeout=32s
release "files" uninstalled
[master] $ helm install files --timeout 10s --wait -f ../files.yaml  chart
I0616 11:00:04.039816   41786 request.go:668] Waited for 1.167701664s due to client-side throttling, not priority and fairness, request: GET:https://api.local:443/apis/workspace.devfile.io/v1alpha1?timeout=32s
I0616 11:00:14.238909   41786 request.go:668] Waited for 11.366030019s due to client-side throttling, not priority and fairness, request: GET:https://api.local:443/apis/caching.internal.knative.dev/v1alpha1?timeout=32s
Error: timed out waiting for the condition
[master] $ helm install files --timeout 10s --wait -f ../files.yaml  chart
Error: cannot re-use a name that is still in use

The "Error: timed out" happens after about 30s: not the 5m0s default that "--timeout" has according to the docs, and not the 10s I set on the CLI. With a 10s timeout, I should never see the "Waited for 11s" message. Right?

And now I have a deployment in who knows what state. Clearly something timed out and failed, but something else completed successfully. It waited for neither 5 minutes nor 10 seconds; if it had waited 5 minutes, this error probably wouldn't happen.

Hence the title of the bug: cannot change the timeout on API calls. Whatever I set on the CLI, it always uses 32s.

[master] $ helm delete --timeout 5m0s files 
I0616 11:10:10.950751   42031 request.go:668] Waited for 1.153073128s due to client-side throttling, not priority and fairness, request: GET:https://api.local:443/apis/jenkins.io/v1alpha3?timeout=32s
I0616 11:10:21.150205   42031 request.go:668] Waited for 11.352028467s due to client-side throttling, not priority and fairness, request: GET:https://api.local:443/apis/planetscale.com/v1alpha1?timeout=32s
release "files" uninstalled

Each API call still ends with "?timeout=32s".

invidian commented 2 years ago

> Still ends each API call with "?timeout=32s"

This is the timeout for individual requests, which I'd expect client-go to retry. This timeout is also configured when creating the REST client from the kubeconfig. If --timeout 10s were applied to the request, I'd expect the request's context to be cancelled, in which case the error message you get would be different.

Also, given that the release has been uninstalled, the message seems to be only a warning, right?

This issue seems like a feature request to be able to configure this default: https://github.com/soltysh/kubernetes/blob/7bd48a7e2325381cb777d0ea1ff89b2ecece23b6/staging/src/k8s.io/client-go/discovery/discovery_client.go#L51

max-allan-surevine commented 2 years ago

From the help for install: --timeout duration  time to wait for any individual Kubernetes operation (like Jobs for hooks) (default 5m0s)

Is creating an object like a Secret or a Deployment (or whatever else it is doing) not an "individual operation"? What counts as an individual Kubernetes operation?

Going by the documentation of --timeout, this is not a feature request; at minimum it is a bug in the documentation of what the timeout actually means. But I'd prefer it if someone fixed the timeout rather than re-documented it.

Yes, it is a warning, but when the cluster or network is slow it becomes an error: "Error: timed out waiting for the condition". And if the install is slow to complete, the rollback operations can be slow too; they sometimes exceed the 32s timeout as well, and the rollback fails to complete, leaving a mess.

invidian commented 2 years ago

@max-allan-surevine good points. I think the documentation for --timeout should be clarified as well. Looking briefly at the code, it seems Timeout is only used for executing hooks unless you specify --wait? I think improving the documentation should be treated as a separate issue from the request-level timeouts I mentioned before.

github-actions[bot] commented 2 years ago

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

invidian commented 2 years ago

Not stale please

nwsparks commented 2 years ago

I'm also running into issues with this when installing large Helm charts over our VPN. Being able to set a timeout or throttle concurrent calls would be extremely helpful.

A good example is this chart which installs many sub charts: https://github.com/newrelic/helm-charts/tree/master/charts/nri-bundle

github-actions[bot] commented 2 years ago

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

invidian commented 2 years ago

This is still a problem.

gecube commented 2 years ago

The solution is as simple as 2x2: add a new command-line argument such as "--api-server-timeout" to Helm and pass its value through to the client-go library.

github-actions[bot] commented 2 years ago

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

invidian commented 2 years ago

Still relevant

github-actions[bot] commented 1 year ago

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

daro1337 commented 1 year ago

This is still a problem.

sachinms27 commented 1 year ago

Still a problem.

sachinms27 commented 1 year ago

Can someone suggest a workaround please? Retries aren't helping us, as we have a VPN between our on-prem network and the cloud VNet which can stay choked for many hours.

joejulian commented 1 year ago

Maybe run Helm from a pod or VM that doesn't cross the VPN?

github-actions[bot] commented 1 year ago

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

joejulian commented 1 year ago

Since it's been a while since my suggestion and there's been no further conversation about this, I'm going to go ahead and close it.

varunpalekar commented 1 year ago

We are still facing this problem on clusters with 100+ CRDs.

alakdae commented 10 months ago

Same here: random timeouts. I would love an option to change the API call timeout.

AndresPinerosZen commented 6 months ago

Please support this.

L1ghtman2k commented 3 months ago

@joejulian, could we reopen this? We are running on MicroK8s directly against the host, and the /openapi/v3 endpoints can take more than 30 seconds to return the schema when there is a large number of CRDs on the cluster.

I don't think https://github.com/hashicorp/terraform-provider-helm/issues/1156 can be addressed until this is.

joejulian commented 3 months ago

Sure, done. 🙌

github-actions[bot] commented 2 weeks ago

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

liwoove commented 10 hours ago

Hi, this is a requested feature within our organization as well. Could someone take a look at the discussion above?

Thank you.