helm / helm

The Kubernetes Package Manager
https://helm.sh
Apache License 2.0

Errors on "helm list" AND "helm install" #7997

Open avolution opened 4 years ago

avolution commented 4 years ago

Output of helm version: version.BuildInfo{Version:"v3.1.2", GitCommit:"d878d4d45863e42fd5cff6743294a11d28a9abce", GitTreeState:"clean", GoVersion:"go1.13.8"}

Output of kubectl version: Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.2"} Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-gke.9"}

Cloud Provider/Platform (AKS, GKE, Minikube etc.): GKE


On helm list I get:

Error: list: failed to list: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 3; INTERNAL_ERROR

On helm install of a chart I get:

request.go:924] Unexpected error when reading response body: net/http: request canceled (Client.Timeout exceeded while reading body) Error: unable to build kubernetes objects from release manifest: unexpected error when reading response body. Please retry. Original error: net/http: request canceled (Client.Timeout exceeded while reading body

"Helm delete" is working. Was able to uninstall a release


Additional notes:

omkensey commented 4 years ago

Does kubectl work? Can you create/update/delete resources that way?

avolution commented 4 years ago

"Does kubectl work? Can you create/update/delete resources that way?"

Yes, this is working

avolution commented 4 years ago

Here is the output of helm --debug with more details

Error: list: failed to list: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 3; INTERNAL_ERROR
helm.go:84: [debug] stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 3; INTERNAL_ERROR
list: failed to list
helm.sh/helm/v3/pkg/storage/driver.(*Secrets).List
    /private/tmp/helm--615sa8/src/helm.sh/helm/pkg/storage/driver/secrets.go:87
helm.sh/helm/v3/pkg/action.(*List).Run
    /private/tmp/helm-/src/helm.sh/helm/pkg/action/list.go:154
main.newListCmd.func1
    /private/tmp/helm-/src/helm.sh/helm/cmd/helm/list.go:80
github.com/spf13/cobra.(*Command).execute
    /private/tmp/helm-/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:842
github.com/spf13/cobra.(*Command).ExecuteC
    /private/tmp/helm--/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:950
github.com/spf13/cobra.(*Command).Execute
    /private/tmp/helm--/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:887
main.main
    /private/tmp/helm-/src/helm.sh/helm/cmd/helm/helm.go:83
runtime.main
    /usr/local/Cellar/go@1.13/1.13.10_1/libexec/src/runtime/proc.go:203
runtime.goexit
    /usr/local/Cellar/go@1.13/1.13.10_1/libexec/src/runtime/asm_amd64.s:1357

avolution commented 4 years ago

The network traffic for that process is around 30 KB per second until it fails.

chalitha1989 commented 4 years ago

Are you using the default namespace or a separate one for the Helm deployment? I faced the same issue with the same error log. When I noticed that Helm commands worked with other namespaces, I recreated the namespace, and that resolved the connectivity issue. Unfortunately, I did not dig further to identify the root cause.

technosophos commented 4 years ago

I recently saw this behavior on a Kubernetes cluster where one of the Kubernetes API server proxies was misbehaving, causing Helm to wait for long periods of time before finally giving up with a network error. I could replicate it using kubectl commands that required longer connection times.

Other tests you can try:

There is a very high probability that the problem here has to do with either a proxy or the Kubernetes API server itself.
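
As a rough sketch of one such test (the context and namespace names are placeholders, not from this thread), you can time the same Secret LIST that Helm issues and compare it with a trivial API round trip:

$ # Time the Secret LIST Helm performs for a release listing; -v=6 prints request URLs and latencies
$ time kubectl --context my-context -n my-namespace get secrets -l owner=helm -v=6 > /dev/null
$ # Compare with a cheap request to the same API server
$ time kubectl --context my-context get --raw /healthz

If the first command approaches 60 seconds while the second returns quickly, the bottleneck is the large LIST rather than general connectivity.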

avolution commented 4 years ago

@technosophos how did you fix that?

Yes, I also think it is a problem with the Kubernetes server setup. I use Google Kubernetes Engine, so I don't have deep insight into the logs of my cluster (or do I?). Is there a way to monitor that in GCP, or a way to "reset" the cluster API/proxy settings?

I also tried to install something under another namespace and got the same error.

github-actions[bot] commented 4 years ago

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

shay-berman commented 4 years ago

I have a similar issue with GKE and Helm:

> helm ls
Error: list: failed to list: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 3; INTERNAL_ERROR

any updates?

technosophos commented 4 years ago

We have no updates on our side, as we don't believe this is a Helm error so much as a Kubernetes control plane error. Right now, all of the complaints that I know of are specific to GCP. You might have better luck asking someone there about the issue.

To my knowledge EKS has never had this problem. I experienced it on AKS a year ago, and it has since been fixed. I know of no cases involving on-prem versions of Kubernetes.

So at this point, we believe the error to be specific to GKE's internal control plane implementation.

mortent commented 4 years ago

@shay-berman Can you share the output of helm version, kubectl version and which OS you are running? I'm part of the Kubernetes team at Google and want to see if we can figure out what is going on here.

lavalamp commented 4 years ago

LIST operations that take more than 60 seconds hit the global timeout and are terminated by the server. The error message combined with the "network traffic [being] 30Kb per second until it fails" makes me suspect that is what is happening, with the likely cause being a slow internet connection between the user and the control plane. A prior commenter suggested running the command from a pod in the cluster; I would try that.
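
A minimal sketch of that in-cluster test (the image tag is an assumption, and the pod's service account must be allowed to read the release Secrets in the target namespace):

$ kubectl -n my-namespace run helm-debug --rm -it --restart=Never \
    --image=alpine/helm:3.3.4 -- list -n my-namespace

If the same command that times out from your workstation succeeds from inside the cluster, the slow link between the client and the control plane is the likely culprit.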

jpbochi commented 4 years ago

Same thing here. We need at least a workaround. What's the best known one?

🜚 helm --debug list -A
Error: list: failed to list: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 3; INTERNAL_ERROR
helm.go:94: [debug] stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 3; INTERNAL_ERROR
list: failed to list
helm.sh/helm/v3/pkg/storage/driver.(*Secrets).List
    /private/tmp/helm-20200923-64956-rldbbk/pkg/storage/driver/secrets.go:87
helm.sh/helm/v3/pkg/action.(*List).Run
    /private/tmp/helm-20200923-64956-rldbbk/pkg/action/list.go:154
main.newListCmd.func1
    /private/tmp/helm-20200923-64956-rldbbk/cmd/helm/list.go:79
github.com/spf13/cobra.(*Command).execute
    /Users/brew/Library/Caches/Homebrew/go_mod_cache/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:842
github.com/spf13/cobra.(*Command).ExecuteC
    /Users/brew/Library/Caches/Homebrew/go_mod_cache/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:950
github.com/spf13/cobra.(*Command).Execute
    /Users/brew/Library/Caches/Homebrew/go_mod_cache/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:887
main.main
    /private/tmp/helm-20200923-64956-rldbbk/cmd/helm/helm.go:93
runtime.main
    /usr/local/Cellar/go/1.15.2/libexec/src/runtime/proc.go:204
runtime.goexit
    /usr/local/Cellar/go/1.15.2/libexec/src/runtime/asm_amd64.s:1374
🜚 helm version
version.BuildInfo{Version:"v3.3.4", GitCommit:"a61ce5633af99708171414353ed49547cf05013d", GitTreeState:"dirty", GoVersion:"go1.15.2"}
🜚 kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T18:49:28Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.9-gke.801", GitCommit:"3a26ac58e2a1ce0170c304c4134149ce3526eb8a", GitTreeState:"clean", BuildDate:"2020-09-28T17:32:58Z", GoVersion:"go1.13.15b4", Compiler:"gc", Platform:"linux/amd64"}

For the record, my cluster is on GCP.

gmoshiko commented 4 years ago

> We have no updates on our side, as we don't believe this is a Helm error so much as a Kubernetes control plane error. Right now, all of the complaints that I know of are specific to GCP. You might have better luck asking someone there about the issue.
>
> To my knowledge EKS has never had this problem. I experienced it on AKS a year ago, and it has since been fixed. I know of no cases involving on-prem versions of Kubernetes.
>
> So at this point, we believe the error to be specific to GKE's internal control plane implementation.

Hey, I can confirm it happens to me on EKS, on multiple clusters, with Helm 3.3.0. When it happens, it just starts working again after a few minutes without my doing anything; it has happened a few times in the last month. I must say it didn't look like a Helm or EKS problem to me, but rather a WSL (Windows Subsystem for Linux) or VPN problem. I didn't try to debug it because it goes away after a few minutes, but I will try to investigate the next time it happens.

dmitriy-lukyanchikov commented 3 years ago

The quick workaround is to delete previous release versions. I had the same issue with the prometheus-stack chart. Helm 3 saves data about releases in Secrets, so I listed all secrets:

kubectl get secrets --all-namespaces

found sh.helm.release.v1.kube-prometheus-stack.v7,

and deleted all previous versions:

kubectl delete secrets -n monitoring sh.helm.release.v1.kube-prometheus-stack.v1 ...

and helm ls started to work again.
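
A sketch of the same cleanup that keeps only the newest revision of a release; the namespace and release names are placeholders, and the release Secrets are assumed to carry Helm's usual owner/name labels, so review the list before deleting anything:

$ NAMESPACE=monitoring
$ RELEASE=kube-prometheus-stack
$ kubectl get secrets -n "$NAMESPACE" -l "owner=helm,name=$RELEASE" \
    --sort-by=.metadata.creationTimestamp -o name \
    | head -n -1 | xargs -r kubectl delete -n "$NAMESPACE"   # GNU head/xargs; keeps the newest Secret, deletes the older ones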

astlock commented 3 years ago

I have a similar problem, but helm ls works, as does k get secret --all-namespaces:

time helm install squid ./ --values values.yaml --timeout 15m0s --wait --v 6 --debug
install.go:172: [debug] Original chart version: ""
install.go:189: [debug] CHART PATH: /LOCAL_PATH_TO_CHART/squid

Error: create: failed to create: context deadline exceeded
helm.go:81: [debug] context deadline exceeded
create: failed to create
helm.sh/helm/v3/pkg/storage/driver.(*Secrets).Create
    /private/tmp/helm-20201111-97167-dwh5s1/pkg/storage/driver/secrets.go:164
helm.sh/helm/v3/pkg/storage.(*Storage).Create
    /private/tmp/helm-20201111-97167-dwh5s1/pkg/storage/storage.go:66
helm.sh/helm/v3/pkg/action.(*Install).Run
    /private/tmp/helm-20201111-97167-dwh5s1/pkg/action/install.go:320
main.runInstall
    /private/tmp/helm-20201111-97167-dwh5s1/cmd/helm/install.go:241
main.newInstallCmd.func2
    /private/tmp/helm-20201111-97167-dwh5s1/cmd/helm/install.go:120
github.com/spf13/cobra.(*Command).execute
    /Users/brew/Library/Caches/Homebrew/go_mod_cache/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:842
github.com/spf13/cobra.(*Command).ExecuteC
    /Users/brew/Library/Caches/Homebrew/go_mod_cache/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:950
github.com/spf13/cobra.(*Command).Execute
    /Users/brew/Library/Caches/Homebrew/go_mod_cache/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:887
main.main
    /private/tmp/helm-20201111-97167-dwh5s1/cmd/helm/helm.go:80
runtime.main
    /usr/local/Cellar/go/1.15.4/libexec/src/runtime/proc.go:204
runtime.goexit
    /usr/local/Cellar/go/1.15.4/libexec/src/runtime/asm_amd64.s:1374

real    1m6.193s
user    0m0.777s
sys 0m0.897s
time helm ls
NAME    NAMESPACE   REVISION    UPDATED STATUS  CHART   APP VERSION

real    0m5.349s
user    0m0.058s
sys 0m0.015s
time k get secret --all-namespaces -l "owner=helm" | wc -l
      51

real    0m5.053s
user    0m0.130s
sys 0m0.051s

Update: decreasing the number of secrets doesn't help.

k get secret --all-namespaces -l "owner=helm" | wc -l
      24
helm uninstall squid ; helm install squid ./ --values values.yaml --timeout 15m0s --wait --v 6 --debug
Error: uninstall: Release not loaded: squid: release: not found

Error: create: failed to create: context deadline exceeded
helm.go:81: [debug] context deadline exceeded
create: failed to create
helm.sh/helm/v3/pkg/storage/driver.(*Secrets).Create
    /private/tmp/helm-20201111-97167-dwh5s1/pkg/storage/driver/secrets.go:164
helm.sh/helm/v3/pkg/storage.(*Storage).Create
    /private/tmp/helm-20201111-97167-dwh5s1/pkg/storage/storage.go:66
helm.sh/helm/v3/pkg/action.(*Install).Run
    /private/tmp/helm-20201111-97167-dwh5s1/pkg/action/install.go:320
main.runInstall
    /private/tmp/helm-20201111-97167-dwh5s1/cmd/helm/install.go:241
main.newInstallCmd.func2
    /private/tmp/helm-20201111-97167-dwh5s1/cmd/helm/install.go:120
github.com/spf13/cobra.(*Command).execute
    /Users/brew/Library/Caches/Homebrew/go_mod_cache/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:842
github.com/spf13/cobra.(*Command).ExecuteC
    /Users/brew/Library/Caches/Homebrew/go_mod_cache/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:950
github.com/spf13/cobra.(*Command).Execute
    /Users/brew/Library/Caches/Homebrew/go_mod_cache/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:887
main.main
    /private/tmp/helm-20201111-97167-dwh5s1/cmd/helm/helm.go:80
runtime.main
    /usr/local/Cellar/go/1.15.4/libexec/src/runtime/proc.go:204
runtime.goexit
    /usr/local/Cellar/go/1.15.4/libexec/src/runtime/asm_amd64.s:1374

lavalamp commented 3 years ago

Depending on the encryption at rest implementation, listing all secrets could be a very expensive operation. It could be optimized, but no one has done so yet.

Also note that label selector queries don't make the request less resource-intensive: the server still has to read (and decrypt!) every secret to see whether it matches.

So for now, I strongly recommend confining secret lists to single namespaces.
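
In Helm terms that simply means preferring a namespaced listing over a cluster-wide one (the namespace name is a placeholder):

$ # Scopes the underlying Secret LIST to a single namespace
$ helm list -n my-namespace
$ # ...instead of listing release Secrets across every namespace
$ helm list --all-namespaces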

krish7919 commented 3 years ago

I can confirm I see this on AWS EKS with roughly 50 releases in a namespace, using the command:

$ helm3 --kube-context XXX --namespace XXX list -a
Error: list: failed to list: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 3; INTERNAL_ERROR

$ helm3 version
version.BuildInfo{Version:"v3.3.4", GitCommit:"a61ce5633af99708171414353ed49547cf05013d", GitTreeState:"clean", GoVersion:"go1.14.9"}

EDIT:

I can confirm that this is a kube-apiserver issue; here are traces from the apiserver logs in my cluster:

E0309 20:59:28.389140       1 wrap.go:32] apiserver panic'd on GET /api/v1/namespaces/XXX/secrets?labelSelector=owner%3Dhelm

I0309 20:59:28.389202       1 log.go:172] http2: panic serving 173.38.220.51:22316: killing connection/stream because serving request timed out and response had been started
goroutine 15138551 [running]:
k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*serverConn).runHandler.func1(0xc014f272d0, 0xc01f15bfaf, 0xc06f5a8780)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/server.go:2118 +0x16b
panic(0x3bee6c0, 0xc000391590)
    /usr/local/go/src/runtime/panic.go:522 +0x1b5
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0xc01f15bce0, 0x1, 0x1)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x105
panic(0x3bee6c0, 0xc000391590)
    /usr/local/go/src/runtime/panic.go:522 +0x1b5
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).timeout(0xc038f573e0, 0xc07c576780)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:256 +0x1a9
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP(0xc035fa4840, 0x76c6500, 0xc03bbf7340, 0xc0239ab400)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:140 +0x2d7
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.WithWaitGroup.func1(0x76c6500, 0xc03bbf7340, 0xc0239ab300)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/waitgroup.go:47 +0xf3
net/http.HandlerFunc.ServeHTTP(0xc035f71950, 0x76c6500, 0xc03bbf7340, 0xc0239ab300)
    /usr/local/go/src/net/http/server.go:1995 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithRequestInfo.func1(0x76c6500, 0xc03bbf7340, 0xc09b33b200)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/requestinfo.go:39 +0x2b8
net/http.HandlerFunc.ServeHTTP(0xc035f71980, 0x76c6500, 0xc03bbf7340, 0xc09b33b200)
    /usr/local/go/src/net/http/server.go:1995 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.withPanicRecovery.func1(0x76c6500, 0xc03bbf7340, 0xc09b33b200)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/wrap.go:46 +0x127
net/http.HandlerFunc.ServeHTTP(0xc035fa4860, 0x76badc0, 0xc014f272d0, 0xc09b33b200)
    /usr/local/go/src/net/http/server.go:1995 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server.(*APIServerHandler).ServeHTTP(0xc035f719b0, 0x76badc0, 0xc014f272d0, 0xc09b33b200)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/handler.go:189 +0x51
net/http.serverHandler.ServeHTTP(0xc00285af70, 0x76badc0, 0xc014f272d0, 0xc09b33b200)
    /usr/local/go/src/net/http/server.go:2774 +0xa8
net/http.initNPNRequest.ServeHTTP(0xc01bf80380, 0xc00285af70, 0x76badc0, 0xc014f272d0, 0xc09b33b200)
    /usr/local/go/src/net/http/server.go:3323 +0x8d
k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*serverConn).runHandler(0xc06f5a8780, 0xc014f272d0, 0xc09b33b200, 0xc07bf68000)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/server.go:2125 +0x89
created by k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*serverConn).processHeaders
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/server.go:1859 +0x4f4
krish7919 commented 3 years ago

As per the comment in https://github.com/jetstack/cert-manager/issues/3229#issuecomment-772600164, this seems to be due to a suboptimal query being run by Helm. Is there any way we can optimize this?

We just upgraded from Helm 2 to Helm 3 and hit this within 3 days of the upgrade. :/

elghazal-a commented 3 years ago

> We have no updates on our side, as we don't believe this is a Helm error so much as a Kubernetes control plane error. Right now, all of the complaints that I know of are specific to GCP. You might have better luck asking someone there about the issue.
>
> To my knowledge EKS has never had this problem. I experienced it on AKS a year ago, and it has since been fixed. I know of no cases involving on-prem versions of Kubernetes.
>
> So at this point, we believe the error to be specific to GKE's internal control plane implementation.

This is something we experience in some EKS clusters. We use the Helm Terraform provider, and we see these kinds of issues especially as the resource state grows. The error looks like:

Error: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 11; INTERNAL_ERROR

As a workaround, we moved from Secrets to ConfigMaps as the storage backend. We still experience the issue, but less often.
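
For anyone trying the same workaround: the storage backend is selected with the HELM_DRIVER environment variable (a sketch; the namespace is a placeholder). Keep in mind that release records stored in ConfigMaps are readable by anyone who can read ConfigMaps in that namespace:

$ # Store and read release records from ConfigMaps instead of Secrets for this invocation
$ HELM_DRIVER=configmap helm list -n my-namespace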

cobb-tx commented 3 years ago

I traced the problem to kube-apiserver. It may be that the apiserver was unable to write a fallback JSON response: http: Handler timeout.

I0323 11:54:40.268733 1 trace.go:116] Trace[1575763432]: "List etcd3" key:/secrets/testing00,resourceVersion:,limit:0,continue: (started: 2021-03-23 11:54:39.130390916 +0800 CST m=+2313349.748756297) (total time: 1.138303823s):
Trace[1575763432]: [1.138303823s] [1.138303823s] END
E0323 11:55:39.132838 1 writers.go:118] apiserver was unable to write a fallback JSON response: http: Handler timeout
I0323 11:55:39.133987 1 trace.go:116] Trace[1802854454]: "List" url:/api/v1/namespaces/testing00/secrets (started: 2021-03-23 11:54:39.130320896 +0800 CST m=+2313349.748686231) (total time: 1m0.00363728s):
Trace[1802854454]: [1.138447997s] [1.13838491s] Listing from storage done
Trace[1802854454]: [1m0.003636268s] [58.865188271s] Writing http response done count:1154

mridu23 commented 3 years ago

We are encountering a similar behavior where helm hangs and no helm updates work either.

Error: unable to build kubernetes objects from release manifest: unexpected error when reading response body. Please retry. Original error: net/http: request canceled (Client.Timeout exceeded while reading body

--- observed with 4 EKS clusters today ----

E0323 09:07:32.801092 1 runtime.go:78] Observed a panic: &errors.errorString{s:"killing connection/stream because serving request timed out and response had been started"} (killing connection/stream because serving request timed out and response had been started)
goroutine 2585666292 [running]:
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x3cbaa60, 0xc000408b70)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0xc014e2dc90, 0x1, 0x1)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
panic(0x3cbaa60, 0xc000408b70)
    /usr/local/go/src/runtime/panic.go:679 +0x1b2
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).timeout(0xc0482a2fa0, 0xc07f874aa0)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:257 +0x1cf
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP(0xc00d481b40, 0x5230bc0, 0xc059dcf9d0, 0xc03ee51900)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:141 +0x310
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.WithWaitGroup.func1(0x5230bc0, 0xc059dcf9d0, 0xc03ee51800)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/waitgroup.go:59 +0x121
net/http.HandlerFunc.ServeHTTP(0xc00d021ce0, 0x5230bc0, 0xc059dcf9d0, 0xc03ee51800)
    /usr/local/go/src/net/http/server.go:2036 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithRequestInfo.func1(0x5230bc0, 0xc059dcf9d0, 0xc03ee51700)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/requestinfo.go:39 +0x274
net/http.HandlerFunc.ServeHTTP(0xc00d021d70, 0x5230bc0, 0xc059dcf9d0, 0xc03ee51700)
    /usr/local/go/src/net/http/server.go:2036 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithCacheControl.func1(0x5230bc0, 0xc059dcf9d0, 0xc03ee51700)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/cachecontrol.go:31 +0xa8
net/http.HandlerFunc.ServeHTTP(0xc00d481b60, 0x5230bc0, 0xc059dcf9d0, 0xc03ee51700)
    /usr/local/go/src/net/http/server.go:2036 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.WithLogging.func1(0x52239c0, 0xc03b8afaf8, 0xc03cb65500)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:89 +0x2ca
net/http.HandlerFunc.ServeHTTP(0xc00d481b80, 0x52239c0, 0xc03b8afaf8, 0xc03cb65500)
    /usr/local/go/src/net/http/server.go:2036 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.withPanicRecovery.func1(0x52239c0, 0xc03b8afaf8, 0xc03cb65500)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/wrap.go:51 +0x13e
net/http.HandlerFunc.ServeHTTP(0xc00d481ba0, 0x52239c0, 0xc03b8afaf8, 0xc03cb65500)
    /usr/local/go/src/net/http/server.go:2036 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server.(*APIServerHandler).ServeHTTP(0xc00d021e00, 0x52239c0, 0xc03b8afaf8, 0xc03cb65500)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/handler.go:189 +0x51
net/http.serverHandler.ServeHTTP(0xc00d31ac40, 0x52239c0, 0xc03b8afaf8, 0xc03cb65500)
    /usr/local/go/src/net/http/server.go:2831 +0xa4
net/http.initNPNRequest.ServeHTTP(0x523e0c0, 0xc060ba6b70, 0xc01bd24000, 0xc00d31ac40, 0x52239c0, 0xc03b8afaf8, 0xc03cb65500)
    /usr/local/go/src/net/http/server.go:3395 +0x8d
k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*serverConn).runHandler(0xc0205eb980, 0xc03b8afaf8, 0xc03cb65500, 0xc04d0b29e0)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/server.go:2149 +0x9f
created by k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*serverConn).processHeaders
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/server.go:1883 +0x4eb

lavalamp commented 3 years ago

> "killing connection/stream because serving request timed out and response had been started"

This is proof that you're hitting the 60s global timeout I mentioned previously.

The client needs a faster network connection, or you have to list less data. There are no other solutions that don't boil down to making it faster or doing less work.

oakad commented 3 years ago

Most AWS-to-AWS traffic goes over 10 GbE these days. How much faster do you expect the connection to be? :-)

technosophos commented 3 years ago

We have also seen this happen when an intermediate proxy (often in the cloud provider's control plane) is timing out. And this isn't always because of network speed, but because of changing network topology, security configurations, and a host of other reasons. The trick is to find out what on the network is causing the timeouts. In some cases, you may need to file issues with your upstream cloud provider. Testing with kubectl and curl is also a good way of trying to find the culprit.

oakad commented 3 years ago

In our experience, Helm often works better on a reasonable network (average throughput/latency) than on a fancy cloud-side network. This makes me think that we are not dealing with some timeout here, but with a genuine bug in the Helm implementation.

Or, possibly, a bug on the Kubernetes API endpoint side.

arthur-c commented 3 years ago

For our EKS setup, helm list can't handle more than ~3000 versions/secrets.

Cleaning up old versions/secrets solved the issue (we had ~13,000). kubectl get secrets took only 10 seconds to list more than 13,000 secrets, so I believe the issue is on the Helm side.

technosophos commented 3 years ago

Really, large Helm installations should use the database backend. Kubernetes/Etcd servers are not capable of delivering high numbers of release records in a short amount of time.
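
Helm 3 selects that backend through environment variables; a minimal sketch, with the PostgreSQL endpoint and credentials as placeholders:

$ export HELM_DRIVER=sql
$ export HELM_DRIVER_SQL_CONNECTION_STRING="postgresql://helm-postgres:5432/helm?user=helm&password=changeme"
$ helm list -n my-namespace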

chrisedrego commented 3 years ago

What helped me fix the issue was deleting the secrets that Helm creates, using a simple script:

NAMESPACE=monitor
kubectl get secrets -n $NAMESPACE --no-headers=true | awk '/sh.helm.release/{print $1}' | xargs kubectl delete -n $NAMESPACE secrets

rramadoss4 commented 3 years ago

We faced the same issue in EKS with cert-manager. We tried to clean up the cluster by deleting some old Helm releases, which also removed their old secrets. Ref: https://github.com/jetstack/cert-manager/issues/3229

mxchist commented 3 years ago

If your Ansible playbook has the option helm_kubectl_context_is_admin, try changing it from true to false.

AwateAkshay commented 3 years ago

Check your internet connection. Restarting my Wi-Fi helped me out.

UmairHassanKhan commented 3 years ago

I am facing this issue when installing the Prometheus Helm chart. Has anyone resolved it?

AwateAkshay commented 3 years ago

Please paste the logs, @UmairHassanKhan.

project-administrator commented 3 years ago

@UmairHassanKhan Yes, just set "--history-max" to some low value if you're using the Helm CLI, or "max_history" if you're using the Terraform Helm provider. Setting this value to 10 helped me. You can also delete old history records manually with kubectl.
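
For example, with the CLI (the release and chart names are placeholders):

$ # Keeps at most 10 revision Secrets for this release; older revisions are pruned on upgrade
$ helm upgrade my-release ./my-chart --history-max 10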

sjthespian commented 3 years ago

I'm seeing this as well; after running helm upgrade more than 20 times, it eventually succeeded. In my case there isn't anything I can do about the network speed, since it's over a satellite link. Is there any hope of getting a timeout option added for situations with a large number of secrets/versions or slow network links? It doesn't look like the existing --timeout option covers that case.

gosoon commented 3 years ago

> "killing connection/stream because serving request timed out and response had been started"
>
> This is proof that you're hitting the 60s global timeout I mentioned previously.
>
> The client needs a faster network connection, or you have to list less data. There are no other solutions that don't boil down to making it faster or doing less work.

Where is the 60s global timeout set? Is there any way to change it?

pacoxu commented 3 years ago

kube-apiserver has a --request-timeout flag. Not sure if that is what @lavalamp means:

--request-timeout duration     Default: 1m0s
    An optional field indicating the duration a handler must keep a request open before timing it out. This is the default request timeout for requests but may be overridden by flags such as --min-request-timeout for specific types of requests.

lavalamp commented 3 years ago

Yes, that's the global timeout. I didn't mention changing it as an option because the vast majority of users don't have access to change flags on their cluster's kube-apiserver.

github-actions[bot] commented 2 years ago

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

jainpratik163 commented 2 years ago

Any solution to this issue? We are also facing it.

AllenShen commented 2 years ago

Facing the same issue in multiple clusters.

promzeus commented 2 years ago

Increase the timeout from --request-timeout=1m0s to 2m0s and everything will work: /etc/kubernetes/manifests/kube-apiserver.yaml
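
On a self-managed control plane the flag lives in the static pod manifest mentioned above, and the kubelet restarts the apiserver when the manifest changes; this does not apply to managed offerings such as GKE or EKS, where the apiserver flags are not editable. A sketch:

$ sudo grep request-timeout /etc/kubernetes/manifests/kube-apiserver.yaml
    - --request-timeout=1m0s
$ # Edit that value (for example to 2m0s) and wait for the apiserver pod to restart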

joejulian commented 2 years ago

This is due to how the api-server uses etcd. When there are a large number of any resource, secrets in this case, the api-server query retrieves ALL secrets from etcd. When you have more than a few hundred, this can cause some pretty severe latency. We were able to completely bring down a control plane with only 200 secrets. This isn't directly a helm problem, though it was exacerbated when the history-max was unlimited.

Now history-max defaults to 10, which should be sufficient in most cases to prevent this specific issue. If you're experiencing this issue and haven't upgraded since the default was changed, either delete your old release secrets or do your upgrades with a Helm version of v3.4.0 or later, where that default was changed.
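
A quick way to see how many release Secrets have accumulated per namespace before cleaning up (a sketch; the first output column is the namespace):

$ kubectl get secrets -A -l owner=helm --no-headers | awk '{print $1}' | sort | uniq -c | sort -rn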

krish7919 commented 2 years ago

> This is due to how the api-server uses etcd.

I am skeptical of this statement, as kubectl get secrets and kubectl describe secrets work for 2000+ Secrets in my cluster, but a helm list fails.

lavalamp commented 2 years ago

The apiserver offers pagination; you do not have to read all the objects in a single request, which is very hard on the system. We are likely to incentivize the use of pagination more over time by limiting unpaginated requests in various ways.
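
kubectl exposes the same mechanism through its --chunk-size flag (default 500), which is a quick way to confirm that paginated LISTs succeed where one huge LIST times out; the patch in the next comment takes the same approach inside Helm's Secrets driver:

$ kubectl get secrets -n my-namespace -l owner=helm --chunk-size=300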

krish7919 commented 2 years ago

I have spent about 4 hours so far fixing this issue. Here are the details:

Default Helm

$ helm version
version.BuildInfo{Version:"v3.3.4", GitCommit:"a61ce5633af99708171414353ed49547cf05013d", GitTreeState:"clean", GoVersion:"go1.14.9"}

Failure with the system default Helm

$ helm --kube-context ctx list --all --deployed --failed --date -n ns --max 1000
Error: list: failed to list: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 3; INTERNAL_ERROR

Code changes

$ gd
diff --git a/pkg/storage/driver/secrets.go b/pkg/storage/driver/secrets.go
index 2e8530d0..f3694cfc 100644
--- a/pkg/storage/driver/secrets.go
+++ b/pkg/storage/driver/secrets.go
@@ -35,8 +35,12 @@ import (

 var _ Driver = (*Secrets)(nil)

-// SecretsDriverName is the string name of the driver.
-const SecretsDriverName = "Secret"
+const (
+       // SecretsDriverName is the string name of the driver.
+       SecretsDriverName = "Secret"
+       // ListPaginationLimit is the number of Secrets we fetch in a single API call.
+       ListPaginationLimit = int64(300)
+)

 // Secrets is a wrapper around an implementation of a kubernetes
 // SecretsInterface.
@@ -78,15 +82,36 @@ func (secrets *Secrets) Get(key string) (*rspb.Release, error) {
 // List fetches all releases and returns the list releases such
 // that filter(release) == true. An error is returned if the
 // secret fails to retrieve the releases.
+// We read `ListPaginationLimit` Secrets at a time so as not to overwhelm the
+// `api-server` in a cluster with many releases; fixes
+// https://github.com/helm/helm/issues/7997
 func (secrets *Secrets) List(filter func(*rspb.Release) bool) ([]*rspb.Release, error) {
        lsel := kblabels.Set{"owner": "helm"}.AsSelector()
-       opts := metav1.ListOptions{LabelSelector: lsel.String()}
+       opts := metav1.ListOptions{LabelSelector: lsel.String(), Limit: ListPaginationLimit}

+       // Perform an initial list
        list, err := secrets.impl.List(context.Background(), opts)
        if err != nil {
                return nil, errors.Wrap(err, "list: failed to list")
        }

+       // Fetch more results from the server by making recursive paginated calls
+       isContinue := list.Continue
+       for isContinue != "" {
+               secrets.Log("list: fetched %d secrets, more to fetch..\n", ListPaginationLimit)
+               opts = metav1.ListOptions{LabelSelector: lsel.String(), Limit: ListPaginationLimit, Continue: isContinue}
+               batch, err := secrets.impl.List(context.Background(), opts)
+               if err != nil {
+                       return nil, errors.Wrap(err, "list: failed to perform paginated listing")
+               }
+
+               // Append the results to the initial list
+               list.Items = append(list.Items, batch.Items...)
+
+               isContinue = batch.Continue
+       }
+       secrets.Log("list: fetched %d releases\n", len(list.Items))
+
        var results []*rspb.Release

        // iterate over the secrets object list

Build custom Helm

$ make && stat bin/helm

Custom Helm with fix for listing

$ ./bin/helm version
version.BuildInfo{Version:"v3.8+unreleased", GitCommit:"65d8e72504652e624948f74acbba71c51ac2e342", GitTreeState:"dirty", GoVersion:"go1.17.2"}

Success with the custom Helm with the changes as above

$ ./bin/helm --debug --kube-context ctx list --all --deployed --failed --date -n ns --max 1000
secrets.go:101: [debug] list: fetched 300 secrets, more to fetch..

secrets.go:101: [debug] list: fetched 300 secrets, more to fetch..

secrets.go:101: [debug] list: fetched 300 secrets, more to fetch..

secrets.go:101: [debug] list: fetched 300 secrets, more to fetch..

secrets.go:101: [debug] list: fetched 300 secrets, more to fetch..

secrets.go:113: [debug] list: fetched 1621 releases
...
...
<list of releases in namespace `ns`>

Note: the built-in unit tests are currently failing; I have yet to update them. I will fix them, or if someone can help me fix them ASAP, I can open a PR and get this ready for merge.

EDIT: the unit tests pass now; the PR is out.

github-actions[bot] commented 2 years ago

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

eboboshka commented 2 years ago

Is there any update?

DmitriyStoyanov commented 2 years ago

Any update on this? I hit the issue with Helm version 3.9.4, so the issue still exists.