solo-io / gloo

The Feature-rich, Kubernetes-native, Next-Generation API Gateway Built on Envoy
https://docs.solo.io/
Apache License 2.0

glooctl check hangs at "Checking secrets..." #3785

Open dshackith opened 3 years ago

dshackith commented 3 years ago

Describe the bug

Running glooctl check -n custom-namespace hangs at "Checking secrets..."

To Reproduce

Steps to reproduce the behavior:

  1. install Gloo 1.4.15 via helm 2 with custom namespace (not using gloo-system)
  2. glooctl check -n my-namespace
  3. See error

Expected behavior

I expect glooctl check to complete or fail, not hang.

Additional context

Gloo was originally installed at 1.3.26 and was later upgraded to 1.4.15. It is not clear whether using a custom namespace is a factor, but that is what we are using.
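
Until the hang itself is fixed, one way to get "complete or fail" behavior is to run the check under an external deadline. The following is only a sketch, not part of glooctl: run_with_deadline is a hypothetical helper name, and it assumes GNU coreutils timeout is installed, which exits with status 124 when the deadline expires.

```shell
# Hypothetical helper (not a glooctl feature): run a command under a deadline
# so that a hang turns into a non-zero exit status.
# Assumes GNU coreutils `timeout`, which exits with 124 when the deadline hits.
run_with_deadline() {
  local seconds="$1" status=0
  shift
  timeout "$seconds" "$@" || status=$?
  if [ "$status" -eq 124 ]; then
    echo "timed out after ${seconds}s: $*" >&2
  fi
  return "$status"
}

# Example, using the namespace from the steps above:
#   run_with_deadline 120 glooctl check -n my-namespace
```

The 120-second value is an arbitrary choice for illustration; pick whatever deadline suits the cluster.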

NelsonJeppesen commented 3 years ago

I have this issue but with some differences

NelsonJeppesen commented 3 years ago

Works if I exclude secrets

$ glooctl check -x secrets        
Checking deployments... OK
Checking pods... OK
Checking upstreams... OK
Checking upstream groups... OK
Checking auth configs... OK
Checking rate limit configs... OK
Checking virtual services... OK
Checking gateways... OK
Checking proxies... OK
No problems detected.
Skipping Gloo Instance check -- Gloo Federation not detected

rustrial commented 3 years ago

Same here with gloo-edge 1.6.17.

rustrial commented 2 years ago

Still an issue with 1.6.37!

anessi commented 2 years ago

This issue seems to mainly pop up in configurations where we have Gloo set up for many environments (around 60 virtual services and 60 secrets (SSL certs)). The problem does not (usually) appear in namespaces running the same Gloo version with fewer virtual services (around 10 virtual services and 10 secrets). It also seems to depend on the load on the API server / cluster: sometimes we get a lot of these errors, sometimes fewer. In general, the secrets verification step can be very slow (minutes rather than seconds).

Checking secrets... E1112 13:15:08.482036   54982 request.go:1001] Unexpected error when reading response body: net/http: request canceled (Client.Timeout or context cancellation while reading body)
E1112 13:15:08.482033   54982 request.go:1001] Unexpected error when reading response body: net/http: request canceled (Client.Timeout or context cancellation while reading body)
E1112 13:15:08.482032   54982 request.go:1001] Unexpected error when reading response body: context deadline exceeded (Client.Timeout or context cancellation while reading body)
E1112 13:15:08.482036   54982 request.go:1001] Unexpected error when reading response body: net/http: request canceled (Client.Timeout or context cancellation while reading body)
E1112 13:15:08.482619   54982 request.go:1001] Unexpected error when reading response body: net/http: request canceled (Client.Timeout or context cancellation while reading body)
E1112 13:15:08.482716   54982 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.6/tools/cache/reflector.go:156: Failed to watch *v1.Secret: failed to list *v1.Secret: unexpected error when reading response body. Please retry. Original error: net/http: request canceled (Client.Timeout or context cancellation while reading body)

When we run glooctl check --exclude secrets, the errors above do not appear.
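
To tell whether the slowness comes from the API server rather than from glooctl itself, it may help to time a plain secrets listing with kubectl. This is a rough sketch under assumptions: elapsed_seconds is a hypothetical helper, the namespace is a placeholder, and --request-timeout is the standard kubectl flag for bounding a single request.

```shell
# Hypothetical helper: print how many whole seconds a command took,
# then return that command's exit status. Command output is discarded.
elapsed_seconds() {
  local start end status=0
  start=$(date +%s)
  "$@" >/dev/null 2>&1 || status=$?
  end=$(date +%s)
  echo $((end - start))
  return "$status"
}

# Example (placeholder namespace):
#   elapsed_seconds kubectl get secrets -n my-namespace --request-timeout=60s
```

If the plain listing already takes minutes, the bottleneck is probably outside glooctl.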

NelsonJeppesen commented 2 years ago

I just tried the latest glooctl, v1.10.0-beta8, and got something similar. This issue seems to have become worse when we moved to AWS IAM authentication for kube. It almost feels like too many IAM auth requests per second, but that is just a gut feeling.

❯ ./glooctl-linux-amd64.2 check
Checking deployments... OK
Checking pods... OK
Checking upstreams... OK
Checking upstream groups... OK
Checking auth configs... OK
Checking rate limit configs... OK
Checking VirtualHostOptions... WARN: VirtualHostOption CRD has not been registered
Checking RouteOptions... WARN: RouteOption CRD has not been registered
Checking secrets... W1125 22:59:25.411878   32643 transport.go:260] Unable to cancel request for *exec.roundTripper
E1125 22:59:25.411958   32643 request.go:1011] Unexpected error when reading response body: net/http: request canceled (Client.Timeout or context cancellation while reading body)
W1125 22:59:25.411960   32643 transport.go:260] Unable to cancel request for *exec.roundTripper
E1125 22:59:25.412090   32643 request.go:1011] Unexpected error when reading response body: context deadline exceeded (Client.Timeout or context cancellation while reading body)
E1125 22:59:25.412092   32643 reflector.go:138] pkg/mod/k8s.io/client-go@v0.20.9/tools/cache/reflector.go:167: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: unexpected error when reading response body. Please retry. Original error: net/http: request canceled (Client.Timeout or context cancellation while reading body)
E1125 22:59:25.412194   32643 reflector.go:138] pkg/mod/k8s.io/client-go@v0.20.9/tools/cache/reflector.go:167: Failed to watch *v1.Secret: failed to list *v1.Secret: unexpected error when reading response body. Please retry. Original error: context deadline exceeded (Client.Timeout or context cancellation while reading body)
W1125 22:59:25.412209   32643 transport.go:260] Unable to cancel request for *exec.roundTripper
E1125 22:59:25.412241   32643 request.go:1011] Unexpected error when reading response body: context deadline exceeded (Client.Timeout or context cancellation while reading body)
E1125 22:59:25.412290   32643 reflector.go:138] pkg/mod/k8s.io/client-go@v0.20.9/tools/cache/reflector.go:167: Failed to watch *v1.Secret: failed to list *v1.Secret: unexpected error when reading response body. Please retry. Original error: context deadline exceeded (Client.Timeout or context cancellation while reading body)
W1125 22:59:25.414128   32643 transport.go:260] Unable to cancel request for *exec.roundTripper
E1125 22:59:25.414171   32643 request.go:1011] Unexpected error when reading response body: context deadline exceeded (Client.Timeout or context cancellation while reading body)
...

NelsonJeppesen commented 2 years ago

We don't have that many services, though

❯ k get secrets -A |grep tls |wc -l
13

❯ k get virtualservices.gateway.solo.io -A | wc -l
33

but a decent amount of secrets

❯ k get secrets -A |wc -l
1154
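
Note that piping the default kubectl table output into wc -l also counts the header row, so the totals above are likely one higher than the real number of resources, and grep tls matches any line containing "tls", including names. A slightly more careful count could look like the sketch below. Both function names are hypothetical, and it assumes the default kubectl get secrets -A columns (NAMESPACE NAME TYPE DATA AGE).

```shell
# Count data rows in default kubectl table output read from stdin,
# skipping the header row that kubectl prints unless --no-headers is given.
count_rows() {
  awk 'NR > 1 { n++ } END { print n + 0 }'
}

# Count secrets whose TYPE is kubernetes.io/tls.
# Expects `kubectl get secrets -A --no-headers` output on stdin; with -A the
# columns are NAMESPACE NAME TYPE DATA AGE, so TYPE is field 3.
count_tls_secrets() {
  awk '$3 == "kubernetes.io/tls" { n++ } END { print n + 0 }'
}

# Examples:
#   kubectl get secrets -A | count_rows
#   kubectl get secrets -A --no-headers | count_tls_secrets
```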

chrisgaun commented 2 years ago

OK, we will fix this. Prioritizing now. We will make this part of the next release iteration (January to March).

sam-heilbron commented 2 years ago

This issue sounds similar to https://github.com/solo-io/gloo/issues/5061. Our initial findings indicated that https://github.com/kubernetes/kubernetes/issues/91913 was the source. We upgraded our k8s libraries to a version containing a fix and this is available since Gloo Edge OSS 1.10.0-beta12. @dshackith @NelsonJeppesen could you try using a later version of glooctl to verify whether our updates resolved this particular issue and comment here with the outcome?

NelsonJeppesen commented 2 years ago

@sam-heilbron nope

❯ ./glooctl-linux-amd64.1 version
Client: {"version":"1.11.0-beta3"}
❯ ./glooctl-linux-amd64.1 check
Checking deployments... OK
Checking pods... OK
Checking upstreams... OK
Checking upstream groups... OK
Checking auth configs... OK
Checking rate limit configs... OK
Checking VirtualHostOptions... OK
Checking RouteOptions... OK
Checking secrets... W0108 20:48:15.990200    7364 transport.go:288] Unable to cancel request for *exec.roundTripper
W0108 20:48:15.990243    7364 transport.go:288] Unable to cancel request for *exec.roundTripper
W0108 20:48:15.990283    7364 transport.go:288] Unable to cancel request for *exec.roundTripper
E0108 20:48:15.990290    7364 request.go:1085] Unexpected error when reading response body: net/http: request canceled (Client.Timeout or context cancellation while reading body)
E0108 20:48:15.990314    7364 request.go:1085] Unexpected error when reading response body: net/http: request canceled (Client.Timeout or context cancellation while reading body)
W0108 20:48:15.990354    7364 transport.go:288] Unable to cancel request for *exec.roundTripper
W0108 20:48:15.990371    7364 transport.go:288] Unable to cancel request for *exec.roundTripper
E0108 20:48:15.990381    7364 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.4/tools/cache/reflector.go:167: Failed to watch *v1.Service: failed to list *v1.Service: unexpected error when reading response body. Please retry. Original error: net/http: request canceled (Client.Timeout or context cancellation while reading body)
E0108 20:48:15.990390    7364 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.4/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: unexpected error when reading response body. Please retry. Original error: net/http: request canceled (Client.Timeout or context cancellation while reading body)
W0108 20:48:15.990404    7364 transport.go:288] Unable to cancel request for *exec.roundTripper
W0108 20:48:15.990416    7364 transport.go:288] Unable to cancel request for *exec.roundTripper
...

I tried with 1.10.0-beta13 as well, just in case this was only merged to 1.10.0-beta12+.
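
The log excerpts in this thread show list failures for Secret and ConfigMap, and in this last run for Service and Pod, so the timeouts do not appear to be limited to secrets. A quick way to summarize which kinds are failing is sketched below. failed_list_kinds is a hypothetical helper that relies only on the "failed to list *v1.<Kind>" wording seen in the logs above.

```shell
# Hypothetical helper: read client-go log lines on stdin and print how often
# each resource kind appears in a "failed to list *v1.<Kind>" message,
# most frequent first.
failed_list_kinds() {
  grep -oE 'failed to list \*v1\.[A-Za-z]+' \
    | sed -E 's/.*\*v1\.//' \
    | sort \
    | uniq -c \
    | sort -rn
}

# Example (merges stderr into stdout so the log lines reach the pipe):
#   glooctl check 2>&1 | failed_list_kinds
```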

github-actions[bot] commented 9 months ago

This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.

anessi commented 9 months ago

The secrets part works for me now on Gloo EE version 1.15.8, Gloo OSS 1.15.17, Kubernetes 1.26.9.

$ glooctl check
Checking deployments... OK
Checking pods... OK
Checking upstreams... OK
Checking upstream groups... OK
Checking auth configs... OK
Checking rate limit configs... OK
Checking VirtualHostOptions... OK
Checking RouteOptions... OK
Checking secrets... OK
Checking virtual services... OK
Checking gateways... OK
Checking proxies...

However, it now hangs at "Checking proxies...", even though the server version matches the glooctl version. This looks like a separate issue.
