day0ops closed this issue 2 months ago
This issue actually spawned two production incidents last week, so this one is pretty high in priority for Xero. It would be great if this could be looked at soon.
Related issue https://solo-io.zendesk.com/agent/tickets/2123
There are a few dimensions to this issue, and I want to make sure we break them down so we can better understand the challenges and then come up with a desired solution. To me there are a few aspects at play:
Can you help me understand how this plays a role here? I would imagine that cert-manager (as an example) doesn't delete the secret, but instead modifies it when there is an update. I'm happy to learn more about the internals of what takes place, but can you share what the expected order of events is and where in there Secrets are being deleted?
I agree that we should not support this. There are a few cases that I think are worth considering:
I observed a similar behavior when configuring a single VS. I believe this is because if you invalidate the VS (and thus the filter chain), the listener is no longer valid and is not updated. As a result, the "old", previously working tls config is still stored in memory in the proxy. When a new pod starts up, it requests the config from gloo and doesn't pick up any listeners, because the xds cache has no valid listeners.
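A way to see what the running proxy still holds in memory (a sketch, assuming the default gloo-system install; the envoy admin interface on port 19000 is also used later in this thread):
kubectl port-forward -n gloo-system deploy/gateway-proxy 19000 &
# Listener names envoy currently has in memory (the "old" but previously valid config)
curl -s localhost:19000/listeners
# Full config for closer inspection of the stale TLS filter chain
curl -s localhost:19000/config_dump > config_dump.json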
I noticed, however, that when I had multiple virtual services, invalidating one resulted in the listener being updated by removing the now-invalid filter chain. Since there were >0 valid filter chains, the listener kept functioning. This to me feels like the proper behavior, but the open question is how we want to handle the case where a listener has 0 valid virtual hosts (because each has an invalid TLS config).
To my understanding, this is the current behavior. It feels like the open questions are:
I feel that there is some context missing here, as you mention a secret being deleted, which I don't see in this issue (maybe it's part of the linked Zendesk ticket, but I do not have access to it). Our situation might be a bit different, but I think this is similar to an issue we're having. Let me know if you prefer that I create a separate issue.
AFAIK this is only an issue when creating a new VirtualService and associated cert, not on deletion or updates. What happens is that Gloo processes the VirtualServices almost immediately, while cert-manager needs a certain amount of time to fetch a certificate (it can take > 1h in case of rate limiting, or even be impossible to fetch at all if there are issues answering the challenge). Cert-manager only creates the kubernetes secret when the certificate has been properly issued, therefore there is almost always some time during which Gloo sees new VirtualServices as invalid.
The way nginx-ingress solves this problem is by serving a default self-signed certificate if the secret cannot be found.
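For reference, a sketch of how such a default certificate is typically provisioned; the secret name below is made up, and the --default-ssl-certificate flag is ingress-nginx's mechanism, not something Gloo exposes today:
# Generate a throwaway self-signed certificate
openssl req -x509 -newkey rsa:2048 -nodes -days 365 -subj "/CN=default.invalid" -keyout tls.key -out tls.crt
# Store it as a TLS secret a proxy could fall back to
kubectl create secret tls default-fallback-cert -n gloo-system --cert=tls.crt --key=tls.key
# ingress-nginx then serves it for missing/unmatched certs via --default-ssl-certificate=gloo-system/default-fallback-cert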
I haven't had this issue; it seems reasonable to have a validation webhook on secret deletion, but as you mentioned, with a flag to disable it for backwards compatibility.
One issue I have with the validation webhook approach is that it introduces a notion of dependency (the VirtualService has to be deleted first, then the secret). This can make simple workflows using kubectl apply/delete break or be flaky, as there's no way to guarantee the order objects will be created/deleted in.
Note that this is also true at creation time with some of the existing validation hooks. Here's an example of applying a single YAML file that contains a VirtualService and a cert-manager Certificate:
$ kubectl apply -f vs-test.yaml
Error from server: error when creating "test.mydomain.com.yaml": admission webhook "gloo.gloo-system.svc" denied the request: resource incompatible with current Gloo snapshot: [Validating v1.VirtualService failed: validating *v1.VirtualService name:"test.mydomain.com" namespace:"gloo-test": failed to validate Proxy with Gloo validation server: Listener Error: SSLConfigError. Reason: SSL secret not found: list did not find secret test.mydomain.com]
To work around this one would have to either wait for cert-manager to create the secret before applying the VirtualService, or set validation.alwaysAcceptResources to true, but the latter means that we will also accept truly malformed VirtualServices.
Serving a default cert when the secret cannot be found would be a simple way to solve this issue. The validation webhook could maybe just throw a warning instead of failing?
Merging these 2 questions, as I'm not sure which one this falls into, but what happened to us was that the gloo Gateway was in an error state (because the certificate couldn't be fetched and thus the TLS secret couldn't be found). Everything still worked fine until the gloo pod restarted (not gateway-proxy), after which the new pod decided to drop the whole listener (stripping invalid listener from proxy):
$ curl localhost:19000/listeners
prometheus_listener::0.0.0.0:8081
listener-::-8080::[::]:8080
(This was tested on gloo OSS 1.13.30)
This can be easily reproduced by creating a VirtualService referencing a secret that doesn't exist (with validation.alwaysAcceptResources set to true), then restarting the gloo pod.
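For reference, a sketch of that restart step (assuming the default gloo-system install):
# Restart the control plane only; gateway-proxy keeps running with its in-memory config
kubectl rollout restart -n gloo-system deployment/gloo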
I do have other https VirtualServices that have a good config, so it's not that we ended up with 0 valid routes. This is IMO the critical part of the issue, where a single badly configured VS can take everything down with it.
Here are the logs from gateway-proxy when this happens (I removed the deprecation warnings):
[2023-11-01 00:31:15.180][7][info][upstream] [external/envoy/source/server/lds_api.cc:82] lds: add/update listener 'listener-::-8080'
[2023-11-01 00:31:15.222][7][info][upstream] [external/envoy/source/server/lds_api.cc:82] lds: add/update listener 'listener-::-8443'
[2023-11-01 00:31:15.228][7][info][config] [external/envoy/source/server/listener_manager_impl.cc:841] all dependencies initialized. starting workers
[2023-11-01 00:40:27.700][7][warning][config] [external/envoy/source/common/config/grpc_stream.h:160] StreamAggregatedResources gRPC config stream closed: 13,
[2023-11-01 00:40:28.058][7][info][upstream] [external/envoy/source/server/lds_api.cc:63] lds: remove listener 'listener-::-8443'
[2023-11-01 00:46:15.228][7][info][main] [external/envoy/source/server/drain_manager_impl.cc:171] shutting down parent after drain
Logs for the gloo pod (note that the timestamps won't match, as I had to restart the pod again to capture the logs):
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.468Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer","caller":"syncer/translator_syncer.go:92","msg":"begin sync 5563496949759613014 (56 virtual services, 6 gateways, 30 route tables)","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.470Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer","caller":"syncer/translator_syncer.go:132","msg":"desired proxy name:\"gateway-proxy\" namespace:\"gloo-system\"","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.476Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer","caller":"validation/server.go:189","msg":"received proxy validation request","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.493Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer.proxy-validator.translator","caller":"translator/translator.go:223","msg":"computing envoy resources for listener: listener-::-8080","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.496Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer.proxy-validator.translator","caller":"translator/translator.go:223","msg":"computing envoy resources for listener: listener-::-8443","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"warn","ts":"2023-11-01T01:14:13.548Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer","caller":"reconciler/proxy_reconciler.go:164","msg":"stripping invalid listener from proxy","version":"1.13.30","proxy":"name:\"gateway-proxy\" namespace:\"gloo-system\"","listener":"listener-::-8443"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.549Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer","caller":"syncer/translator_syncer.go:104","msg":"end sync 5563496949759613014","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.570Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.envoyTranslatorSyncer","caller":"syncer/envoy_translator_syncer.go:80","msg":"begin sync 11661502467246133202 (1 proxies, 471 upstreams, 579 endpoints, 197 secrets, 355 artifacts, 0 auth configs, 0 rate limit configs, 0 graphql apis)","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.583Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.envoyTranslatorSyncer.translator","caller":"translator/translator.go:223","msg":"computing envoy resources for listener: listener-::-8080","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.597Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.envoyTranslatorSyncer","caller":"syncer/envoy_translator_syncer.go:165","msg":"Setting xDS Snapshot","version":"1.13.30","key":"gloo-system~gateway-proxy","clusters":469,"listeners":1,"routes":1,"endpoints":451}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.597Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.envoyTranslatorSyncer","caller":"syncer/envoy_translator_syncer.go:175","msg":"end sync 11661502467246133202","version":"1.13.30"}
I just found out that cert-manager is able to issue a temporary certificate by setting the cert-manager.io/issue-temporary-certificate annotation to "true" on the cert.
This works nicely as a workaround, and it actually makes a lot more sense for cert-manager to do that rather than gloo.
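For reference, a minimal sketch of a Certificate using that annotation (the names and issuer below are illustrative):
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: test.mydomain.com
  namespace: gloo-system
  annotations:
    # cert-manager issues a self-signed placeholder immediately and
    # replaces it once the real certificate has been issued
    cert-manager.io/issue-temporary-certificate: "true"
spec:
  secretName: test.mydomain.com
  dnsNames:
    - test.mydomain.com
  issuerRef:
    name: my-ca-issuer
    kind: ClusterIssuer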
I think there's still work to do on the gloo side: I shouldn't need to set alwaysAccept to true to create VirtualServices with cert-manager; I would rather keep that set to false to catch other misconfiguration issues.
Another related issue https://solo-io.zendesk.com/agent/tickets/3753
@sam-heilbron to look at this issue for scope/estimation.
Zendesk ticket #3753 has been linked to this issue.
I have been able to reproduce this with the following steps (assuming a running cluster):
Create the cert-manager installation
kubectl create namespace cert-manager
kubectl apply --validate=false -f https://github.com/cert-manager/cert-manager/releases/download/v1.11.0/cert-manager.yaml
Create the gloo installation
helm repo add gloo https://storage.googleapis.com/solo-public-helm
helm repo update
helm install -n gloo-system gloo gloo/gloo --set gateway.persistProxySpec=true --set gateway.validation.alwaysAcceptResources=false --set gateway.validation.allowWarnings=false
Create the cert-manager pre-requisites
kubectl apply -f - << EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-issuer
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-selfsigned-ca
  namespace: cert-manager
spec:
  isCA: true
  commonName: my-selfsigned-ca
  secretName: root-secret
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    name: selfsigned-issuer
    kind: ClusterIssuer
    group: cert-manager.io
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: my-ca-issuer
spec:
  ca:
    secretName: root-secret
EOF
Create a Certificate and a virtual service which consumes its resultant secret
kubectl apply -f - << EOF
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: test-1.solo.io
  namespace: gloo-system
spec:
  secretName: test-1.solo.io
  dnsNames:
    - test-1.solo.io
  issuerRef:
    name: my-ca-issuer
    kind: ClusterIssuer
---
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: direct-response-ssl-1
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - test-1.solo.io
      - test-1.solo.io:8443
    routes:
      - matchers:
          - prefix: /
        directResponseAction:
          status: 200
          body: you did it, number 1!
  sslConfig:
    secretRef:
      name: test-1.solo.io
      namespace: gloo-system
    oneWayTls: true
    sniDomains: [ "test-1.solo.io" ]
EOF
At this point, note that we have a validation error since cert-manager was not given time to create the secret before gloo attempted to reconcile the VS. Running the exact same command again will succeed IFF the cert (and its secret) were created first, which they should have been due to the ordering of the resources in the command.
certificate.cert-manager.io/test-1.solo.io created
Error from server: error when creating "STDIN": admission webhook "gloo.gloo-system.svc" denied the request: resource incompatible with current Gloo snapshot: [Validating *v1.VirtualService failed: 1 error occurred:
* Validating *v1.VirtualService failed: validating *v1.VirtualService name:"direct-response-ssl-1" namespace:"gloo-system": 1 error occurred:
* failed to validate Proxy [namespace: gloo-system, name: gateway-proxy] with gloo validation: Listener Error: SSLConfigError. Reason: SSL secret not found: list did not find secret gloo-system.test-1.solo.io
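A sketch of one way to avoid this race in the reproduction (assuming the Certificate above): wait for cert-manager to report the Certificate Ready before applying the VirtualService.
# Blocks until cert-manager has issued the cert and created the secret
kubectl wait --for=condition=Ready certificate/test-1.solo.io -n gloo-system --timeout=120s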
Port-forward the gateway-proxy in a separate terminal
kubectl port-forward -n gloo-system deploy/gateway-proxy 8443
Append the following lines to your hosts file for the hosts we will be using:
127.0.0.1 test-1.solo.io
127.0.0.1 test-2.solo.io
Execute a curl request showing that the proxy is responding as expected
curl -vk https://test-1.solo.io:8443
Delete the Certificate resource and secret. NOTE: this requires disabling validation. Currently, validation settings should prevent this particular failure mode, and when used in conjunction with the cert-manager configuration to enable temporary certs during issuance, the problem of a missing secret should be nullified. ...unless validation has to be disabled for cert-manager to work properly...
kubectl patch settings -n gloo-system default --type merge -p '{"spec":{"gateway":{"validation":{"alwaysAccept":true}}}}'
sleep 1
kubectl delete Certificate -n gloo-system test-1.solo.io && kubectl delete secret -n gloo-system test-1.solo.io
watch curl -vk https://test-1.solo.io:8443
We can see here that the VirtualService, where the actual configuration issue is occurring, shows Accepted with a subresource status for the Proxy showing Accepted; the Gateway which consumes the VirtualService shows Rejected with a subresource status for the Proxy showing Accepted; and the Proxy shows Rejected. At the very least this is inconsistent status reporting that misleads where the error is actually occurring.
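For reference, a sketch of how these statuses can be inspected (the Gateway name gateway-proxy-ssl is the default and an assumption here):
kubectl get virtualservices.gateway.solo.io -n gloo-system direct-response-ssl-1 -o yaml
kubectl get gateways.gateway.solo.io -n gloo-system gateway-proxy-ssl -o yaml
kubectl get proxies.gloo.solo.io -n gloo-system gateway-proxy -o yaml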
This work will be released in the following versions:
Gloo Edge: v1.18.0-beta16, v1.17.5, v1.16.20, v1.15.31
Gloo Edge Enterprise: v1.18.0-beta1, v1.17.2, v1.16.15, v1.15.22
This work will not be available in 1.14 versions of Gloo Edge because the validating webhook behavior was significantly different, and I was unable to reproduce the issue on 1.14.
This work is merged into all relevant versions as detailed in the comment above.
Gloo Edge Version
1.11.x (latest stable)
Kubernetes Version
1.23.x
Describe the bug
At times when certificate issuance in cert-manager is slow (for example because the DNS challenge isn't solved yet) and the TLS secret is therefore not available yet, Gloo Edge will go into a bad state when mounting a TLS secret. This is especially true if the proxy restarts, as this causes the HTTPS listener to disappear and will cause an outage.
The listener should keep functioning for all connections where a proper cert exists.
Steps to reproduce the bug
Observed when deleting a referenced certificate.
Expected Behavior
Expect the proxy to behave correctly for all other connections where a valid certificate exists.
Additional Context
No response