day0ops closed this issue 2 months ago
This issue actually spawned two production incidents last week, so this one is pretty high in priority for Xero. It would be great if this could be looked at soon.
Related issue https://solo-io.zendesk.com/agent/tickets/2123
There are a few dimensions to this issue, and I want to make sure we break them down so we can better understand the challenges and then come up with a desired solution. To me there are a few aspects at play:
Can you help me understand how this plays a role here? I would imagine that cert-manager (as an example) doesn't delete the secret, but instead modifies it when there is an update. I'm happy to learn more about the internals of what takes place, but can you share what the expected order of events is and where in there Secrets are being deleted?
I agree that we should not support this. There are a few cases that I think are worth considering:
I observed a similar behavior when configuring a single VS. I believe this is because if you invalidate the VS (and thus the filter chain), the listener is no longer valid and is not updated. As a result, the "old", previously working tls config is still stored in memory in the proxy. When a new pod starts up, it requests the config from gloo and doesn't pick up any listeners, because the xds cache has no valid listeners.
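A way to see what the running proxy still holds in memory (a sketch, assuming the default gloo-system install; the envoy admin interface on port 19000 is also used later in this thread):
kubectl port-forward -n gloo-system deploy/gateway-proxy 19000 &
# Listener names envoy currently has in memory (the "old" but previously valid config)
curl -s localhost:19000/listeners
# Full config for closer inspection of the stale TLS filter chain
curl -s localhost:19000/config_dump > config_dump.json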
I noticed, however, that when I had multiple virtual services, invalidating one resulted in the listener being updated by removing the now-invalid filter chain. Since there were >0 valid filter chains, the listener kept functioning. This to me feels like the proper behavior, but the open question is how we want to handle the case where a listener has 0 valid virtual hosts (because each has an invalid TLS config).
To my understanding, this is the current behavior. It feels like the open questions are:
I feel that there is some context missing here, as you mention a secret being deleted, which I don't see in this issue (maybe it's part of the linked Zendesk ticket, but I do not have access to it). Our situation might be a bit different, but I think this is similar to an issue we're having. Let me know if you prefer that I create a separate issue.
AFAIK this is only an issue when creating a new VirtualService and associated cert, not on deletion or updates. What happens is that Gloo processes the VirtualServices almost immediately, while cert-manager needs a certain amount of time to fetch a certificate (it can take > 1h in case of rate limiting, or even be impossible to fetch at all if there are issues answering the challenge). Cert-manager only creates the kubernetes secret when the certificate has been properly issued, therefore there is almost always some time during which Gloo sees new VirtualServices as invalid.
The way nginx-ingress solves this problem is by serving a default self-signed certificate if the secret cannot be found.
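For reference, a sketch of how such a default certificate is typically provisioned; the secret name below is made up, and the --default-ssl-certificate flag is ingress-nginx's mechanism, not something Gloo exposes today:
# Generate a throwaway self-signed certificate
openssl req -x509 -newkey rsa:2048 -nodes -days 365 -subj "/CN=default.invalid" -keyout tls.key -out tls.crt
# Store it as a TLS secret a proxy could fall back to
kubectl create secret tls default-fallback-cert -n gloo-system --cert=tls.crt --key=tls.key
# ingress-nginx then serves it for missing/unmatched certs via --default-ssl-certificate=gloo-system/default-fallback-cert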
I haven't had this issue; it seems reasonable to have a validation webhook on secret deletion, but as you mentioned, with a flag to disable it for backwards compatibility.
One issue I have with the validation webhook approach is that it introduces a notion of dependency (the VirtualService has to be deleted first, then the secret). This can make simple workflows using kubectl apply/delete break or be flaky, as there's no way to guarantee the order objects will be created/deleted in.
Note that this is also true at creation time with some of the existing validation hooks. Here's an example of applying a single YAML file that contains a VirtualService and a cert-manager Certificate:
$ kubectl apply -f vs-test.yaml
Error from server: error when creating "test.mydomain.com.yaml": admission webhook "gloo.gloo-system.svc" denied the request: resource incompatible with current Gloo snapshot: [Validating v1.VirtualService failed: validating *v1.VirtualService name:"test.mydomain.com" namespace:"gloo-test": failed to validate Proxy with Gloo validation server: Listener Error: SSLConfigError. Reason: SSL secret not found: list did not find secret test.mydomain.com]
To work around this one would have to either wait for cert-manager to create the secret before applying the VirtualService, or set validation.alwaysAcceptResources to true, but the latter means that we will also accept truly malformed VirtualServices.
Serving a default cert when the secret cannot be found would be a simple way to solve this issue. The validation webhook could maybe just throw a warning instead of failing?
Merging these 2 questions, as I'm not sure which one this falls into, but what happened to us was that the gloo Gateway was in an error state (because the certificate couldn't be fetched and thus the TLS secret couldn't be found). Everything still worked fine until the gloo pod restarted (not gateway-proxy), after which the new pod decided to drop the whole listener (stripping invalid listener from proxy):
$ curl localhost:19000/listeners
prometheus_listener::0.0.0.0:8081
listener-::-8080::[::]:8080
(This was tested on gloo OSS 1.13.30)
This can be easily reproduced by creating a VirtualService referencing a secret that doesn't exist (with validation.alwaysAcceptResources set to true), then restarting the gloo pod.
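For reference, a sketch of that restart step (assuming the default gloo-system install):
# Restart the control plane only; gateway-proxy keeps running with its in-memory config
kubectl rollout restart -n gloo-system deployment/gloo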
I do have other https VirtualServices that have a good config, so it's not that we ended up with 0 valid routes. This is IMO the critical part of the issue, where a single badly configured VS can take everything down with it.
Here are the logs from gateway-proxy when this happens (I removed the deprecation warnings):
[2023-11-01 00:31:15.180][7][info][upstream] [external/envoy/source/server/lds_api.cc:82] lds: add/update listener 'listener-::-8080'
[2023-11-01 00:31:15.222][7][info][upstream] [external/envoy/source/server/lds_api.cc:82] lds: add/update listener 'listener-::-8443'
[2023-11-01 00:31:15.228][7][info][config] [external/envoy/source/server/listener_manager_impl.cc:841] all dependencies initialized. starting workers
[2023-11-01 00:40:27.700][7][warning][config] [external/envoy/source/common/config/grpc_stream.h:160] StreamAggregatedResources gRPC config stream closed: 13,
[2023-11-01 00:40:28.058][7][info][upstream] [external/envoy/source/server/lds_api.cc:63] lds: remove listener 'listener-::-8443'
[2023-11-01 00:46:15.228][7][info][main] [external/envoy/source/server/drain_manager_impl.cc:171] shutting down parent after drain
Logs for the gloo pod (note that the timestamps won't match, as I had to restart the pod again to capture the logs):
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.468Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer","caller":"syncer/translator_syncer.go:92","msg":"begin sync 5563496949759613014 (56 virtual services, 6 gateways, 30 route tables)","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.470Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer","caller":"syncer/translator_syncer.go:132","msg":"desired proxy name:\"gateway-proxy\" namespace:\"gloo-system\"","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.476Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer","caller":"validation/server.go:189","msg":"received proxy validation request","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.493Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer.proxy-validator.translator","caller":"translator/translator.go:223","msg":"computing envoy resources for listener: listener-::-8080","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.496Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer.proxy-validator.translator","caller":"translator/translator.go:223","msg":"computing envoy resources for listener: listener-::-8443","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"warn","ts":"2023-11-01T01:14:13.548Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer","caller":"reconciler/proxy_reconciler.go:164","msg":"stripping invalid listener from proxy","version":"1.13.30","proxy":"name:\"gateway-proxy\" namespace:\"gloo-system\"","listener":"listener-::-8443"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.549Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer","caller":"syncer/translator_syncer.go:104","msg":"end sync 5563496949759613014","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.570Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.envoyTranslatorSyncer","caller":"syncer/envoy_translator_syncer.go:80","msg":"begin sync 11661502467246133202 (1 proxies, 471 upstreams, 579 endpoints, 197 secrets, 355 artifacts, 0 auth configs, 0 rate limit configs, 0 graphql apis)","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.583Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.envoyTranslatorSyncer.translator","caller":"translator/translator.go:223","msg":"computing envoy resources for listener: listener-::-8080","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.597Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.envoyTranslatorSyncer","caller":"syncer/envoy_translator_syncer.go:165","msg":"Setting xDS Snapshot","version":"1.13.30","key":"gloo-system~gateway-proxy","clusters":469,"listeners":1,"routes":1,"endpoints":451}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.597Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.envoyTranslatorSyncer","caller":"syncer/envoy_translator_syncer.go:175","msg":"end sync 11661502467246133202","version":"1.13.30"}
I just found out that cert-manager is able to issue a temporary certificate by setting the cert-manager.io/issue-temporary-certificate annotation to "true" on the cert.
This works nicely as a workaround, and it actually makes a lot more sense for cert-manager to do that rather than gloo.
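For reference, a minimal sketch of a Certificate using that annotation (the names and issuer below are illustrative):
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: test.mydomain.com
  namespace: gloo-system
  annotations:
    # cert-manager issues a self-signed placeholder immediately and
    # replaces it once the real certificate has been issued
    cert-manager.io/issue-temporary-certificate: "true"
spec:
  secretName: test.mydomain.com
  dnsNames:
    - test.mydomain.com
  issuerRef:
    name: my-ca-issuer
    kind: ClusterIssuer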
I think there's still work to do on the gloo side: I shouldn't need to set alwaysAccept to true to create VirtualServices with cert-manager; I would rather keep that set to false to catch other misconfiguration issues.
Another related issue https://solo-io.zendesk.com/agent/tickets/3753
@sam-heilbron to look at this issue for scope/estimation.
Zendesk ticket #3753 has been linked to this issue.
I have been able to reproduce this with the following steps (assuming a running cluster):
Create the cert-manager installation
kubectl create namespace cert-manager
kubectl apply --validate=false -f https://github.com/cert-manager/cert-manager/releases/download/v1.11.0/cert-manager.yaml
Create the gloo installation
helm repo add gloo https://storage.googleapis.com/solo-public-helm
helm repo update
helm install -n gloo-system gloo gloo/gloo --set gateway.persistProxySpec=true --set gateway.validation.alwaysAcceptResources=false --set gateway.validation.allowWarnings=false
Create the cert-manager pre-requisites
kubectl apply -f - << EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-issuer
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-selfsigned-ca
  namespace: cert-manager
spec:
  isCA: true
  commonName: my-selfsigned-ca
  secretName: root-secret
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    name: selfsigned-issuer
    kind: ClusterIssuer
    group: cert-manager.io
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: my-ca-issuer
spec:
  ca:
    secretName: root-secret
EOF
Create a Certificate and a virtual service which consumes its resultant secret
kubectl apply -f - << EOF
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: test-1.solo.io
  namespace: gloo-system
spec:
  secretName: test-1.solo.io
  dnsNames:
    - test-1.solo.io
  issuerRef:
    name: my-ca-issuer
    kind: ClusterIssuer
---
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: direct-response-ssl-1
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - test-1.solo.io
      - test-1.solo.io:8443
    routes:
      - matchers:
          - prefix: /
        directResponseAction:
          status: 200
          body: you did it, number 1!
  sslConfig:
    secretRef:
      name: test-1.solo.io
      namespace: gloo-system
    oneWayTls: true
    sniDomains: [ "test-1.solo.io" ]
EOF
At this point, note that we have a validation error since cert-manager was not given time to create the secret before gloo attempted to reconcile the VS. Running the exact same command again will succeed IFF the cert (and its secret) were created first, which they should have been due to the ordering of the resources in the command.
certificate.cert-manager.io/test-1.solo.io created
Error from server: error when creating "STDIN": admission webhook "gloo.gloo-system.svc" denied the request: resource incompatible with current Gloo snapshot: [Validating *v1.VirtualService failed: 1 error occurred:
* Validating *v1.VirtualService failed: validating *v1.VirtualService name:"direct-response-ssl-1" namespace:"gloo-system": 1 error occurred:
* failed to validate Proxy [namespace: gloo-system, name: gateway-proxy] with gloo validation: Listener Error: SSLConfigError. Reason: SSL secret not found: list did not find secret gloo-system.test-1.solo.io
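A sketch of one way to avoid this race in the reproduction (assuming the Certificate above): wait for cert-manager to report the Certificate Ready before applying the VirtualService.
# Blocks until cert-manager has issued the cert and created the secret
kubectl wait --for=condition=Ready certificate/test-1.solo.io -n gloo-system --timeout=120s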
Port-forward the gateway-proxy in a separate terminal
kubectl port-forward -n gloo-system deploy/gateway-proxy 8443
Append the following lines to your hosts file for the hosts we will be using:
127.0.0.1 test-1.solo.io
127.0.0.1 test-2.solo.io
Execute a curl request showing that the proxy is responding as expected
curl -vk https://test-1.solo.io:8443
Delete the Certificate resource and secret. NOTE: this requires disabling validation. Currently, validation settings should prevent this particular failure mode, and when used in conjunction with the cert-manager configuration to enable temporary certs during issuance, the problem of a missing secret should be nullified. ...unless validation has to be disabled for cert-manager to work properly...
kubectl patch settings -n gloo-system default --type merge -p '{"spec":{"gateway":{"validation":{"alwaysAccept":true}}}}'
sleep 1
kubectl delete Certificate -n gloo-system test-1.solo.io && kubectl delete secret -n gloo-system test-1.solo.io
watch curl -vk https://test-1.solo.io:8443
We can see here that the VirtualService, where the actual configuration issue is occurring, shows Accepted with a subresource status for the Proxy showing Accepted; the Gateway which consumes the VirtualService shows Rejected with a subresource status for the Proxy showing Accepted; and the Proxy shows Rejected. At the very least this is inconsistent status reporting that misleads where the error is actually occurring.
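For reference, a sketch of how these statuses can be inspected (the Gateway name gateway-proxy-ssl is the default and an assumption here):
kubectl get virtualservices.gateway.solo.io -n gloo-system direct-response-ssl-1 -o yaml
kubectl get gateways.gateway.solo.io -n gloo-system gateway-proxy-ssl -o yaml
kubectl get proxies.gloo.solo.io -n gloo-system gateway-proxy -o yaml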
This work will be released in the following versions:
Gloo Edge: v1.18.0-beta16, v1.17.5, v1.16.20, v1.15.31
Gloo Edge Enterprise: v1.18.0-beta1, v1.17.2, v1.16.15, v1.15.22
This work will not be available in 1.14 versions of Gloo Edge because the validating webhook behavior was significantly different, and I was unable to reproduce the issue on 1.14.
This work is merged into all relevant versions as detailed in the comment above.
Gloo Edge Version
1.11.x (latest stable)
Kubernetes Version
1.23.x
Describe the bug
At times when certificate issuance in cert-manager is slow (for example because the DNS challenge isn't solved yet) and the TLS secret is therefore not available yet, Gloo Edge will go into a bad state when mounting a TLS secret. This is especially true if the proxy restarts, as this causes the HTTPS listener to disappear and will cause an outage.
The listener should keep functioning for all connections where a proper cert exists.
Steps to reproduce the bug
Observed when deleting a referenced certificate.
Expected Behavior
Expect the proxy to behave correctly for all other connections where a valid certificate exists.
Additional Context
No response