Open kdorosh opened 2 years ago
per discussion with @nrjpoddar and @kcbabo , the preferred long term solution is https://github.com/solo-io/gloo/issues/6114
That change is larger and riskier; in the meantime we will add this support temporarily, then deprecate and remove it once the other feature is implemented and well tested in the field.
@kdorosh we need to consider the implications of adding private keys and certificates (when not using SDS) to a PV. They would like to have the persistence in HA Redis. This can be follow-up work.
@chrisgaun as noted earlier, the preferred long-term solution is https://github.com/solo-io/gloo/issues/6114 so all state is stored in etcd.
In the meantime, an encrypted volume (e.g. https://kubernetes.io/docs/concepts/storage/storage-classes/#aws-ebs) may be acceptable.
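For illustration, an encrypted EBS-backed StorageClass along the lines of the linked Kubernetes docs might look like the following (the `encrypted` parameter is documented for the `kubernetes.io/aws-ebs` provisioner; the name `encrypted-gp2` is just an example):

```yaml
# Illustrative sketch: an encrypted AWS EBS StorageClass,
# per the Kubernetes storage-classes documentation linked above.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: encrypted-gp2
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  encrypted: "true"
```

A PVC referencing this StorageClass would then get an encrypted volume at rest, which mitigates (but does not eliminate) the concern about persisting private keys.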
We could explore HA Redis, but that seems similar to making xds-relay HA, which might be preferable; that said, https://github.com/solo-io/gloo/issues/6114 is still preferred in my opinion.
related blocker I ran into while doing the work: https://github.com/solo-io/solo-kit/issues/461
related: https://github.com/solo-io/gloo/issues/5022
steps to reproduce:

```sh
kind create cluster --name kind --image kindest/node:v1.21.1@sha256:69860bda5563ac81e3c0057d654b5253219618a22ec3a346306239bba8cfa1a6
glooctl install gateway enterprise --version 1.11.0-beta8 --license-key $LICENSE_KEY
glooctl install gateway --version 1.12.0-beta1
kubectl scale -n gloo-system deploy/discovery --replicas 0
kubectl apply -f https://raw.githubusercontent.com/solo-io/gloo/v1.2.9/example/petstore/petstore.yaml
```

Then apply:
```yaml
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: default
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - '*'
    routes:
      - matchers:
          - exact: /all-pets
        options:
          prefixRewrite: /api/pets
        routeAction:
          single:
            upstream:
              name: default-petstore-8080
              namespace: gloo-system
      - matchers:
          - exact: /all-pets2
        options:
          prefixRewrite: /api/pets
        routeAction:
          single:
            upstream:
              name: default-petstore2-8080
              namespace: gloo-system
---
apiVersion: v1
kind: Service
metadata:
  name: petstore
  namespace: default
  labels:
    service: petstore
spec:
  ports:
    - port: 8080
      protocol: TCP
  selector:
    app: petstore
---
apiVersion: v1
kind: Service
metadata:
  name: petstore2
  namespace: default
  labels:
    service: petstore
spec:
  ports:
    - port: 8080
      protocol: TCP
  selector:
    app: petstore
---
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: default-petstore-8080
  namespace: gloo-system
spec:
  kube:
    selector:
      app: petstore
    serviceName: petstore
    serviceNamespace: default
    servicePort: 8080
---
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: default-petstore2-8080
  namespace: gloo-system
spec:
  kube:
    selector:
      app: petstore
    serviceName: petstore2
    serviceNamespace: default
    servicePort: 8080
```
```sh
kubectl port-forward -n gloo-system deploy/gateway-proxy 8080
curl -H "Host: foo" localhost:8080/all-pets
curl -H "Host: foo" localhost:8080/all-pets2
kubectl delete svc petstore2
curl -H "Host: foo" localhost:8080/all-pets
curl -H "Host: foo" localhost:8080/all-pets2
kubectl delete po -n gloo-system --all
curl -H "Host: foo" localhost:8080/all-pets
curl -H "Host: foo" localhost:8080/all-pets2
```
--> only the last curl (`/all-pets2`) should fail. Logs:

```
{"level":"warn","ts":"2022-04-07T14:55:55.263Z","logger":"gloo-ee.v1.event_loop.setup.gloosnapshot.event_loop.reporter","caller":"reporter/reporter.go:255","msg":"failed to write status state:Warning reason:\"warning: \\n 1 error occurred:\\n\\t* Upstream name:\\\"default-petstore2-8080\\\" namespace:\\\"gloo-system\\\" references the service \\\"petstore2\\\" which does not exist in namespace \\\"default\\\"\\n\\n\" reported_by:\"gloo\" for resource default-petstore2-8080: updating kube resource default-petstore2-8080:112756 (want 112756): admission webhook \"gateway.gloo-system.svc\" denied the request: resource incompatible with current Gloo snapshot: [Validating v1.Upstream failed: 1 error occurred:\n\t* Upstream name:\"default-petstore2-8080\" namespace:\"gloo-system\" references the service \"petstore2\" which does not exist in namespace \"default\"\n\n]","version":"1.11.0-beta7"}
{"level":"error","ts":"2022-04-07T14:55:55.265Z","logger":"gloo-ee.v1.event_loop.setup","caller":"setup/setup_syncer.go:668","msg":"gloo main event loop","version":"1.11.0-beta7","error":"event_loop.gloo: 1 error occurred:\n\t* writing reports: 1 error occurred:\n\t* failed to write status state:Warning reason:\"warning: \\n 1 error occurred:\\n\\t* Upstream name:\\\"default-petstore2-8080\\\" namespace:\\\"gloo-system\\\" references the service \\\"petstore2\\\" which does not exist in namespace \\\"default\\\"\\n\\n\" reported_by:\"gloo\" for resource default-petstore2-8080: updating kube resource default-petstore2-8080:112756 (want 112756): admission webhook \"gateway.gloo-system.svc\" denied the request: resource incompatible with current Gloo snapshot: [Validating v1.Upstream failed: 1 error occurred:\n\t* Upstream name:\"default-petstore2-8080\" namespace:\"gloo-system\" references the service \"petstore2\" which does not exist in namespace \"default\"\n\n]\n\n\n\n","errorVerbose":"1 error occurred:\n\t* writing reports: 1 error occurred:\n\t* failed to write status state:Warning reason:\"warning: \\n 1 error occurred:\\n\\t* Upstream name:\\\"default-petstore2-8080\\\" namespace:\\\"gloo-system\\\" references the service \\\"petstore2\\\" which does not exist in namespace \\\"default\\\"\\n\\n\" reported_by:\"gloo\" for resource default-petstore2-8080: updating kube resource default-petstore2-8080:112756 (want 112756): admission webhook \"gateway.gloo-system.svc\" denied the request: resource incompatible with current Gloo snapshot: [Validating v1.Upstream failed: 1 error occurred:\n\t* Upstream name:\"default-petstore2-8080\" namespace:\"gloo-system\" references the service \"petstore2\" which does not exist in namespace \"default\"\n\n]\n\n\n\n\nevent_loop.gloo\ngithub.com/solo-io/go-utils/errutils.AggregateErrs\n\t/go/pkg/mod/github.com/solo-io/go-utils@v0.21.24/errutils/aggregate_errs.go:19\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371","stacktrace":"github.com/solo-io/gloo/projects/gloo/pkg/syncer/setup.RunGlooWithExtensions.func6\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.11.0-beta11/projects/gloo/pkg/syncer/setup/setup_syncer.go:668"}
```
update: if we don't run the route replacement sanitizer then we don't hit this issue: https://github.com/solo-io/gloo/blob/5ef15af3b91234b76588216cf21ee364f4af919c/projects/gloo/pkg/syncer/sanitizer/route_replacing_sanitizer.go#L168 (i.e., return the xDS snapshot and a nil error there)
update: old config also gets stuck. For example, after deleting the service but before rolling the pods, apply:
```yaml
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: default
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - '*'
    routes:
      - matchers:
          - exact: /all-pets
        options:
          prefixRewrite: /api/pets
        routeAction:
          single:
            upstream:
              name: default-petstore-8080
              namespace: gloo-system
      - matchers:
          - exact: /all-pets2
        options:
          prefixRewrite: /api/pets
        routeAction:
          single:
            upstream:
              name: default-petstore2-8080
              namespace: gloo-system
      - matchers:
          - exact: /all-pets3
        options:
          prefixRewrite: /api/pets
        routeAction:
          single:
            upstream:
              name: default-petstore3-8080
              namespace: gloo-system
---
apiVersion: v1
kind: Service
metadata:
  name: petstore3
  namespace: default
  labels:
    service: petstore
spec:
  ports:
    - port: 8080
      protocol: TCP
  selector:
    app: petstore
---
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: default-petstore3-8080
  namespace: gloo-system
spec:
  kube:
    selector:
      app: petstore
    serviceName: petstore3
    serviceNamespace: default
    servicePort: 8080
```
Also highly relevant to the initial ask here of persisting xDS config: this is/was made much harder because we did not follow "accept interfaces, return structs" (https://bryanftan.medium.com/accept-interfaces-return-structs-in-go-d4cab29a301b).
We may want to investigate a refactor to make the implementation more future-proof.
This may still be desirable to do depending on how hard it is to rewrite gateway translation to never fail once the gloo and gateway pods merge; fyi @elcasteel @sam-heilbron @nfuden
the code I wrote has been pushed to these branches
This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.
Version
No response
Is your feature request related to a problem? Please describe.
we need to ensure Gloo Edge is reliable across all pod restarts and invalid configuration
the shortest path to ensuring a reliable xDS configuration is always being served is to write the last acked xDS snapshot to persistent storage, and load that if Gloo translation is unable to complete (related: https://github.com/solo-io/gloo/issues/6114)
Describe the solution you'd like
Alternative solution: write the last xDS cache to persistent storage. Size: M, Risk: M.
Describe alternatives you've considered
No response
Additional Context
Downside: this breaks multitenancy, which may or may not be a product requirement in all Gloo deployments.