solo-io / gloo

The Feature-rich, Kubernetes-native, Next-Generation API Gateway Built on Envoy
https://docs.solo.io/
Apache License 2.0

Write last xds snapshot to persisted storage #6115

Open kdorosh opened 2 years ago

kdorosh commented 2 years ago

Version

No response

Is your feature request related to a problem? Please describe.

we need to ensure Gloo Edge is reliable across all pod restarts and invalid configuration

the shortest path to ensuring that reliable xds configuration is always served is to write the last acked xds snapshot to persistent storage, and load it if gloo translation is unable to complete (related: https://github.com/solo-io/gloo/issues/6114)

Describe the solution you'd like

alternative solution: write last xds cache to persistent storage. Size M, Risk M

Describe alternatives you've considered

No response

Additional Context

downside: breaks multitenancy, which may or may not be a product requirement in all gloo settings deployments

kdorosh commented 2 years ago

per discussion with @nrjpoddar and @kcbabo , the preferred long term solution is https://github.com/solo-io/gloo/issues/6114

That change is larger and riskier; in the meantime we will add this support (temporarily), then deprecate and remove it once the other feature is implemented and well tested in the field.

chrisgaun commented 2 years ago

@kdorosh we need to consider the implications of adding private keys - certificates when not using SDS - to a PV. They would like to have the persistence in HA Redis. This can be follow up work.

kdorosh commented 2 years ago

@chrisgaun as noted earlier, the preferred long-term solution is https://github.com/solo-io/gloo/issues/6114 so all state is stored in etcd.

In the meantime, an encrypted volume (e.g. https://kubernetes.io/docs/concepts/storage/storage-classes/#aws-ebs) may be acceptable.

We could explore HA Redis, but that seems similar to making xds-relay HA, which might be preferable. That said, https://github.com/solo-io/gloo/issues/6114 is still preferred in my opinion.

kdorosh commented 2 years ago

related blocker I ran into while doing the work: https://github.com/solo-io/solo-kit/issues/461

kdorosh commented 2 years ago

related: https://github.com/solo-io/gloo/issues/5022

steps to reproduce:

update: if we don't run the route replacement sanitizer then we don't have this issue https://github.com/solo-io/gloo/blob/5ef15af3b91234b76588216cf21ee364f4af919c/projects/gloo/pkg/syncer/sanitizer/route_replacing_sanitizer.go#L168 (i.e., return xds snapshot and nil error here)

update: old config is also stuck, e.g. after deleting the service but before rolling pods, apply:

apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: default
  namespace: gloo-system
spec:
  virtualHost:
    domains:
    - '*'
    routes:
    - matchers:
      - exact: /all-pets
      options:
        prefixRewrite: /api/pets
      routeAction:
        single:
          upstream:
            name: default-petstore-8080
            namespace: gloo-system
    - matchers:
      - exact: /all-pets2
      options:
        prefixRewrite: /api/pets
      routeAction:
        single:
          upstream:
            name: default-petstore2-8080
            namespace: gloo-system
    - matchers:
      - exact: /all-pets3
      options:
        prefixRewrite: /api/pets
      routeAction:
        single:
          upstream:
            name: default-petstore3-8080
            namespace: gloo-system
---
apiVersion: v1
kind: Service
metadata:
  name: petstore3
  namespace: default
  labels:
    service: petstore
spec:
  ports:
  - port: 8080
    protocol: TCP
  selector:
    app: petstore
---
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: default-petstore3-8080
  namespace: gloo-system
spec:
  kube:
    selector:
      app: petstore
    serviceName: petstore3
    serviceNamespace: default
    servicePort: 8080

kdorosh commented 2 years ago

also highly relevant to the initial ask here of persisting xds config; this is/was made much harder because we did not do this https://bryanftan.medium.com/accept-interfaces-return-structs-in-go-d4cab29a301b
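
For readers unfamiliar with the guideline in that article, a minimal sketch of "accept interfaces, return structs" applied to snapshot persistence. All names here (`SnapshotWriter`, `MemoryStore`, `Publish`) are hypothetical and do not come from the Gloo codebase. The consumer depends only on a small interface, while constructors return concrete types, so a persistent store could later be swapped in without touching the consumer.

```go
package main

import "fmt"

// Snapshot is a hypothetical stand-in for an xds snapshot.
type Snapshot struct{ Version string }

// SnapshotWriter is the narrow interface the consumer actually needs.
type SnapshotWriter interface {
	Write(s Snapshot) error
}

// MemoryStore is one concrete implementation; a disk-backed or Redis-backed
// store could satisfy the same interface.
type MemoryStore struct{ Last Snapshot }

// NewMemoryStore returns the concrete struct, not an interface.
func NewMemoryStore() *MemoryStore { return &MemoryStore{} }

// Write records the snapshot as the most recent one.
func (m *MemoryStore) Write(s Snapshot) error {
	m.Last = s
	return nil
}

// Publish accepts the interface, so it works with any store implementation.
func Publish(w SnapshotWriter, s Snapshot) error {
	if s.Version == "" {
		return fmt.Errorf("snapshot has no version")
	}
	return w.Write(s)
}

func main() {
	store := NewMemoryStore()
	if err := Publish(store, Snapshot{Version: "42"}); err != nil {
		fmt.Println("publish failed:", err)
		return
	}
	fmt.Println("last written version:", store.Last.Version)
}
```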

we may want to investigate a refactor to make the implementation more future-proof

kdorosh commented 2 years ago

This may still be desirable to do depending on how hard it is to rewrite gateway translation to never fail once the gloo and gateway pods merge; fyi @elcasteel @sam-heilbron @nfuden

the code I wrote has been pushed to these branches

github-actions[bot] commented 3 months ago

This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.