A mistake in one VS can block the whole system from getting updates

jmunozro commented 2 years ago

Gloo Edge Version

1.10.x (latest stable)

Kubernetes Version

1.21.x

Describe the bug

It is common among our customers for different teams to manage their own VS/Routes. Enabling the replace invalid routes feature ensures that one team's mistakes can't impair other teams' services.

However, if one of the teams manages to produce a configuration that is valid for Gloo but invalid for Envoy, the whole system is affected: snapshot generation is blocked. In this scenario, new instances of the proxy won't get a valid snapshot.

Steps to reproduce the bug

Install Gloo with the replace invalid routes feature active

cat << 'EOF' > values.yaml
gloo:
  settings:
    invalidConfigPolicy:
      invalidRouteResponseBody: Gloo Edge has invalid configuration. Administrators should run `glooctl check` to find and fix config errors.
      invalidRouteResponseCode: 404
      replaceInvalidRoutes: true
EOF
helm upgrade -i gloo glooe/gloo-ee --namespace gloo-system --version 1.10.3 \
  --create-namespace --set-string license_key="$LICENSE_KEY" -f values.yaml

Create namespaces for the different teams, which have different domains (to avoid this issue)

kubectl create ns team1
kubectl create ns team2

Apply the changes from team1 and team2. Both are accepted, but glooctl check shows that the snapshot was rejected (video 00:25-01:02)

k apply -f vs-team1-broken.yaml -n team1
k apply -f vs-team2-valid.yaml -n team2
glooctl check

Now fix it: delete the offending resource. It appears to work, and the snapshot is now accepted by Envoy (video 01:10-01:43)

k delete -f vs-team1-broken.yaml -n team1

Delete all remaining services; this should create a new snapshot with no routes (video 01:47)

k delete -f vs-team2-valid.yaml -n team2

Apply the offending service again. Envoy rejects it and falls back to the latest stable snapshot, which, unexpectedly, is not the one from the previous step (video 01:59)

k apply -f vs-team1-broken.yaml -n team1
glooctl check
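
The apply/delete sequence above can be collected into one script. This is a minimal sketch, not part of the original report: the `run` helper, the `repro` function, and the `DRY_RUN` variable are hypothetical names introduced here, and `k` in the steps is assumed to be a shell alias for `kubectl`. By default the script only prints the commands, so it can be reviewed without a cluster; setting `DRY_RUN=0` would execute them against the current kubectl context.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print each command, and run it only when DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

repro() {
  # Both VirtualServices are accepted by Gloo, but Envoy rejects the snapshot.
  run kubectl apply -f vs-team1-broken.yaml -n team1
  run kubectl apply -f vs-team2-valid.yaml -n team2
  # glooctl check may exit non-zero when it reports problems; keep going.
  run glooctl check || true

  # Removing the broken resource lets Envoy accept the snapshot again.
  run kubectl delete -f vs-team1-broken.yaml -n team1

  # Removing the last valid resource should yield a snapshot with no routes.
  run kubectl delete -f vs-team2-valid.yaml -n team2

  # Re-applying the broken resource is the step where the report sees Envoy
  # fall back to a snapshot other than the empty one.
  run kubectl apply -f vs-team1-broken.yaml -n team1
  run glooctl check || true
}

repro
```

Keeping the commands behind a dry-run switch is only a convenience for sharing the reproduction; the order of the steps is what matters.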

Expected Behavior

Gloo should be able to ignore any configuration rejected by Envoy, even if it appears to be valid from Gloo's own perspective.

Additional Context

vs-team1-broken.yaml mistake explained here

apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: vs-team1-broken
spec:
  virtualHost:
    domains:
    - team1.jesus.com
    routes:
      - matchers:
          - prefix: /bad-route
        routeAction:
          single:
            upstream:
              name: default-httpbin-8000
              namespace: gloo-system
        options:
          prefixRewrite: /get
    options:
      ratelimit:
        rateLimits:
        - actions:
          - requestHeaders:
              descriptorKey: deviceId
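
The linked explanation of the mistake is not reproduced in this report. One plausible reading, stated here as an assumption, is that the requestHeaders action sets descriptorKey but no headerName, which Envoy's rate limit action requires. Under that assumption, a rough pre-apply check could look like the sketch below. The `has_header_names` function and the `x-device-id` header are hypothetical, and the check is indentation-based text matching, not a real YAML parser.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical lint: succeed only if every requestHeaders block in the file
# contains a non-empty headerName key.
has_header_names() {
  awk '
    function indent(s) { match(s, /^ */); return RLENGTH }
    function close_block() { if (in_block && !seen) bad = 1; in_block = 0 }
    BEGIN { bad = 0; in_block = 0 }
    /^[ ]*$/ { next }                                  # skip blank lines
    in_block && indent($0) <= block_indent { close_block() }
    /requestHeaders:[ ]*$/ {
      in_block = 1; seen = 0; block_indent = indent($0); next
    }
    in_block && /^[ ]*headerName:[ ]*[^ ]/ { seen = 1 }
    END { close_block(); exit bad }
  ' "$1"
}

tmp="$(mktemp -d)"
trap 'rm -rf "$tmp"' EXIT

# Rate limit fragment copied from vs-team1-broken.yaml.
cat > "$tmp/broken.yaml" << 'EOF'
ratelimit:
  rateLimits:
  - actions:
    - requestHeaders:
        descriptorKey: deviceId
EOF

# Same fragment with a headerName added; the header name is a made-up example.
cat > "$tmp/fixed.yaml" << 'EOF'
ratelimit:
  rateLimits:
  - actions:
    - requestHeaders:
        descriptorKey: deviceId
        headerName: x-device-id
EOF

for f in broken fixed; do
  if has_header_names "$tmp/$f.yaml"; then
    echo "$f.yaml: ok"
  else
    echo "$f.yaml: requestHeaders without headerName"
  fi
done
```

A check like this only catches this one class of mistake; it does not replace the expected behavior requested above, where Gloo itself would discard configuration that Envoy rejects.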

vs-team2-valid.yaml

apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: vs-team2-valid
spec:
  virtualHost:
    domains:
    - team2.jesus.com
    routes:
      - matchers:
          - prefix: /good-route
        routeAction:
          single:
            upstream:
              name: default-httpbin-8000
              namespace: gloo-system
        options:
          prefixRewrite: /get

https://user-images.githubusercontent.com/35881711/153257647-818f35f7-4a0f-48be-8e87-a92ac83b6a5a.mov

github-actions[bot] commented 5 months ago

This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.
