Among our customers, it is common for different teams to manage their own VirtualServices and routes. Enabling replaceInvalidRoutes ensures that one team's mistakes cannot impair services owned by other teams.
However, if one team manages to produce a configuration that is valid for Gloo but invalid for Envoy, the whole system is affected because snapshot generation is blocked. In this scenario, new proxy instances do not receive a valid snapshot.
Steps to reproduce the bug
Install Gloo with the invalid-route replacement feature enabled:
cat << 'EOF' > values.yaml
gloo:
  settings:
    invalidConfigPolicy:
      invalidRouteResponseBody: Gloo Edge has invalid configuration. Administrators should run `glooctl check` to find and fix config errors.
      invalidRouteResponseCode: 404
      replaceInvalidRoutes: true
EOF
helm upgrade -i gloo glooe/gloo-ee --namespace gloo-system --version 1.10.3 \
--create-namespace --set-string license_key="$LICENSE_KEY" -f values.yaml
Create separate namespaces for the two teams, which use different domains (to avoid this issue):
kubectl create ns team1
kubectl create ns team2
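The report does not include the two manifests. As a rough illustration of the "different domains" setup, a valid team2 VirtualService might look like the hypothetical sketch below. The resource name, the domain team2.example.com, and the direct-response route are assumptions chosen so that no upstream is needed; they are not the contents of the original vs-team2-valid.yaml.

```yaml
# Hypothetical stand-in for vs-team2-valid.yaml: a VirtualService owned by
# team2 that only claims its own domain, so it cannot clash with team1's.
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: team2-vs               # hypothetical name
  namespace: team2
spec:
  virtualHost:
    domains:
      - 'team2.example.com'    # hypothetical domain, distinct from team1's
    routes:
      - matchers:
          - prefix: /
        # Direct response avoids depending on any Upstream for this sketch.
        directResponseAction:
          status: 200
          body: 'hello from team2'
```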
Apply the changes from team1 and team2. Both are accepted, but glooctl check shows that the snapshot was rejected (video 00:25-01:02):
kubectl apply -f vs-team1-broken.yaml -n team1
kubectl apply -f vs-team2-valid.yaml -n team2
glooctl check
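glooctl check reports the rejection from Gloo's side. One way to confirm it from Envoy's side is to read the xDS subscription counters on the Envoy admin /stats endpoint, where each subscription (LDS, CDS, RDS) exposes an update_rejected counter. The helper below is a minimal sketch: the function name rejected_xds_updates is hypothetical, and it assumes the gateway-proxy admin interface is reachable on port 19000.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every Envoy xDS "update_rejected" counter that is
# non-zero. Reads the plain-text output of Envoy's admin /stats endpoint on
# stdin, where each line has the form "stat.name: value".
# Assumed usage (admin port 19000 is an assumption):
#   curl -s localhost:19000/stats | rejected_xds_updates
rejected_xds_updates() {
  # Split each line on ": ", keep stats whose name ends in .update_rejected
  # and whose value is greater than zero, and print them unchanged.
  awk -F': ' '$1 ~ /\.update_rejected$/ && $2 + 0 > 0 { print $1 ": " $2 }'
}
```

For example, with kubectl port-forward -n gloo-system deploy/gateway-proxy 19000:19000 running in another terminal, curl -s localhost:19000/stats | rejected_xds_updates lists only the subscriptions that have recorded a rejected update.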
Now fix it by deleting the offending resource. This appears to work: the snapshot is now accepted by Envoy (video 01:10-01:43):
kubectl delete -f vs-team1-broken.yaml -n team1
Delete the remaining virtual service; this should create a new snapshot with no routes (video 01:47):
kubectl delete -f vs-team2-valid.yaml -n team2
Apply the offending resource again. Envoy rejects it and falls back to the latest stable snapshot, but that snapshot is not the one from the previous step (video 01:59):
kubectl apply -f vs-team1-broken.yaml -n team1
glooctl check
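To see which configuration Envoy is actually running after this step, one option is to capture the subscription version stats before re-applying the broken resource and again afterwards, then compare them. Envoy exposes a version gauge per xDS subscription and, in recent versions, a version_text readout; whether the latter is present depends on the Envoy build, which is an assumption here. The function name xds_versions is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the LDS, CDS and RDS subscription "version"
# gauges (and "version_text" readouts, where Envoy exposes them).
# Reads Envoy admin /stats plain-text output ("stat.name: value") on stdin.
xds_versions() {
  # Keep only lds/cds/rds stats whose name ends in .version or .version_text.
  awk -F': ' '$1 ~ /\.(lds|cds|rds)\./ && $1 ~ /\.version(_text)?$/ { print $1 ": " $2 }'
}
```

For example, run curl -s localhost:19000/stats | xds_versions > before.txt after deleting the team2 resource, repeat into after.txt once the broken resource is re-applied, and diff the two files to see whether the configuration Envoy is serving changed between the steps.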
Expected Behavior
Gloo should ignore any configuration that Envoy rejects, even if that configuration appears valid from Gloo's own perspective.
Gloo Edge Version
1.10.x (latest stable)
Kubernetes Version
1.21.x
Additional Context
vs-team1-broken.yaml mistake explained here
vs-team2-valid.yaml
Video: https://user-images.githubusercontent.com/35881711/153257647-818f35f7-4a0f-48be-8e87-a92ac83b6a5a.mov