Closed lgadban closed 1 day ago
Visit the preview URL for this PR (updated for commit 0f2a5f6):
https://gloo-edge--pr10384-us-status-flicker-8uiiree0.web.app
(expires Sun, 01 Dec 2024 17:00:36 GMT)
🔥 via Firebase Hosting GitHub Action 🌎
Sign: 77c2b86e287749579b7ff9cadb81e099042ef677
Issues linked to changelog: https://github.com/solo-io/solo-projects/issues/7243 https://github.com/solo-io/solo-projects/issues/7257
What are the ways that I could test this manually to prove that the flickering no longer happens?
@sam-heilbron the steps to repro should probably have been in https://github.com/solo-io/solo-projects/issues/7243 but the gist of it is also captured in https://github.com/solo-io/gloo/issues/10401
The high-level summary is:
Gateway
to trigger a kube gw Proxy translation & syncNow the upstream is being translated by kube gw and reporting a warning correctly but the edge syncer will be reporting accepted so they will fight and flicker.
With the changes in this PR that no longer happens as the Edge syncer won't report Accepted for a resource it hasn't translated
Description
fix upstream status flicker
When only Kube GW proxies are present, we still rely on the edge translator_syncer for extension syncing. The edge translator will mark
Upstreams
&UpstreamGroups
asAccepted
then perform xds translation where status may be changed to e.g.Rejected
if there is an error.However, in the case where there are no edge proxies, translation doesn't actually occur, so any actual errors on the
Upstream
are never encountered, thus the status is never set toRejected
. We end up in a scenario where the Kube GW syncer (correctly) reports Rejected status while the Edge syncer reports Accepted and they will fight each other indefinitely.This is fixed by no longer reporting Accepted on Upstream[Groups] unless they are also going to be translated. The result is that an Upstream may have empty status if there is no proxy (either edge of kube gw) present, but this means that only accurate status from translation will be reported.
fix constant status updates
Additionally we noticed endless updates (and webhook hits) for resources tracked in krt.
the status reporter compares the desired status with the existing status in the solo-kit object to determine if it should actually UPDATE the resource.
the current proxy_syncer will do a once per second status sync and relies on this status comparison to be functional to prevent endless object UPDATEs.
this commit fixes the solo-kit objects (really wrappers) in the krt collections to contain the status so an accurate comparison can take place.
test updates
Several tests relied on Accepted status for Upstreams when no translation occurred, so they were updated to not rely on status but just the existence of the resource in question.
glooctl
glooctl does NOT need to be updated as the "no status" case that reports an error is only a problem when the status is nil rather than just an empty status: https://github.com/solo-io/gloo/blob/89a427a6a28491b3b4366a0f2492adad83a4d877/projects/gloo/cli/pkg/cmd/check/root.go#L433-L452
Code changes
This changes the edge translator_syncer to no longer mark
Upstream[Group]s
asAccepted
unless it will also perform translation.Keep and convert status from k8s objects when converting them to solo-kit types for krt collections
Update broken tests
Checklist: