solo-io / gloo

The Cloud-Native API Gateway and AI Gateway
https://docs.solo.io/
Apache License 2.0
6 stars 0 forks source link

fix upstream status flicker and constant status updates #10384

Closed lgadban closed 1 day ago

lgadban commented 6 days ago

Description

fix upstream status flicker

When only Kube GW proxies are present, we still rely on the edge translator_syncer for extension syncing. The edge translator will mark Upstreams & UpstreamGroups as Accepted then perform xds translation where status may be changed to e.g. Rejected if there is an error.

However, in the case where there are no edge proxies, translation doesn't actually occur, so any actual errors on the Upstream are never encountered, thus the status is never set to Rejected. We end up in a scenario where the Kube GW syncer (correctly) reports Rejected status while the Edge syncer reports Accepted and they will fight each other indefinitely.

This is fixed by no longer reporting Accepted on Upstream[Groups] unless they are also going to be translated. The result is that an Upstream may have empty status if there is no proxy (either edge of kube gw) present, but this means that only accurate status from translation will be reported.

fix constant status updates

Additionally we noticed endless updates (and webhook hits) for resources tracked in krt.

the status reporter compares the desired status with the existing status in the solo-kit object to determine if it should actually UPDATE the resource.

the current proxy_syncer will do a once per second status sync and relies on this status comparison to be functional to prevent endless object UPDATEs.

this commit fixes the solo-kit objects (really wrappers) in the krt collections to contain the status so an accurate comparison can take place.

test updates

Several tests relied on Accepted status for Upstreams when no translation occurred, so they were updated to not rely on status but just the existence of the resource in question.

glooctl

glooctl does NOT need to be updated as the "no status" case that reports an error is only a problem when the status is nil rather than just an empty status: https://github.com/solo-io/gloo/blob/89a427a6a28491b3b4366a0f2492adad83a4d877/projects/gloo/cli/pkg/cmd/check/root.go#L433-L452

Code changes

This changes the edge translator_syncer to no longer mark Upstream[Group]s as Accepted unless it will also perform translation.

Keep and convert status from k8s objects when converting them to solo-kit types for krt collections

Update broken tests

Checklist:

github-actions[bot] commented 4 days ago

Visit the preview URL for this PR (updated for commit 0f2a5f6):

https://gloo-edge--pr10384-us-status-flicker-8uiiree0.web.app

(expires Sun, 01 Dec 2024 17:00:36 GMT)

🔥 via Firebase Hosting GitHub Action 🌎

Sign: 77c2b86e287749579b7ff9cadb81e099042ef677

solo-changelog-bot[bot] commented 3 days ago

Issues linked to changelog: https://github.com/solo-io/solo-projects/issues/7243 https://github.com/solo-io/solo-projects/issues/7257

solo-changelog-bot[bot] commented 2 days ago

Issues linked to changelog: https://github.com/solo-io/solo-projects/issues/7243 https://github.com/solo-io/solo-projects/issues/7257 https://github.com/solo-io/gloo/issues/10401

lgadban commented 1 day ago

What are the ways that I could test this manually to prove that the flickering no longer happens?

@sam-heilbron the steps to repro should probably have been in https://github.com/solo-io/solo-projects/issues/7243 but the gist of it is also captured in https://github.com/solo-io/gloo/issues/10401

The high-level summary is:

Now the upstream is being translated by kube gw and reporting a warning correctly but the edge syncer will be reporting accepted so they will fight and flicker.

With the changes in this PR that no longer happens as the Edge syncer won't report Accepted for a resource it hasn't translated