solo-io / gloo

The Feature-rich, Kubernetes-native, Next-Generation API Gateway Built on Envoy
https://docs.solo.io/
Apache License 2.0

Canary Deployments with Gloo Federation #6127

Open guydc opened 2 years ago

guydc commented 2 years ago

Version

1.11.*

Is your feature request related to a problem? Please describe.

Gloo Edge supports in-place canary deployment: multiple control planes can reconcile the same CRs and produce xDS for two distinct data planes.

With Gloo Federation, it should be possible to perform a blue-green deployment that does not create any upgrade risk to existing clusters. Furthermore, Gloo Federation itself should support a blue-green deployment model, where a new federation version can be tested before it assumes control over existing clusters.

Describe the solution you'd like

This can be achieved by deploying an additional gloo-fed instance and creating new edge clusters with the latest gloo-edge version. Traffic is then gradually shifted from the old clusters to the new ones. The canary deployment concepts can be applied to Gloo Federation:

Describe alternatives you've considered

No response

Additional Context

No response

chrisgaun commented 2 years ago

Need estimate or alternatives.

chrisgaun commented 2 years ago

Need to understand level of effort on this one @sam-heilbron

rinormaloku commented 2 years ago

As an alternative, I tested whether we can run two Gloo Federation instances at once, with the second instance running in the opposite cluster (where, again, all clusters need to be registered and all resources deployed). I didn't like the UX of this alternative, hence it is crossed out.

But what I would like to circle back to is: How important is it to deploy gloo fed using the canary pattern?

Gloo Federation reads the Gloo Edge instances running in the clusters, picks up the configuration applied by the user, and applies that configuration in the clusters so that cross-cluster traffic is possible and failover works.

From then on there aren't ongoing changes that Gloo Federation needs to reconcile. If Gloo Fed is down it merely hinders applying a new configuration, but everything that was already applied keeps on working.

Having a pre-prod environment to test upgrading Gloo Federation should be all that's needed.

guydc commented 2 years ago

> How important is it to deploy gloo fed using the canary pattern?

Gloo Fed is a privileged component that controls configuration for multiple edge control planes. I think that the blast radius from a malfunctioning new version can be significant. For example, consider a bug in the orphan termination functionality that erases configuration from all federated clusters, leading to a complete system outage.

There are also inherent compatibility risks when following canary deployment practices for the edge control and data planes in a federated environment. Gloo Fed CRDs and clients may be incompatible with edges that are still running an older version. AFAIK, k8s CRD versioning practices are not applied, breaking changes occur from time to time, and downgrading is difficult in Gloo Edge:

IMHO, the safest way to upgrade a federated environment is:

This scheme is not always feasible, especially when the federation clusters require state synchronization. The next best thing would be to support an in-cluster gloo fed canary deployment.

These solutions would only work if Federated CRDs are properly versioned and deprecated.

> If Gloo Fed is down it merely hinders applying a new configuration, but everything that was already applied keeps on working.

If Gloo Fed is down:

> Having a pre-prod environment to test upgrading Gloo Federation should be all that's needed.

It's not always possible to have a pre-prod environment that completely simulates production.

rinormaloku commented 2 years ago

> It's not always possible to have a pre-prod environment that completely simulates production.

That is an issue.

If Gloo Fed is down: New edges & DR: those are very rare cases with a low likelihood of occurring, unless the feature is used in a way that I haven't seen up to now.

  • Service degrades as the system enters a "read-only" state

The third issue is the most likely to occur, but its impact is negligible. The implementation of Gloo Edge is purposefully different from Istio: Gloo Edge doesn't configure the gateway proxy with endpoints (IP addresses for every pod; a luxury that a service mesh cannot afford as it would cause excessive load on the DNS proxy).

Summary: Gloo Fed will only make changes when you apply Gloo Fed CRDs, or when you change the LoadBalancer service in one of the Gloo instances. (Those changes are not frequent, and at least shouldn't be made while you are updating Gloo Fed.)

Without pre-prod environments, though, there is no alternative but to have some canary deployment approach to reduce the risk.

chrisgaun commented 2 years ago

Can limit the scope to having Gloo Fed backwards compatible with GE.

guydc commented 2 years ago

> Can limit the scope to having Gloo Fed backwards compatible with GE.

Right. For example, the Gloo Mesh Control Plane is compatible with n-1 version relay agents to support rolling upgrade scenarios. Ideally, Gloo Fed should have similar compatibility with Gloo Edge.

Otherwise, some form of protection is required to ensure that the state of n-1 GEs is not corrupted and that GF doesn't run into global failures due to an unexpected GE version under federation.
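To make the idea of an n-1 guard concrete, here is a minimal sketch of a version-skew check that a federation controller could run before touching a cluster. Everything in it is hypothetical: `Version`, `parse_version`, `is_supported` and `partition_clusters` are illustrative names, not Gloo Fed APIs, and treating an edge that is newer than the federation instance as unsupported is an assumption rather than documented behavior.

```python
from typing import Dict, List, NamedTuple, Tuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> Version:
    """Parse '1.11.3', 'v1.11.3' or '1.11.3-beta1' into a Version tuple."""
    core = text.lstrip("v").split("-", 1)[0]  # drop 'v' prefix and pre-release suffix
    major, minor, patch = core.split(".")[:3]
    return Version(int(major), int(minor), int(patch))


def is_supported(fed: Version, edge: Version, skew: int = 1) -> bool:
    """True if edge has the same major version and is at most `skew` minors behind fed.

    Assumption: an edge that is *ahead* of the federation version is rejected.
    """
    return edge.major == fed.major and 0 <= fed.minor - edge.minor <= skew


def partition_clusters(
    fed_version: str, edge_versions: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Split clusters into (safe to reconcile, skipped) by version skew."""
    fed = parse_version(fed_version)
    ok: List[str] = []
    skipped: List[str] = []
    for cluster, version in sorted(edge_versions.items()):
        target = ok if is_supported(fed, parse_version(version)) else skipped
        target.append(cluster)
    return ok, skipped
```

With a check like this, clusters that land in the skipped list would be left untouched (and ideally reported) instead of being reconciled by a federation version that may not understand their CRDs.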

jenshu commented 1 year ago

breakdown of tasks (not necessarily in order):

github-actions[bot] commented 3 months ago

This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.