googleforgames / agones

Dedicated Game Server Hosting and Scaling for Multiplayer Games on Kubernetes
https://agones.dev
Apache License 2.0

RFC: In-place Agones Upgrades #3766

Open zmerlynn opened 3 months ago

zmerlynn commented 3 months ago

[!NOTE] Request for comments - please provide feedback by 2024/05/01!

Overview

In this issue, I provide a roadmap to implement in-place upgrades in Agones. An in-place upgrade is the ability to use kubectl apply of a new configuration, on an Agones cluster with active Agones objects, crucially GameServer and Fleet objects, while supporting active allocations. We don’t support in-place upgrades today, as noted in #1742, #2843, and probably many other issues.

We define “upgrade” here more liberally than just version upgrades: Agones supports Feature Gates, and changing a feature gate can be analogous to a version upgrade, as changes to the gate may enable (or disable) new parts of the API. Additionally, other parts of the configuration, such as the sdk-server resources, affect how GameServer objects are rendered, so updates to that configuration also need to be considered.

Requirements / Constraints

I propose the following requirements and constraints:

UI/UX

  1. The system must function normally during an upgrade, from a user's perspective

    • Should not delete / interrupt existing GameServers or Fleets.
    • Allocated GameServers must not be disrupted by an Agones upgrade
      • FleetAutoscaler buffers should be maintained, etc.
    • GameServer allocations must continue during an Agones upgrade
  2. Upgrading via kubectl apply must be supported.

    • Why? Our end-to-end tests focus exclusively on helm installs, but our users often use a kubectl apply based lifecycle, using helm as the renderer (skaffold is a common flow for this as well).
    • Pruning using kubectl apply is not well supported, and we will need to monitor this carefully if we make deployment changes. We haven’t actually needed to prune between versions yet, but we should consider embracing something like ApplySet. However, this can be a later follow-on and may be easier to implement in an operator.
    • Using helm will still be preferable, as it allows for e.g. rollback behavior if the new version fails to install. Here we are simply saying that apply is a requirement.
  3. Helm is still the renderer of choice - in particular, modifying and rendering manifests should be done using helm, e.g. our docs.

    • In particular, just changing the version of the image in the manifest is not a recommended upgrade path - it may work sometimes, but the install should be re-rendered between versions. (Upgrading by applying our released install.yaml is supported - this is rendered every release.)
    • Relatedly, the arguments to our deployed binaries (e.g. agones-controller) are not a stable interface and may change between versions.
    • Why? We need to be able to make deployment changes - we did this with the HA Agones work, for instance.
  4. We support only two configurations of Agones running at the same time, i.e. 1.{N} -> 1.{N+1} or on rollback, 1.{N} -> 1.{N-1} (this expresses a version change, but feature gate or configuration changes are similar). Deploying a third configuration while there are still mixed objects of the other two is not supported. This includes the sdk-server, meaning it’s not supported to do rapid-fire upgrades without waiting for Fleets to be patched to the new version and all allocated GameServer sessions to complete.

    • Example:
      • 1.40 is installed
      • An upgrade to 1.41 fails, leaving the cluster in an indeterminate state
      • It is not supported to attempt to upgrade to 1.42 without first rolling back 1.41.
    • We will have optional automation to reconcile all Fleets to the new configuration, plus documentation for how to do it manually.
    • Why? It’s hard enough to reason about two versions. Mixing three versions is mass hysteria.
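The two-configuration constraint in item 4 can be sketched as a simple pre-flight check. This is only an illustration: the function name and the idea of collecting a set of "live" configuration labels from existing objects are assumptions, not part of the proposal.

```python
from __future__ import annotations


def can_deploy(target: str, live_configs: set[str]) -> bool:
    """Return True if deploying `target` leaves at most two configurations live.

    `live_configs` is a hypothetical set of configuration labels that
    existing Agones objects were rendered under, e.g. {"1.40", "1.41"}.
    """
    return len(live_configs | {target}) <= 2


# After a failed 1.40 -> 1.41 upgrade, objects from both versions may exist.
mixed = {"1.40", "1.41"}
print(can_deploy("1.42", mixed))  # a third configuration: not supported
print(can_deploy("1.40", mixed))  # rollback stays within two configurations
```

A feature gate or configuration change would count as a distinct label in the same way a version does.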

Version Upgrades

  1. For each release after we support upgrades, we will publish a supported upgrade horizon for that version. We may move the upgrade horizon forward as necessary, but we will always support an upgrade horizon of at least one version. (Initially, we may only support an upgrade horizon of one version, but at the very least we should get in the habit of publishing it.)
    • Larger horizons are more user friendly, but if upgrades are truly non-disruptive, it may encourage users to upgrade more frequently. We’ll need to find a balance between “complexity to maintainers” and “simplicity to users”.
    • Example:
      • When 1.50 is released, we publish an upgrade horizon of 1.45, meaning that any version in [1.45, 1.50) can be upgraded to 1.50.
      • However, to support an intricate migration, we may say that 1.51 has an upgrade horizon of 1.50.
      • In this example, to upgrade from 1.45 to 1.51 would require an upgrade from 1.45 -> 1.50, then 1.50 -> 1.51.
    • Why? Supporting an indefinitely long upgrade horizon often results in intractable complexity. Separately, forcing “landing pad” releases (such as 1.50 in the above example) solves some migration issues, e.g. you can write API-storage migration code in one release and then rip out the migration in a future release after moving the horizon forward.
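The horizon example above (1.45 -> 1.50 -> 1.51) can be worked through in code. This is a minimal sketch, assuming horizons are published as a mapping from each release to the oldest version that may upgrade directly to it; `plan_upgrade` and the mapping format are hypothetical.

```python
from __future__ import annotations


def parse(version: str) -> tuple[int, int]:
    """Parse "1.45" into (1, 45) so that 1.5 and 1.50 are not confused."""
    major, minor = version.split(".")
    return int(major), int(minor)


def plan_upgrade(current: str, target: str, horizons: dict[str, str]) -> list[str]:
    """Return the releases to install, in order, to get from current to target.

    `horizons[r]` is the oldest version that can upgrade directly to release r.
    """
    path: list[str] = []
    cur, goal = parse(current), parse(target)
    while cur < goal:
        # Releases newer than cur, no newer than the goal, whose horizon admits cur.
        candidates = [
            r for r in horizons
            if cur < parse(r) <= goal and parse(horizons[r]) <= cur
        ]
        if not candidates:
            raise ValueError(f"no supported upgrade from {cur} toward {target}")
        step = max(candidates, key=parse)  # take the largest supported hop
        path.append(step)
        cur = parse(step)
    return path


horizons = {"1.50": "1.45", "1.51": "1.50"}
print(plan_upgrade("1.45", "1.51", horizons))
```

With the horizons from the example, the plan passes through 1.50 as the "landing pad" release before reaching 1.51.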

Feature Gates

  1. We must support enabling Feature Gates during configuration changes.

    • Feature gate enablement may/will occur during a version upgrade (e.g. a feature is promoted to Beta in the version being upgraded to), or because the feature gate is explicitly enabled.
    • Why? This is table stakes for the feature, as we commonly promote feature gates between versions.
      • I considered requiring a flow where we upgrade versions first then upgrade feature gates, but unless/until we have an operator, a phased upgrade is complicated for users.
  2. Upgrading a configuration running Alpha Feature Gates may result in lost configuration and/or disable the feature.

    • Why? “Support for this feature may be dropped at any time without notice. The API may change in incompatible ways in a later software release without notice.” [source]. We will do our best to keep Alpha APIs consistent, but the safest way to change Alpha APIs is to abandon the original name/structure, resulting in a loss of the configuration. We will not write CRD migration code for Alpha features.
    • Breaking changes in Alpha features will be called out in upgrade documentation, alongside the upgrade horizon, allowing operators to plan their upgrades.
  3. Disabling a Beta Feature Gate may result in lost configuration and/or disable the feature.

    • Why? Similar to above, if a Beta gate is disabled, the feature capabilities and configuration may be lost entirely.
    • We will write CRD migration code for Beta features if APIs change. This is a change from the existing documentation that says we won’t. I am proposing that we raise the bar for Beta.
  4. Newly enabled API features may be unreliable until an upgrade is fully rolled out to the Agones Control Plane. Some features may also rely on a full sdk-server rollout (Counts & Lists is a great example).

    • Newly enabled API features may “flap” as old-version controllers remove the unknown fields.
    • Why? A common problem with API-related features is that one replica knows about the API and another does not. The general user guidance is to avoid using new APIs until the upgrade is finished.
      • This constraint lets us be slightly lazy - otherwise we would need a consensus system before we used a new feature gate.
      • Note that this same problem exists in kube-apiserver, and e.g. GKE has ~30m regional cluster upgrades - it has not been a big issue.
  5. Logic for newly enabled features must assume other controller replicas may not enforce the same logic until the upgrade is fully rolled out.

    • Why? Similar to the API case above, within a single replica you can assume consistency, but until an upgrade is complete, other replicas may run with different configurations. Even in the case of agones-controller features where the controller is leader-elected, you have to assume that the leader may be lost (eviction, crash, etc.)
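The "flapping" described in item 4 can be illustrated with a toy model. This sketch assumes the old replica round-trips objects through a schema that only contains the fields it knows; the field names (`foo` in particular) and both functions are hypothetical and do not reflect actual Agones controller code.

```python
# Fields an old-version controller knows about (hypothetical names).
OLD_KNOWN_FIELDS = {"name", "state"}


def old_controller_round_trip(stored: dict) -> dict:
    """Decode into the fields the old replica knows, then write back."""
    return {k: v for k, v in stored.items() if k in OLD_KNOWN_FIELDS}


def new_controller_round_trip(stored: dict) -> dict:
    """The new replica defaults the field when it finds it missing."""
    updated = dict(stored)
    updated.setdefault("foo", 0)
    return updated


obj = {"name": "gs-1", "state": "Ready", "foo": 3}
after_old = old_controller_round_trip(obj)        # "foo" is dropped
after_new = new_controller_round_trip(after_old)  # "foo" returns, value lost
print(after_old, after_new)
```

Until every replica runs the new configuration, either one may handle the next update, which is why the guidance is to avoid new APIs until the rollout is complete.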

SDK proto / sdk-server

This section is specifically about the SDK proto definitions, the interchange between the Client SDKs and the sdk-server. We discuss the Client SDKs below.

  1. In order to allow compatibility between game server binaries and sdk-server, a game server binary using Beta and Stable SDK protos must remain compatible with a newer sdk-server. Our SDK compatibility contract is: If your game server uses a non-deprecated Stable API, your binary will be compatible for 10 releases (~1y) starting from the SDK version packaged - e.g. if the game server uses non-deprecated APIs in the 1.40 SDK, it will be compatible through the 1.50 sdk-server. Stable APIs will almost certainly be compatible beyond 10 releases, but 10 releases is guaranteed. Similarly, using a non-deprecated Beta API, your binary will be compatible for 5 releases (~6mo).

    • We expect deprecations of Stable APIs to be exceedingly rare and thoroughly documented. The deprecation timer will start when the new APIs are available in Stable, with migration guides from the old APIs - SDK proto message names for the old/new functionality will remain distinct as well.
    • Beta deprecations will occur on graduation to Stable, which will involve introducing the APIs to Stable and marking the Beta APIs as deprecated. This deprecation follows the guarantee above, meaning the graduated APIs will remain in both Stable and Beta for 5 releases (6mo).
    • Why? SDK proto compatibility is crucial to reducing toil on our users. We don’t want to require users to redeploy binaries every release (6w!), and we would like to encourage upgrades as much as possible.
  2. Alpha SDK APIs/RPCs are subject to change between releases - a game server binary using Alpha SDKs may not be compatible with a newer sdk-server. In Alpha, incompatible changes retaining the same SDK proto message name are allowed. When we make incompatible Alpha changes, we will document the APIs involved.

    • We will emphasize in the Alpha APIs that any use of Alpha APIs may prevent upgrades - at the very least, it will require you to pay attention to releases.
    • Since we aren’t guaranteeing proto compatibility between releases for Alpha SDK protos, there is no need to overlap Alpha/Beta on graduation. When Alpha APIs are graduated to Beta, the Alpha APIs will be deleted, with no overlapping release.
    • Why? Alpha means Alpha. We love testers - that’s great! But we need to be able to change Alpha APIs/RPCs.
  3. We will document the maturity level and history of each Stable/Beta proto message, i.e.: Foo (Stable in 1.40+; Beta in 1.38, deprecated 1.40, removed 1.45)

    • This is probably best presented as a table.
    • Why? This allows users to easily understand if they are adopting an API that can’t be rolled back from. For example, with Foo above, the API is Stable in 1.40 - which means it’s not present in the Stable surface at all before 1.40, meaning they are pinned to 1.40+, though they could use the Beta surface to retain compatibility in [1.38, 1.44].
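The compatibility contract in item 1 amounts to simple arithmetic on release numbers. The sketch below assumes each release bumps the minor version by one and the major version stays fixed; `guaranteed_through` is a hypothetical helper, not an Agones API.

```python
from __future__ import annotations

# Guaranteed compatibility windows from the contract above, in releases.
GUARANTEED_RELEASES = {"stable": 10, "beta": 5}


def guaranteed_through(sdk_version: str, maturity: str) -> str | None:
    """Newest sdk-server version guaranteed to work, or None for Alpha."""
    window = GUARANTEED_RELEASES.get(maturity)
    if window is None:
        return None  # Alpha: no compatibility guarantee between releases
    major, minor = (int(part) for part in sdk_version.split("."))
    return f"{major}.{minor + window}"


print(guaranteed_through("1.40", "stable"))
print(guaranteed_through("1.40", "beta"))
print(guaranteed_through("1.40", "alpha"))
```

The guarantee only applies to non-deprecated APIs, so a real check would also need the per-message maturity history proposed in item 3.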

Client SDKs

This section is specifically about the Client SDKs, the per-language SDK. We discuss the SDK proto surface above.

  1. Alpha Client SDK APIs are subject to change between releases at any time. Beta Client SDK APIs should be considered generally stable, but we may alter the parameters/returns if we need to. Stable SDK APIs require a 5 release (6mo) deprecation cycle to remove, using a similar procedure to Stable SDK protos (make new API available, document migration, start timer, remove deprecated API).

    • Example: #3738 changes the returns of certain functions in the Go SDK. This change is fine in Alpha, should be rare in Beta but still allowed. However, if this API had made it to Stable, we would instead introduce new functions, deprecate the old, and remove them after 6mo.
    • Why? The burden of API stability is different for the Client SDK - for one, the persona doing the update is different in that Client SDK updates are usually handled by a developer, whereas most of the rest of this doc is related to changes made by a cluster administrator. We don’t want to burden developers too much or they will never update their SDK - but we also need to be able to modify it.
  2. Users are not required to update Client SDK in game server binaries except for SDK proto deprecations, or breaking Alpha API changes. We foresee that updating Client SDKs in game server binaries will be yearly toil, if using Stable APIs, or semiannual toil, if using any Beta API; this toil can be as simple as verifying that there were no deprecations in the period involved. Given how stable our core APIs have been, it may be possible to go multiple years without updating the Client SDK.

Configuration / Helm Values

  1. The names of feature gates will be usable indefinitely in Agones binaries, as long as they match the new forever-value. This is a change from today, where we reject gates that are not “active”. We will keep a map of retired gate names and whether they were retired true (every Stable gate) or false (retracted Alpha gates). If we see an old gate name with the right value, we will log a warning and continue - if the value does not match, we still reject it.

    • Why? This will allow manifests that include old gates to remain working at new versions, as long as the enablement value matches. (We will continue to fail brightly in the case that something is wrong, though.)
  2. Helm values.yaml structure may be altered, but within the upgrade horizon, the old structure should still be usable and implicitly migrated.

    • Why? We need to be able to refactor values.yaml as needs change, so it can’t be a forever contract. However, we can give users some time to migrate.
    • TBD: It looks like Helm still has no capability to warn. Filed https://github.com/helm/helm/issues/12937 to ask about this - there’s a suggested alternate route in that issue as well.
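The retired-gate handling in item 1 could look roughly like the following. This is a sketch under stated assumptions: the gate names, the two tables, and `validate_gates` are all hypothetical, and the actual Agones binaries are written in Go.

```python
from __future__ import annotations

import logging

log = logging.getLogger("featuregates")

# Hypothetical gate tables. A retired gate maps to its "forever-value":
# True for gates that reached Stable, False for retracted Alpha gates.
ACTIVE_GATES = {"ExampleAlphaFeature"}
RETIRED_GATES = {"GraduatedFeature": True, "RetractedFeature": False}


def validate_gates(requested: dict[str, bool]) -> dict[str, bool]:
    """Return the active gates to apply, rejecting invalid settings."""
    active: dict[str, bool] = {}
    for name, value in requested.items():
        if name in ACTIVE_GATES:
            active[name] = value
        elif name in RETIRED_GATES:
            if value != RETIRED_GATES[name]:
                raise ValueError(f"gate {name} is retired as {RETIRED_GATES[name]}")
            # Right value for a retired gate: warn and continue.
            log.warning("gate %s is retired and can be removed", name)
        else:
            raise ValueError(f"unknown feature gate {name}")
    return active


print(validate_gates({"GraduatedFeature": True, "ExampleAlphaFeature": True}))
```

An old manifest that still sets a graduated gate to true keeps working, while setting it to false still fails loudly.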

Milestones

The project will be split up into the following milestones, each described in separate issues.

Alpha

At Alpha support level, applying an upgrade will upgrade the Agones Control Plane. However, Fleets will still use the old render configuration (sdk-server version, etc.) until updated manually; we will publish instructions on how to update Fleets to the new sdk-server.

  1. SDK Compatibility: Ensure that the sdk-server sidecar remains compatible within the upgrade horizon for Beta and Stable APIs.

  2. Storage Compatibility: Ensure that a controller at one configuration can read objects that were stored by a controller in a different configuration. Usually this will be the result of a feature gate change between configurations, i.e. FeatureFoo adds the Foo field to GameServer - if one configuration has the gate enabled and the other does not, we need to reason about object storage / compatibility. We will also write developer policies regarding API feature progression.

  3. Testing: We need automated testing before we should consider this Alpha.

Beta

  1. Versioned Configuration: For any configuration relevant to how the controller renders Pods, we will move the configuration to a ConfigMap that also keeps a generation number. This will allow us to tie rendered elements below Fleet to the specific configuration they were rendered under. We will then (optionally) roll out the new rendered configuration across all Fleet objects, or provide a simple workflow to do so.

  2. JSON schema for helm: We should add a JSON schema for our Helm chart. If we've made a breaking change in our Helm charts, this will make it more obvious to our users. (TBD: It could also potentially give us a way to enforce version horizons, since we could intentionally change schema elements in a way that there would be an error if you tried to skip too far at once.)
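For the versioned configuration idea in item 1, tying rendered objects to a generation number makes it straightforward to find Fleets that still need a rollout. The sketch below assumes each Fleet somehow records the generation it was rendered under; how that is stored is not specified here, and `stale_fleets` and the Fleet names are hypothetical.

```python
from __future__ import annotations


def stale_fleets(fleet_generations: dict[str, int], current: int) -> list[str]:
    """Names of Fleets rendered under a generation other than `current`."""
    return sorted(
        name for name, generation in fleet_generations.items()
        if generation != current
    )


# Hypothetical state: the ConfigMap is at generation 7 after an upgrade.
rendered_under = {"fleet-a": 7, "fleet-b": 6, "fleet-c": 6}
print(stale_fleets(rendered_under, current=7))
```

An optional automation could walk this list and re-render each Fleet, and an empty list would signal that the cluster is back to a single configuration.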

igooch commented 3 months ago

Are we requiring that all changes (for Feature Gate, SDK Server, SDK Client, Config) must go from Alpha -> Beta -> Stable? Or are we allowing for changes to go Alpha -> Stable (skipping Beta), or Beta -> Stable (skipping Alpha)?

zmerlynn commented 3 months ago

Are we requiring that all changes (for Feature Gate, SDK Server, SDK Client, Config) must go from Alpha -> Beta -> Stable? Or are we allowing for changes to go Alpha -> Stable (skipping Beta), or Beta -> Stable (skipping Alpha)?

I think Beta -> Stable (skipping Alpha) should be a supported pattern for non-contentious features that we want to allow disablement of (think infrastructure features that might break things, but we think are ok to default on).

I would prefer not to support Alpha -> Stable - we don't get a lot of coverage from Alpha testing anyways, so I have trouble imagining a scenario where Alpha -> Stable makes sense.

markmandel commented 3 months ago

LGTM - just a couple of extra thoughts:

UI/UX

The system must function normally during an upgrade, from a user's perspective

Adding an item: "Should not delete / interrupt existing GameServers or Fleets"

(probably implied, but right now the Helm template will delete Fleets, GameServers, etc. - so would be good to be explicit).

Client SDKs

May also want to make a note that if the proto doesn't change, but we update SDKs, then the end user can stay with their current SDK version until such time as they are willing to spend the time to upgrade.

Configuration / Helm Values

As part of this work, should we spend some time making a jsonschema for helm? Or at least start? That would be a step in the right direction of ongoing maintenance, and warnings about deprecation (in theory?).

zmerlynn commented 3 months ago

LGTM - just a couple of extra thoughts:

UI/UX

The system must function normally during an upgrade, from a user's perspective

Adding an item: "Should not delete / interrupt existing GameServers or Fleets"

(probably implied, but right now the Helm template will delete Fleets, GameServers, etc. - so would be good to be explicit).

Good call out, added to UI/UX

Client SDKs

May also want to make a note that if the proto doesn't change, but we update SDKs, then the end user can stay with their current SDK version until such time as they are willing to spend the time to upgrade.

Agreed, added to Client SDKs

Users are not required to update Client SDK in game server binaries except for SDK proto deprecations, or breaking Alpha API changes. We foresee that evaluating updating Client SDKs in game server binaries will be yearly toil, if using Stable APIs, or semiannual toil, if using any Beta API; this toil can be as simple as verifying that there were no deprecations in the period involved. Given how stable our core APIs have been, it may be possible to go multiple years without updating the Client SDK.

Configuration / Helm Values

As part of this work, should we spend some time making a jsonschema for helm? Or at least start? That would be a step in the right direction of ongoing maintenance, and warnings about deprecation (in theory?).

I like it! Added to Config:

JSON schema for helm: We should add a JSON schema for our Helm chart. If we've made a breaking change in our Helm charts, this will make it more obvious to our users. (TBD: It could also potentially give us a way to enforce version horizons, since we could intentionally change schema elements in a way that there would be an error if you tried to skip too far at once.)

1804devs commented 2 months ago

This is the one @markmandel

EricFortin commented 2 months ago

This is exciting!

In UI/UX where you mention an upgrade to 1.41 fails, upgrade to 1.42 without rollback to 1.41 is not supported. Do you mean 1.40? Or does it mean we need to re-apply 1.41 until it succeeds?

I also think we should mention what will happen with metrics during the upgrades. I would be in favor of no disruption 😄

markmandel commented 2 months ago

I also think we should mention what will happen with metrics during the upgrades. I would be in favor of no disruption 😄

Ooh that's a good point. We probably need some sort of guarantees on metrics between versions as well. (hello client-go metrics that just stopped working -- that would mean we need to fix those!).