Open zmerlynn opened 3 months ago
Are we requiring that all changes (for Feature Gate, SDK Server, SDK Client, Config) must go from Alpha -> Beta -> Stable? Or are we allowing for changes to go Alpha -> Stable (skipping Beta), or Beta -> Stable (skipping Alpha)?
I think Beta -> Stable (skipping Alpha) should be a supported pattern for non-contentious features that we want to allow disablement of (think infrastructure features that might break things, but we think are ok to default on).
I would prefer not to support Alpha -> Stable - we don't get a lot of coverage from Alpha testing anyways, so I have trouble imagining a scenario where Alpha -> Stable makes sense.
LGTM - just a couple of extra thoughts:
The system must function normally during an upgrade, from a user's perspective
Adding an item: "Should not delete / interrupt existing GameServers or Fleets" (probably implied, but right now the Helm template will delete Fleets, GameServers, etc. - so would be good to be explicit).
May also want to make a note that if the proto doesn't change, but we update SDKs, then the end user can stay with their current SDK version until such time as they are willing to spend the time to upgrade.
As part of this work, should we spend some time making a jsonschema for helm? Or at least start? That would be a step in the right direction of ongoing maintenance, and warnings about deprecation (in theory?).
LGTM - just a couple of extra thoughts:
UI/UX
The system must function normally during an upgrade, from a user's perspective
Adding an item: "Should not delete / interrupt existing GameServers or Fleets" (probably implied, but right now the Helm template will delete Fleets, GameServers, etc. - so would be good to be explicit).
Good call out, added to UI/UX
Client SDKs
May also want to make a note that if the proto doesn't change, but we update SDKs, then the end user can stay with their current SDK version until such time as they are willing to spend the time to upgrade.
Agreed, added to Client SDKs
Users are not required to update Client SDK in game server binaries except for SDK proto deprecations, or breaking Alpha API changes. We foresee that evaluating updating Client SDKs in game server binaries will be yearly toil, if using Stable APIs, or semiannual toil, if using any Beta API; this toil can be as simple as verifying that there were no deprecations in the period involved. Given how stable our core APIs have been, it may be possible to go multiple years without updating the Client SDK.
Configuration / Helm Values
As part of this work, should we spend some time making a jsonschema for helm? Or at least start? That would be a step in the right direction of ongoing maintenance, and warnings about deprecation (in theory?).
I like it! Added to Config:
JSON schema for helm: We should add a JSON schema for our Helm chart. If we've made a breaking change in our Helm charts, this will make it more obvious to our users. (TBD: It could also potentially give us a way to enforce version horizons, since we could intentionally change schema elements in a way that there would be an error if you tried to skip too far at once.)
This is the one, @markmandel
This is exciting!
In UI/UX where you mention an upgrade to 1.41 fails, upgrade to 1.42 without rollback to 1.41 is not supported. Do you mean 1.40? Or does it mean we need to re-apply 1.41 until it succeeds?
I also think we should mention what will happen with metrics during the upgrades. I would be in favor of no disruption 😄
Ooh that's a good point. We probably need some sort of guarantees on metrics between versions as well. (hello client-go metrics that just stopped working -- that would mean we need to fix those!).
Overview
In this issue, I provide a roadmap to implement in-place upgrades in Agones. An in-place upgrade is the ability to run kubectl apply with a new configuration on an Agones cluster with active Agones objects, crucially GameServer and Fleet objects, while supporting active allocations. We don’t support in-place upgrades today, as noted in #1742, #2843, and probably many other issues.
We define “upgrade” here more liberally than just version upgrades: Agones supports Feature Gates, and changing a feature gate can be analogous to a version upgrade, as changes to the gate may enable (or disable) new parts of the API. Additionally, there are other parts of the configuration, such as the sdk-server resources, that are relevant to rendering GameServer objects - as such, updating configuration relevant to how GameServer is rendered also needs to be considered.
Requirements / Constraints
I propose the following requirements and constraints:
UI/UX
The system must function normally during an upgrade, from a user's perspective.
Should not delete / interrupt existing GameServers or Fleets.
GameServers must not be disrupted by an Agones upgrade.
Upgrading via kubectl apply must be supported.
helm installs, but our users often use a kubectl apply based lifecycle, using helm as the renderer (skaffold is a common flow for this as well).
kubectl apply is not well supported, and we will need to monitor this carefully if we make deployment changes. We haven’t actually needed to prune between versions yet, but we should consider embracing something like ApplySet. However, this can be a later follow-on and may be easier to implement in an operator.
helm will still be preferable, as it allows for e.g. rollback behavior if the new version fails to install. Here we are simply saying that apply is a requirement.
Helm is still the renderer of choice - in particular, modifying and rendering manifests should be done using helm, e.g. our docs.
agones-controller) are not a stable interface and may change between versions.
We support only two configurations of Agones running at the same time, i.e. 1.{N} -> 1.{N+1} or, on rollback, 1.{N} -> 1.{N-1} (this expresses a version change, but feature gate or configuration changes are similar). Deploying a third configuration while there are still mixed objects of the other two is not supported. This includes the sdk-server, meaning it’s not supported to do rapid-fire upgrades without waiting for Fleets to be patched to the new version and all allocated GameServer sessions to complete.
Version Upgrades
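The two-configuration rule from the UI/UX requirements can be sketched as a small pre-deploy check. This is only an illustration under stated assumptions: the function name CanDeploy and the idea of collecting the configurations still referenced by live objects into a slice are hypothetical, not part of Agones.

```go
package main

import "fmt"

// CanDeploy reports whether deploying target would keep at most two
// distinct configurations live at once. live lists the configurations
// still referenced by the control plane, Fleets, and GameServers
// (hypothetical input; how it is gathered is out of scope here).
func CanDeploy(live []string, target string) bool {
	seen := map[string]struct{}{target: {}}
	for _, v := range live {
		seen[v] = struct{}{}
	}
	return len(seen) <= 2
}

func main() {
	// Clean cluster on 1.40, upgrading to 1.41: two configurations.
	fmt.Println(CanDeploy([]string{"1.40"}, "1.41"))
	// Mixed 1.40/1.41 objects, jumping to 1.42: a third configuration.
	fmt.Println(CanDeploy([]string{"1.40", "1.41"}, "1.42"))
	// Mixed 1.40/1.41 objects, rolling back to 1.40: still two.
	fmt.Println(CanDeploy([]string{"1.40", "1.41"}, "1.40"))
}
```

The same check would apply to feature gate or configuration changes if each distinct configuration were given its own identifier.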
Feature Gates
We must support enabling Feature Gates during configuration changes.
Upgrading a configuration running Alpha Feature Gates may result in lost configuration and/or disable the feature.
Disabling a Beta Feature Gate may result in lost configuration and/or disable the feature.
Newly enabled API features may be unreliable until an upgrade is fully rolled out to the Agones Control Plane. Some features may also rely on a full sdk-server rollout as well (Counts & Lists is a great example).
Logic for newly enabled features must assume other controller replicas may not enforce the same logic until the upgrade is fully rolled out.
SDK proto / sdk-server
This section is specifically about the SDK proto definitions, the interchange between the Client SDKs and the sdk-server. We discuss the Client SDKs below.
In order to allow compatibility between game server binaries and sdk-server, a game server binary using Beta and Stable SDK protos must remain compatible with a newer sdk-server. Our SDK compatibility contract is: If your game server uses a non-deprecated Stable API, your binary will be compatible for 10 releases (~1y) starting from the SDK version packaged - e.g. if the game server uses non-deprecated APIs in the 1.40 SDK, it will be compatible through the 1.50 sdk-server. Stable APIs will almost certainly be compatible beyond 10 releases, but 10 releases is guaranteed. Similarly, using a non-deprecated Beta API, your binary will be compatible for 5 releases (~6mo).
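As a rough sketch of the compatibility contract above, the guaranteed window can be expressed as a function of the packaged SDK minor version, the sdk-server minor version, and the least mature API the binary uses. The names are hypothetical. Two points are assumptions rather than statements from this proposal: the Beta window is treated as inclusive by analogy with the 1.40 through 1.50 Stable example, and Alpha (or an older sdk-server) is treated as carrying no guarantee beyond an exact version match.

```go
package main

import "fmt"

// Maturity is the least mature SDK API level a game server binary uses.
type Maturity int

const (
	Alpha Maturity = iota
	Beta
	Stable
)

// GuaranteedCompatible reports whether the contract guarantees that a
// binary built against SDK 1.{sdkMinor} works with sdk-server
// 1.{serverMinor}. It says nothing about compatibility beyond the
// guarantee, which may well hold in practice.
func GuaranteedCompatible(sdkMinor, serverMinor int, m Maturity) bool {
	skew := serverMinor - sdkMinor
	if skew < 0 {
		return false // assumption: the contract only covers newer sdk-servers
	}
	switch m {
	case Stable:
		return skew <= 10 // 10 releases (~1y)
	case Beta:
		return skew <= 5 // 5 releases (~6mo)
	default:
		return skew == 0 // assumption: Alpha has no cross-release guarantee
	}
}

func main() {
	fmt.Println(GuaranteedCompatible(40, 50, Stable)) // the 1.40 -> 1.50 example
	fmt.Println(GuaranteedCompatible(40, 51, Stable)) // past the guaranteed window
	fmt.Println(GuaranteedCompatible(40, 45, Beta))
	fmt.Println(GuaranteedCompatible(40, 41, Alpha))
}
```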
Alpha SDK APIs/RPCs are subject to change between releases - a game server binary using Alpha SDKs may not be compatible with a newer sdk-server. In Alpha, incompatible changes retaining the same SDK proto message name are allowed. When we make incompatible Alpha changes, we will document the APIs involved.
We will document the maturity level and history of each Stable/Beta proto message, i.e.:
Foo (Stable in 1.40+; Beta in 1.38, deprecated 1.40, removed 1.45)
Client SDKs
This section is specifically about the Client SDKs, the per-language SDK. We discuss the SDK proto surface above.
Alpha Client SDK APIs are subject to change between releases at any time. Beta Client SDK APIs should be considered generally stable, but we may alter the parameters/returns if we need to. Stable SDK APIs require a 5 release (6mo) deprecation cycle to remove, using a similar procedure to Stable SDK protos (make new API available, document migration, start timer, remove deprecated API).
Users are not required to update Client SDK in game server binaries except for SDK proto deprecations, or breaking Alpha API changes. We foresee that updating Client SDKs in game server binaries will be yearly toil, if using Stable APIs, or semiannual toil, if using any Beta API; this toil can be as simple as verifying that there were no deprecations in the period involved. Given how stable our core APIs have been, it may be possible to go multiple years without updating the Client SDK.
Configuration / Helm Values
The names of feature gates will be usable indefinitely in Agones binaries, as long as they match the new forever-value. This is a change from today, where we reject gates that are not “active”. We will keep a map of retired gate names and whether they were retired true (every Stable gate) or false (retracted Alpha gates). If we see an old gate name with the right value, we will log a warning and continue - if the value does not match, we still reject it.
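The retired-gate behavior described above could look roughly like the following. This is a minimal sketch, not the Agones feature gate code: the gate names, the CheckGate function, and the two maps are hypothetical, and a real implementation would log the warning rather than return it.

```go
package main

import "fmt"

// activeGates holds gates that can still be toggled (hypothetical name).
var activeGates = map[string]struct{}{
	"ExampleActiveGate": {},
}

// retiredGates maps a retired gate name to its forever-value:
// true for gates that reached Stable, false for retracted Alpha gates.
// Both names are hypothetical.
var retiredGates = map[string]bool{
	"ExampleGraduatedGate": true,
	"ExampleRetractedGate": false,
}

// CheckGate returns a warning for a retired gate set to its forever-value,
// and an error for an unknown gate or a retired gate set to the wrong value.
func CheckGate(name string, value bool) (warning string, err error) {
	if _, ok := activeGates[name]; ok {
		return "", nil
	}
	forever, ok := retiredGates[name]
	if !ok {
		return "", fmt.Errorf("unknown feature gate %q", name)
	}
	if value != forever {
		return "", fmt.Errorf("feature gate %q is retired and fixed at %t, got %t", name, forever, value)
	}
	return fmt.Sprintf("feature gate %q is retired and always %t", name, forever), nil
}

func main() {
	cases := []struct {
		name  string
		value bool
	}{
		{"ExampleActiveGate", true},
		{"ExampleGraduatedGate", true},
		{"ExampleGraduatedGate", false},
		{"ExampleRetractedGate", false},
	}
	for _, c := range cases {
		warning, err := CheckGate(c.name, c.value)
		fmt.Printf("%s=%t warning=%q err=%v\n", c.name, c.value, warning, err)
	}
}
```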
Helm values.yaml structure may be altered, but within the upgrade horizon, the old structure should still be usable and implicitly migrated.
Milestones
The project will be split up into the following milestones, each described in separate issues.
Alpha
At Alpha support level, applying an upgrade will upgrade the Agones Control Plane. However, Fleets will still use the old render configuration (sdk-server version, etc.) until updated manually; we will publish instructions on how to update Fleets to the new sdk-server.
SDK Compatibility: Ensure that the sdk-server sidecar remains compatible within the upgrade horizon for Beta and Stable APIs.
Storage Compatibility: Ensure that a controller at one configuration can read objects that were stored by a controller in a different configuration. Usually this will be the result of a feature gate change between configurations, i.e. FeatureFoo adds the Foo field to GameServer - if one configuration has the gate enabled and the other does not, we need to reason about object storage / compatibility. We will also write developer policies regarding API feature progression.
Testing: We need automated testing before we should consider this Alpha.
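To make the Storage Compatibility concern concrete, here is one way the hazard can show up, sketched with plain JSON rather than the real Agones types. The struct names and the string type of Foo are assumptions for illustration only. A component whose Go type lacks the Foo field decodes a stored object, writes it back, and silently drops the field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// specWithoutFoo is the view of a stored object held by a configuration
// that does not know about the hypothetical Foo field.
type specWithoutFoo struct {
	Name string `json:"name"`
}

// RoundTripWithoutFoo decodes a stored object into the Foo-less view and
// re-encodes it, as a naive read-modify-write would.
func RoundTripWithoutFoo(stored []byte) ([]byte, error) {
	var s specWithoutFoo
	if err := json.Unmarshal(stored, &s); err != nil {
		return nil, err
	}
	return json.Marshal(s)
}

func main() {
	// Object written by the configuration with FeatureFoo enabled.
	stored := []byte(`{"name":"gs-1","foo":"bar"}`)
	out, err := RoundTripWithoutFoo(stored)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // prints {"name":"gs-1"}
}
```

Whether this exact failure mode applies depends on how objects are typed and updated in each configuration, which is the kind of reasoning the milestone calls for.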
Beta
Versioned Configuration: For any configuration relevant to how the controller renders Pods, we will move the configuration to a ConfigMap that also keeps a generation number. This will allow us to tie rendered elements below Fleet to the specific configuration they were rendered under. We will then (optionally) roll out the new rendered configuration across all Fleet objects, or provide a simple workflow to do so.
JSON schema for helm: We should add a JSON schema for our Helm chart. If we've made a breaking change in our Helm charts, this will make it more obvious to our users. (TBD: It could also potentially give us a way to enforce version horizons, since we could intentionally change schema elements in a way that there would be an error if you tried to skip too far at once.)