googleforgames / agones

Dedicated Game Server Hosting and Scaling for Multiplayer Games on Kubernetes
https://agones.dev
Apache License 2.0

Built-in mechanism to monitor whether a rollout via an Agones yaml change is complete #2817

Open roberthbailey opened 1 year ago

roberthbailey commented 1 year ago

Is your feature request related to a problem? Please describe. It would be nice to be able to see whether a rollout initiated via an Agones yaml change has completed. One option would be to compute and show the percentage of game servers using the new image during the rolling update (without needing to write a custom script to parse all get gs output).

Describe the solution you'd like We should investigate whether a Deployment exposes this information in some way. If so, we could copy what they do. If not, we should consider engaging with sig-apps to see if it is something they have thought about and/or plan to add in the future (so that we can be consistent).

Describe alternatives you've considered Leave it as-is and allow each user to write their own script / automation when orchestrating rollouts.

Additional context n/a

markmandel commented 1 year ago

There is a kubectl rollout command that seems to work on Deployments, DaemonSets and StatefulSets.

It looks really handy. I'm not sure how to manage it for CRDs though.

roberthbailey commented 1 year ago

I wonder how much is baked in to kubectl to support those specific resource types. It would be really neat if kubectl rollout worked in a generic fashion, like kubectl scale does with the scale subresource.

markmandel commented 1 year ago

> I wonder how much is baked in to kubectl to support those specific resource types. It would be really neat if kubectl rollout worked in a generic fashion, like kubectl scale does with the scale subresource.

100%. I have a strong feeling it's baked into kubectl. I'm sure there are people we can find and ask though.

gongmax commented 1 year ago

Based on our internal discussion with Cloud Deploy and sig-apps, I'm proposing the design below to achieve the short-term goal of exposing a signal that indicates whether a Fleet rollout is complete. Feedback is appreciated.

Background

Today, Agones Fleet, which is a CRD, doesn’t expose any status to indicate whether a Fleet update is completed. And there is no integration between Agones and Cloud Deploy to show Fleet update progress. To meet customers’ request for Fleet update process tracking, and improve the experience using Agones with Cloud Deploy (which is the recommended path for multi-cluster management for customers migrating from GCGS to Agones + GKE), we should implement some status check for Fleet so users and the CD tools can know whether a rollout of Fleet is completed.

Kstatus is a library that provides tools for checking the status of Kubernetes resources. Specifically, it provides a set of standard conditions that CRDs can adopt to convey the state of reconciliation, and a set of functions that take a single resource and compute its status from the conditions in the resource's status object. The computed status can then be used for status checks by CD tools such as Cloud Deploy (Skaffold). Though it cannot show the percentage of a rollout, Kstatus gives us a way to expose a binary indicator of rollout completion. This doc focuses on the design decisions for integrating Agones with Kstatus.

Solution

To integrate with the Kstatus library, we need to add the Reconciling condition to the Fleet status and update it accordingly. This condition, defined in the Kstatus library, is designed to adhere to the "abnormal-true" pattern. That is, the presence of the condition with the value True means the Fleet is in the process of reconciliation. Absence of the condition, or a condition with the value False, means the latest generation of the Fleet manifest observed by the controller has been fully reconciled with the actual state.

We define a rollout of a Fleet as complete when it has the following characteristics:

In addition, because the Reconciling condition uses the "abnormal-true" pattern (where False is the normal state), we also need to adopt the pattern used by several of the built-in types: an observedGeneration property on the status object, which the controller sets during the reconcile loop to the current metadata.generation value. If a resource's metadata.generation and status.observedGeneration do not match, there are changes that the controller has not yet seen, and therefore has not reconciled. In our case, status.observedGeneration being equal to metadata.generation indicates that the change to Fleet.spec has been seen by the Fleet controller. Kstatus uses it to prevent false negatives on the Reconciling condition, for example when the Fleet spec has been updated but the controller is not running for some reason or has not worked on the Fleet yet. In that case, kstatus treats the resource reconciliation as still in progress even though the Fleet's Reconciling condition is False.
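The interplay between the generation check and the Reconciling condition can be modelled as below. This is a simplified sketch of the behaviour described here, not the kstatus API: `compute_status` is a hypothetical name, resources are plain dictionaries, and other conditions kstatus may evaluate are ignored.

```python
def compute_status(resource: dict) -> str:
    """Return "InProgress" or "Current" for a Fleet-like resource."""
    generation = resource["metadata"]["generation"]
    status = resource.get("status", {})

    # A stale observedGeneration means the controller has not yet seen the
    # latest spec, so the conditions cannot be trusted.
    observed = status.get("observedGeneration")
    if observed is not None and observed != generation:
        return "InProgress"

    # Abnormal-true pattern: only Reconciling=True signals ongoing work.
    for cond in status.get("conditions", []):
        if cond.get("type") == "Reconciling" and cond.get("status") == "True":
            return "InProgress"

    # Condition absent or False, and the generation is up to date.
    return "Current"


# Spec updated (generation 2) but the controller last observed generation 1.
stale = {
    "metadata": {"generation": 2},
    "status": {
        "observedGeneration": 1,
        "conditions": [{"type": "Reconciling", "status": "False"}],
    },
}
stale_result = compute_status(stale)
```

Here the stale resource is reported as in progress even though its Reconciling condition is False, which is the case the observedGeneration check is meant to catch.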

Based on the above criteria, the Fleet controller should set the Reconciling condition of a Fleet to False if all of the following are met:

With the above two conditions combined, and Kstatus checking the freshness of status.observedGeneration, it is guaranteed that all the GameServers associated with the Fleet have been updated to the latest version and all of them are available, i.e. the Fleet deployment is complete. Kstatus will then see the Fleet as "Current", which means the actual state of the Fleet matches the desired state and the reconcile process is considered complete. Otherwise, if either of the above conditions is not met, the Reconciling condition of the Fleet will be set to True, and kstatus will consider the Fleet reconciliation to be still "InProgress". The statuses generated by kstatus can then be used by CD tools such as Cloud Deploy (Skaffold) to expose a binary indicator of rollout completion.

In each syncFleet execution of the Fleet controller, the Reconciling condition will be updated and status.observedGeneration will be set to the value of metadata.generation. No additional load will be introduced to KCP, since we already update the Fleet status on every reconciliation pass.
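The per-sync status update described above could be sketched as follows. This is a hypothetical model rather than the Agones controller code: `sync_fleet_status` is an illustrative name, the Fleet is a plain dictionary, and `rollout_complete` stands in for whatever completion criteria the controller evaluates.

```python
import copy


def sync_fleet_status(fleet: dict, rollout_complete: bool) -> dict:
    """Return a copy of the Fleet with its status refreshed for this sync."""
    updated = copy.deepcopy(fleet)
    status = updated.setdefault("status", {})

    # Record that the controller has now seen the current spec generation.
    status["observedGeneration"] = updated["metadata"]["generation"]

    # Replace any previous Reconciling condition with the current verdict.
    conditions = [
        c for c in status.get("conditions", []) if c.get("type") != "Reconciling"
    ]
    conditions.append(
        {"type": "Reconciling", "status": "False" if rollout_complete else "True"}
    )
    status["conditions"] = conditions
    return updated


# A Fleet whose spec was just changed (generation 2, last observed 1).
fleet = {"metadata": {"generation": 2}, "status": {"observedGeneration": 1}}
mid_rollout = sync_fleet_status(fleet, rollout_complete=False)
finished = sync_fleet_status(mid_rollout, rollout_complete=True)
```

Because both fields are written as part of the status update that already happens on each sync, the sketch performs one status write per pass, in line with the point about not adding load.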

Let’s go through some example scenarios. Note that, to simplify the process and show only what matters, the following examples ignore the delay between Agones starting or terminating a GS and the GS becoming Ready or Deleted. That is, whenever the Fleet controller updates a GSS spec replica count, the new replicas become Ready and the old replicas are deleted before the next Fleet controller sync execution.

Scenario 1: Without Allocated GS

Consider a Fleet whose desired replica count is 1 and which contains a single GSS with one Ready replica. The Fleet's observedGeneration equals its generation, so the Reconciling condition of the Fleet is False. The Fleet’s update strategy is RollingUpdate, with maxSurge and maxUnavailable both set to 25%.

| Stages | Fleet Generation | GSS with Replica counts | Fleet ObservedGeneration | Fleet Reconciling condition value | Fleet status in Kstatus |
| --- | --- | --- | --- | --- | --- |
| Starting state | 1 | GSS 1 (Active): Spec.replica 1, Ready replica 1 | 1 | False | Current |
| Fleet is updated with a new GS image, but the controller has not acted on the change yet | 2 | GSS 1 (Active): Spec.replica 1, Ready replica 1 | 1 | False | InProgress |
| Fleet controller syncs the Fleet, creates the new Active GSS and updates the Spec.replica accordingly | 2 | GSS 2 (Active): Spec.replica 1, Ready replica 0; GSS 1 (Inactive): Spec.replica 0, Ready replica 1 | 2 | True | InProgress |
| Fleet controller syncs the Fleet, new GS is ready and old GS is shut down | 2 | GSS 2 (Active): Spec.replica 1, Ready replica 1; GSS 1 (Inactive): Spec.replica 0, Ready replica 0 | 2 | False | Current |
| Fleet controller syncs the Fleet, Inactive GSS is deleted | 2 | GSS 2 (Active): Spec.replica 1, Ready replica 1 | 2 | False | Current |

Scenario 2: Without Allocated GS

Consider a Fleet whose desired replica count is 10 and which contains a single GSS with 10 Ready replicas. The Fleet's observedGeneration equals its generation, so the Reconciling condition of the Fleet is False. The Fleet’s update strategy is RollingUpdate, with maxSurge and maxUnavailable both set to 25%.

| Stages | Fleet Generation | GSS with Replica counts | Fleet ObservedGeneration | Fleet Reconciling condition value | Fleet status in Kstatus |
| --- | --- | --- | --- | --- | --- |
| Starting state | 1 | GSS 1 (Active): Spec.replica 10, Ready replica 10 | 1 | False | Current |
| Fleet is updated with a new GS image, but the controller has not acted on the change yet | 2 | GSS 1 (Active): Spec.replica 10, Ready replica 10 | 1 | False | InProgress |
| Fleet controller syncs the Fleet, creates the new Active GSS and updates the Spec.replica accordingly | 2 | GSS 2 (Active): Spec.replica 3, Ready replica 0; GSS 1 (Inactive): Spec.replica 8, Ready replica 10 | 2 | True | InProgress |
| Fleet controller syncs the Fleet and updates the Spec.replica accordingly | 2 | GSS 2 (Active): Spec.replica 5, Ready replica 3; GSS 1 (Inactive): Spec.replica 5, Ready replica 8 | 2 | True | InProgress |
| Fleet controller syncs the Fleet and updates the Spec.replica accordingly | 2 | GSS 2 (Active): Spec.replica 8, Ready replica 5; GSS 1 (Inactive): Spec.replica 3, Ready replica 5 | 2 | True | InProgress |
| Fleet controller syncs the Fleet and updates the Spec.replica accordingly | 2 | GSS 2 (Active): Spec.replica 10, Ready replica 8; GSS 1 (Inactive): Spec.replica 0, Ready replica 3 | 2 | True | InProgress |
| Fleet controller syncs the Fleet, new Active GSS is fully scaled up and the old GSS is fully scaled down | 2 | GSS 2 (Active): Spec.replica 10, Ready replica 10; GSS 1 (Inactive): Spec.replica 0, Ready replica 0 | 2 | False | Current |
| Fleet controller syncs the Fleet, Inactive GSS is deleted | 2 | GSS 2 (Active): Spec.replica 10, Ready replica 10 | 2 | False | Current |
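The Spec.replica values in the first sync step of Scenarios 1 and 2 are consistent with the arithmetic sketched below. The rounding rules are assumptions chosen to match those rows (maxSurge rounded up, maxUnavailable rounded down but never below 1); the function `first_step_targets` is hypothetical and is not taken from the Agones controller.

```python
def first_step_targets(replicas: int, max_surge_pct: int = 25,
                       max_unavailable_pct: int = 25) -> tuple[int, int]:
    """Return (new GSS Spec.replica, old GSS Spec.replica) after the first sync."""
    # Assumption: the surge percentage is rounded up to a whole replica.
    surge = (replicas * max_surge_pct + 99) // 100
    # Assumption: the unavailable percentage is rounded down, with a floor
    # of 1 so that a rollout can always make progress.
    unavailable = max(1, (replicas * max_unavailable_pct) // 100)

    new_gss_target = min(replicas, surge)
    old_gss_target = max(0, replicas - unavailable)
    return new_gss_target, old_gss_target


scenario_1 = first_step_targets(1)   # Fleet with 1 replica
scenario_2 = first_step_targets(10)  # Fleet with 10 replicas
```

For 10 replicas this gives 3 for the new GSS and 8 for the old one, and for 1 replica it gives 1 and 0, matching the third row of each table. The later rows depend on how many replicas are Ready at each sync and are not modelled here.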

Scenario 3: With Allocated GS

Consider a Fleet whose desired replica count is 1 and which contains a single GSS with one Allocated replica. The Fleet's observedGeneration equals its generation, so the Reconciling condition of the Fleet is False. The Fleet’s update strategy is RollingUpdate, with maxSurge and maxUnavailable both set to 25%.

| Stages | Fleet Generation | GSS with Replica counts | Fleet ObservedGeneration | Fleet Reconciling condition value | Fleet status in Kstatus |
| --- | --- | --- | --- | --- | --- |
| Starting state | 1 | GSS 1 (Active): Spec.replica 1, Ready replica 0, Allocated replica 1 | 1 | False | Current |
| Fleet is updated with a new GS image, but the controller has not acted on the change yet | 2 | GSS 1 (Active): Spec.replica 1, Ready replica 0, Allocated replica 1 | 1 | False | InProgress |
| Fleet controller syncs the Fleet; since the only replica of the Inactive GSS is still Allocated, the Fleet controller cannot scale up the Active GSS or scale down the Inactive GSS | 2 | GSS 2 (Active): Spec.replica 0, Ready replica 0, Allocated replica 0; GSS 1 (Inactive): Spec.replica 1, Ready replica 0, Allocated replica 1 | 2 | True | InProgress |
| Fleet controller syncs the Fleet; since the only replica of the Inactive GSS is still Allocated, the Fleet controller cannot scale up the Active GSS | 2 | GSS 2 (Active): Spec.replica 0, Ready replica 0, Allocated replica 0; GSS 1 (Inactive): Spec.replica 1, Ready replica 0, Allocated replica 1 | 2 | True | InProgress |
| Fleet controller syncs the Fleet. At this point, the customer moves the Allocated replica to the Ready state, so the Fleet controller can make progress and update the GSS Spec replicas accordingly | 2 | GSS 2 (Active): Spec.replica 1, Ready replica 0, Allocated replica 0; GSS 1 (Inactive): Spec.replica 0, Ready replica 1, Allocated replica 0 | 2 | True | InProgress |
| Fleet controller syncs the Fleet, new GS is ready and old GS is shut down | 2 | GSS 2 (Active): Spec.replica 1, Ready replica 1, Allocated replica 0; GSS 1 (Inactive): Spec.replica 0, Ready replica 0, Allocated replica 0 | 2 | False | Current |
| Fleet controller syncs the Fleet, Inactive GSS is deleted | 2 | GSS 2 (Active): Spec.replica 1, Ready replica 1, Allocated replica 0 | 2 | False | Current |

markmandel commented 1 year ago

From the meeting notes:

"Done" (for the initial Alpha release) is:

markmandel commented 1 year ago

@castaneai FYI

github-actions[bot] commented 7 months ago

This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale', please add the 'awaiting-maintainer' label or add a comment. Thank you for your contributions.