googleforgames / agones

Dedicated Game Server Hosting and Scaling for Multiplayer Games on Kubernetes
https://agones.dev
Apache License 2.0
6k stars 791 forks

How to easily keep updating agones to the latest version #1742

Open axot opened 4 years ago

axot commented 4 years ago

Is your feature request related to a problem? Please describe. We are thinking about how to introduce Agones into more and more of our projects. The problem we're facing is that k8s has 4 minor upgrades in a single year, which means we have to switch the Agones k8s cluster to a new one each time. If 5 projects are using Agones, this becomes 5x4=20 migrations a year. A single operation to switch all realtime traffic to a new cluster is also risky and takes a lot of operations time. It would be an almost impossible mission for our limited number of k8s engineers.

Describe alternatives you've considered Another thing to consider is that Agones makes heavy use of CRDs and custom controllers, which means we have to figure out all the changes at the source-code level and run integration tests for each version.

Describe the solution you'd like We would like to discuss this together and find a safer and less costly way to operate.

markmandel commented 4 years ago

@axot - super interesting questions.

I'm curious does https://agones.dev/site/docs/installation/upgrading/ help at all?

Also, I would love to know, how automated is your infrastructure setup and testing? Are you doing it all by hand, or automating with something like Terraform and having a CI/CD pipeline to push out and test new versions of your game?

axot commented 4 years ago

Hi,

I'm curious does https://agones.dev/site/docs/installation/upgrading/ help at all?

Yes, we are planning to use the Multiple Clusters strategy for fast rollback to reduce business impact.
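
Roughly, the plan is to stand up a replacement cluster on the new Kubernetes version, install the matching Agones release there, move traffic over, and delete the old cluster once it is drained. A minimal sketch, assuming GKE and the Helm chart from the install docs (cluster names, region, and version numbers below are placeholders):

```
# Create the replacement cluster pinned to the k8s version we want to pair
# with the new Agones release (names, region, and versions are placeholders).
gcloud container clusters create game-servers-v2 \
  --region asia-northeast1 \
  --cluster-version 1.17

gcloud container clusters get-credentials game-servers-v2 --region asia-northeast1

# Install the new Agones release into the fresh cluster.
helm repo add agones https://agones.dev/chart/stable
helm repo update
helm install agones agones/agones \
  --namespace agones-system --create-namespace \
  --version 1.9.0

# Shift allocations / realtime traffic to the new cluster, then once the old
# cluster is drained:
# gcloud container clusters delete game-servers-v1 --region asia-northeast1
```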

As for infrastructure automation: yes, we use Terraform to build the GKE-related resources, and GitLab CI plus Spinnaker for CI/CD. There are still some manual parts.

For example,

  1. Register the new GKE cluster's information in Spinnaker, and update the related pipelines and more.
  2. Helm charts / k8s manifests are installed by hand; ArgoCD is under testing, but it needs more time to be production ready (see the sketch after this list).
  3. The team using Agones does not maintain the k8s part; my team takes responsibility for building all resources except the application code.
  4. We have to write the upgrade manual and perform the online GKE cluster switch.
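
For step 2, a pipeline job could in principle run an idempotent Helm command instead of the manual install; something like the following (release name, namespace, and the version variable are placeholders, the chart repo URL is the one from the Agones install docs):

```
# Hypothetical CI step to replace the manual Helm install (step 2 above).
helm repo add agones https://agones.dev/chart/stable
helm repo update
helm upgrade --install agones agones/agones \
  --namespace agones-system --create-namespace \
  --version "${AGONES_VERSION}" \
  --wait
```
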
markmandel commented 4 years ago

Curious - are there specific blocking issues that stop those manual steps from being automatic? or automated on manual keypress?

Also what are we looking for in a solution here? A best practices document? Changes to Agones? (if so, what changes?) Something else?

axot commented 4 years ago

Let me talk with the other members next week to confirm the remaining work needed to achieve an automated release process!

markmandel commented 4 years ago

Thanks! These are super good questions, and a good topic to get stuck into.

As an aside, I've been wanting a "Solutions" section in the documentation for a while, with some opinionated best practices for specific scenarios, so if that's where we're headed, then :+1:

roberthbailey commented 4 years ago

Another alternative to consider, which I believe @steven-supersolid has said his team does, is to pick a k8s+agones pair each time they roll out a new release.

So instead of worrying about how many k8s / agones upgrades you need to do per year, you say that for each release (every week / month / quarter / etc.) you pick the versions that you want to qualify and support (upgrading when you choose) and roll those out to production. Then on the next release you are free to pick other versions.

This requires using new clusters for each release, but it also allows you to skip Agones / k8s releases if you don't need them, rather than worrying about always keeping up with the latest release.

axot commented 4 years ago

The k8s+agones pair approach is easier to apply to our current environment with less effort.

In our case, we are using managed k8s (GKE), and one difficult thing is that we have to contact the GCP side to disable the auto-upgrade feature for these specific clusters. So if any security patch needs to be applied to GKE, the nodes, or Agones, we then plan to upgrade to the latest k8s+agones pair.

This is a viable option for us. Internally we keep discussing how to make infrastructure operations more automated.
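
For what it's worth, what we would try on the GKE side looks roughly like the sketch below; pool/cluster names, region, and dates are placeholders, and as far as I understand this only defers upgrades rather than preventing them:

```
# Turn off node auto-upgrade for a specific node pool (placeholder names/region).
gcloud container node-pools update default-pool \
  --cluster game-servers-v1 --region asia-northeast1 \
  --no-enable-autoupgrade

# Postpone control-plane maintenance with a maintenance exclusion window
# (this only defers upgrades within GKE's support policy, it cannot block them).
gcloud container clusters update game-servers-v1 \
  --region asia-northeast1 \
  --add-maintenance-exclusion-name agones-qualification \
  --add-maintenance-exclusion-start 2020-10-01T00:00:00Z \
  --add-maintenance-exclusion-end 2020-10-31T00:00:00Z
```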

Thanks!

roberthbailey commented 4 years ago

One difficult thing is that we have to contact the GCP side to disable the auto-upgrade feature for these specific clusters.

You cannot disable auto upgrade for patch releases (often security fixes). And if you wait long enough, you will eventually be forced to upgrade to a new minor version as well.

The idea of picking the k8s+agones pair is that if you do it frequently enough (say, more often than GKE upgrades minor releases) then you can avoid having clusters upgrade in place and instead replace aging clusters with new ones at the new version as part of your normal rollout.

Security patches for k8s are generally backported 3 or so minor versions, so even if you don't upgrade to the latest k8s on GKE you should still be getting patches as they come out. To date Agones has only applied patches to the latest release (and backporting fixes is a lot of work that we haven't yet seen the need for).

So what we would recommend is picking up the latest Agones release, pairing it with the supported k8s version on GKE, and rolling those out once you have qualified them in your environment.
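
If it helps, both sides of the pair can be queried from the command line before you qualify a combination (a sketch; the chart repo URL is the one from the install docs, the region is a placeholder):

```
# See which k8s versions GKE currently offers in your region.
gcloud container get-server-config --region asia-northeast1

# See which Agones chart versions have been published, to pick the pair to qualify.
helm repo add agones https://agones.dev/chart/stable
helm repo update
helm search repo agones/agones --versions
```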

axot commented 4 years ago

Thank you for clarifying the details.

If my understanding is correct, the key to making the k8s+agones pair work smoothly is automating the release process. I will give feedback as soon as possible once our internal discussion is done.

steven-supersolid commented 4 years ago

We are not fully automated yet, so one thing we do to save time is to maintain a dormant set of clusters per project. This way the k8s and Agones upgrades can happen in place, and time is saved recreating clusters.

We've found the GKE support schedule for k8s versions does not force us to upgrade, e.g. k8s 1.14 is still available.

axot commented 4 years ago

There are two main factors that we've been discussing within the team that are holding us back.

One is more of an organizational question (which team is responsible for maintenance), and I'll skip that part.

The other, technical, factor is that we're using Spinnaker for continuous deployment, but Spinnaker is not good at dynamically updating configuration such as cluster information and pipelines, so using gitlab-ci instead, for example, would make this a bit easier to achieve.

roberthbailey commented 1 year ago

From @aimuz in https://github.com/googleforgames/agones/issues/2843:

After reading the upgrade documentation, I was frustrated to find that it doesn't support smooth upgrades, which I think is a big flaw in the feature. I think we should support non-stop upgrades:

When using k8s, we may want to deploy different services inside the same cluster. If we have to migrate the whole cluster for an Agones upgrade, I think that is a drawback: in order to use Agones, we have to build a cluster dedicated to Agones, just so that we can avoid migrating the other applications and services when upgrading Agones.

The cluster where Agones runs contains a lot of other things, such as monitoring, persistence, host optimization, etc., and all of these would need to be set up again, even though there are quite a few tools to simplify that part of the work. I still think this is unnecessary. That's why we should support smooth upgrades and avoid migrating clusters.

Perhaps we can refer to Istio for how to handle upgrades.

github-actions[bot] commented 1 year ago

This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale' please add the 'awaiting-maintainer' label or add a comment. Thank you for your contributions.