googleforgames / agones

Dedicated Game Server Hosting and Scaling for Multiplayer Games on Kubernetes
https://agones.dev
Apache License 2.0

SDK Server: Adopt Sidecar Containers #3642

Open markmandel opened 7 months ago

markmandel commented 7 months ago

Is your feature request related to a problem? Please describe.

The current Health Failure Strategy and lifecycle for the sdkserver and game server containers works, but it has edge cases and is eventually consistent, which can also be fun (but not always in a good way).

This means that as new ways for Pods to fail come up, we have to build out new features to capture them. That is not straightforward, and it often puts more load on the K8s control plane.

Describe the solution you'd like

Once Agones supports Kubernetes 1.29+ (or maybe earlier behind a feature flag?) we could move the sdkserver to the new Sidecar container model.

Then we can set the Pod to restart: Never by default, and the sdkserver sidecar to restart: Always, which would simplify things greatly.
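As a rough illustration of the shape this could take: in the Kubernetes sidecar model, a sidecar is an entry under initContainers with its own restartPolicy: Always, while the Pod-level restartPolicy stays separate. The Go sketch below builds such a spec with small local stand-in types rather than the real k8s.io/api/core/v1 types; the container names, image names, and the buildGameServerPodSpec helper are hypothetical and are not taken from Agones.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Container is a minimal stand-in for a Kubernetes container entry.
// It is a hypothetical local type, not the real core/v1 type.
type Container struct {
	Name          string `json:"name"`
	Image         string `json:"image"`
	RestartPolicy string `json:"restartPolicy,omitempty"`
}

// PodSpec is a minimal stand-in for a Kubernetes Pod spec (hypothetical).
type PodSpec struct {
	RestartPolicy  string      `json:"restartPolicy"`
	InitContainers []Container `json:"initContainers,omitempty"`
	Containers     []Container `json:"containers"`
}

// buildGameServerPodSpec (hypothetical helper) returns a spec where the Pod
// never restarts its regular containers, while the sdkserver runs as a
// sidecar: an init container with restartPolicy Always.
func buildGameServerPodSpec(gameImage, sdkImage string) PodSpec {
	return PodSpec{
		RestartPolicy: "Never",
		InitContainers: []Container{
			{Name: "sdkserver", Image: sdkImage, RestartPolicy: "Always"},
		},
		Containers: []Container{
			{Name: "game-server", Image: gameImage},
		},
	}
}

func main() {
	// Image names are placeholders for illustration only.
	spec := buildGameServerPodSpec("example.com/game:latest", "example.com/sdkserver:latest")
	out, err := json.MarshalIndent(spec, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

With this layout, a crash of the game server container is terminal for the Pod, while the sdkserver sidecar can be restarted independently by the kubelet.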

There is one thing that is definitely tricky here: the way we have things set up now, if a GameServer has not yet reached Ready, we let it restart. If it has already reached Ready, we do not. Maybe we should revisit whether we still need this pattern, for the sake of simplification, especially since if the Pod crashes before being Ready, a new one will be recreated 🤔 Although this breaks backward compatibility, and that is less than ideal.
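To make the difference concrete, here is a minimal Go sketch of the decision described above: what happens when the game server container crashes, depending on whether the GameServer had already reached Ready. The Outcome type and both functions are hypothetical names for illustration, and this is a simplification of the behaviour as described in this issue, not Agones' actual controller logic.

```go
package main

import "fmt"

// Outcome describes what happens to a crashed game server container.
// These values are illustrative labels, not Agones API constants.
type Outcome string

const (
	RestartInPlace Outcome = "restart container in place"
	MarkUnhealthy  Outcome = "move GameServer to Unhealthy"
)

// currentOutcome models today's behaviour as described in this issue:
// crashes before Ready are restarted, crashes after Ready are not.
func currentOutcome(reachedReady bool) Outcome {
	if reachedReady {
		return MarkUnhealthy
	}
	return RestartInPlace
}

// proposedOutcome models the simplification under discussion: with the Pod
// set to never restart, a crash is terminal regardless of lifecycle stage,
// and a replacement would be created instead.
func proposedOutcome(_ bool) Outcome {
	return MarkUnhealthy
}

func main() {
	for _, reachedReady := range []bool{false, true} {
		fmt.Printf("reachedReady=%v current=%q proposed=%q\n",
			reachedReady, currentOutcome(reachedReady), proposedOutcome(reachedReady))
	}
}
```

The only case that changes is a crash before Ready, which is where the backward compatibility concern comes from.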

Describe alternatives you've considered

Leave things as they are, they seem to mostly work!

Additional context

N/A

markmandel commented 6 months ago

Just copying a comment from an offline conversation with a user -- "the atomic non-restart to Unhealthy is way more important to us than being able to restart a game server process before it moves to Ready - so this is a good change".

unlightable commented 3 months ago

> Just copying a comment from an offline conversation with a user -- "the atomic non-restart to Unhealthy is way more important to us than being able to restart a game server process before it moves to Ready - so this is a good change".

I can say the same. Losing an unready GameServer is not a high price in an autoscaling system. Even losing a Ready but not yet Allocated one doesn't look very harsh. But thinking a server is Ready when it somehow isn't will pretty much always lead to problems.

swermin commented 3 months ago

We encountered this problem recently, and I agree that simplifying this flow would be a tremendous help. We have our pods set to restart: Never since we want the game servers to cycle as fast as possible. But with this setup, if the game server terminates before reaching the Ready state, the pod is stuck forever in a state where we cannot reuse it. I understand the necessity of being able to restart a pod if it terminates before reaching the Ready state, but in our case the only thing that can make the server terminate is a crash, so restart: Always would just put the whole pod into a crash loop. To work around this quirk we need some hacks with preStop hooks to get it recycled properly.

Long story short, this is a good change!

github-actions[bot] commented 4 weeks ago

This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale', please add the 'awaiting-maintainer' label or add a comment. Thank you for your contributions.