googleforgames / agones

Dedicated Game Server Hosting and Scaling for Multiplayer Games on Kubernetes
https://agones.dev
Apache License 2.0
6.01k stars 796 forks source link

Dangling Game Servers left behind if allocation fails at the wrong moment #607

Open jkowalski opened 5 years ago

jkowalski commented 5 years ago

There's a rare race condition in Game Server Allocation which leads to dangling game servers that will never be cleaned up.

Basically if a call to allocate the server partially succeeds by marking the GS as Allocated but failing to return the result to the user (for example because of a call timeout or a connection getting dropped at the wrong moment), the user will never know which server was allocated in their call. Even worse, such server will never be in use and will stay Allocated forever, preventing fleet scaling down, etc.

One solution would be to purge Allocated game servers that don't have actual game played on them after a long while. To do that we need to know which GS are active and which are not.

How about the following strategy:

This would obviously be a breaking change for current DGS who don't report the correct health information yet, so we'd need an opt-in or opt-out mechanism for the transition period to avoid getting those GS killed.

markmandel commented 5 years ago

Oh yep - this is fun. And in reality, likely to happen in any distributed system, because weird stuff always does.

2 thoughts on the above:

we add new call to the SDK that indicates the game is really being played Long term I think we will need to add some form of player tracking to the SDK - it's coming up a lot with conversations I'm having with the openmatch team. How deep the feature set is an ongoing question.

any DGS that does not receive health signal indicating that the game is being played So in theory - a DGS could do this now, without changes to Agones. It would be ideal if we handled this as well in the SDK, but you could tie your healthcheck ping to playercount, and if after N minutes, nothing happens, stop sending it back as Healthy() (or you could also call sdk.Shutdown() too?)

But this probably really goes to a wider conversation about player tracking, and how we want to manage that.

cyriltovena commented 5 years ago

Adding a mandatory call from the SDK when the game start seems a good solution without having to handle player tracking for now.

And I agree, we need opt-in or opt-out, I think opt-out is better since we want our users to not have dangling GameServer by default.

ilkercelikyilmaz commented 5 years ago

Maybe we should consider having another state called "InUse". After the GS state is set to Allocated, game running on the GS should call Agones to change the state from Allocated to InUse. If the GS is in Allocated state for X minutes, system can shut it down and release the resources.

markmandel commented 5 years ago

At some point in the near future, we will need to track at the very least - max number of players a DGS can handle, and the number of players currently playing - to make it easier to support backfill for matchmakers (or even just front-fill).

This feels all related to be.

cyriltovena commented 5 years ago

I'm wondering if the game server could handle this itself by shutting down after a given timeout.

markmandel commented 5 years ago

I'm wondering if we (optionally) track player counts/ids per game server, we could have some kind of timeout to shutdown if allocated but no players connect after a certain period? (Also optional)

castaneai commented 4 years ago

Hi. I found the same problem.

In games where a small number of players match, the server that fails the allocation is still the dangling server. If the player retries the allocation, a new server will be allocated and the dangling server will remain. (Matchmaker has no way of knowing which server failed to allocate)

Currently I am implementing PlayerCount management and @jkowalski 's solution on the GameServer side that If PlayerCount is zero for a certain period in Allocated state, Shutdown is called automatically.

This issue can occur in a variety of games, so it would be nice if Agones could provide a solution!

markmandel commented 4 years ago

We are part of the way there, with Player Tracking being in alpha right now!

Not sure if we should wait until GA for this feature, but what would you like this to look like? Some sort of additional configuration option on a GameServer yaml?

castaneai commented 4 years ago

I agree. It is a good idea to avoid the dangling server by writing it in GameServer yaml.

markmandel commented 1 year ago

This still seems like a good idea, but will get moved to https://github.com/googleforgames/agones/issues/2716 once it's implemented.

github-actions[bot] commented 1 year ago

'This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale' please add 'awaiting-maintainer' label or add a comment. Thank you for your contributions '

markmandel commented 1 year ago

Added awaiting-maintainer as I think we should definitely still do this one #2716 is complete.