googleforgames / agones

Dedicated Game Server Hosting and Scaling for Multiplayer Games on Kubernetes
https://agones.dev
Apache License 2.0
6.09k stars 812 forks source link

Replace global allocation mutex with fine-grained concurrency controls. #535

Closed jkowalski closed 5 years ago

jkowalski commented 5 years ago

Currently there's a global mutex shared between 4 controllers that prevents allocations from happening when there's a deletion of (any) game server going on, regardless of a fleet it's on, etc.

To get decent allocation throughput this mutex should be removed and we should start relying on conditional mutations of GameServer itself to ensure correctness.

markmandel commented 5 years ago

The biggest issue here is - how to delete safely without accidentally deleting something that is being concurrently allocated.

Or "only delete this object if it's equal to this revision/generation"

markmandel commented 5 years ago

Here's a possibly crazy idea - we could reject the deletion with a webhook! The webhook could check if the object is allocated (pretty sure this is 100% up to date), and if it is, reject the deletion.

I think this will work!

Edit: removed idea, as this below is better.

jkowalski commented 5 years ago

This can be done by adding something like "Deleting" status and using regular Update:

Instead of deleting, callers would set "Deleting" status (via regular Update which resolves race conditions) and GS controller would be the only one to trigger actual deletion for GS in Deleting state.

jkowalski commented 5 years ago

BTW. As soon as we remove the mutex, we will see increased contention on Game Server resources. That can be fixed with batching and randomization of GS to allocate (see #536).

markmandel commented 5 years ago

This can be done by adding something like "Deleting" status and using regular Update:

You know what - we already have this functionality, because of the SDK. It's called Shutdown state. https://github.com/GoogleCloudPlatform/agones/blob/master/pkg/gameservers/controller.go#L619-L636

Good call. I like it :+1:

ilkercelikyilmaz commented 5 years ago

I think I can work on this.

thisisnotapril commented 5 years ago

@jkowalski @ilkercelikyilmaz assignment made!

ilkercelikyilmaz commented 5 years ago

PR #572 fixes this partially. I will continue on the allocation improvements so we can get rid-off the Allocation Mutex