Poor packing with agones-allocator and Counter for high-capacity game servers

dmorgan-blizz commented 1 month ago

What happened:

Allocated game servers are not efficiently packed, even though they have plenty of Counter space: game_players

As you can see, when making allocation requests, Agones starts to fill up the first server, but then, before that server is anywhere close to its 300 Counter capacity, it starts sending players to the Ready buffer server. Now there are multiple game servers active, but even though those servers also have plenty of space, Agones will still occasionally allocate more Ready buffer servers.

This also occurred when scaling agones-allocator replicas down to 1 (from 3), and actually, the packing was worse in this scenario, even with an allocation rate as low as 5/s.

What you expected to happen: Agones should always pick an Allocated game server with available Counter capacity before choosing a Ready buffer server.
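To make the expected behaviour concrete, here is a minimal Go sketch of ordered selector fallback. It is not Agones code: the `GameServer` and `Selector` types and the `pick` and `allocate` helpers are hypothetical simplifications, and the "prefer the fullest server" tie-break is an assumption about what good packing would look like.

```go
package main

import "fmt"

// GameServer is a simplified, hypothetical stand-in for an Agones GameServer.
type GameServer struct {
	Name     string
	State    string // "Allocated" or "Ready"
	Count    int64  // current value of the "players" Counter
	Capacity int64  // capacity of the "players" Counter
}

// Selector keeps only the two request fields that matter for this sketch.
type Selector struct {
	State        string
	MinAvailable int64
}

// matches reports whether gs is in the wanted state with enough free Counter room.
func (s Selector) matches(gs GameServer) bool {
	return gs.State == s.State && gs.Capacity-gs.Count >= s.MinAvailable
}

// pick tries each selector in order and returns the index of the chosen
// server, or -1 if nothing matches. Within one selector it prefers the
// fullest server (an assumed tie-break, chosen to favour packing).
func pick(servers []GameServer, selectors []Selector) int {
	for _, sel := range selectors {
		best := -1
		for i, gs := range servers {
			if !sel.matches(gs) {
				continue
			}
			if best == -1 || gs.Count > servers[best].Count {
				best = i
			}
		}
		if best != -1 {
			return best
		}
	}
	return -1
}

// allocate picks a server, applies an Increment of 1, and marks it Allocated.
func allocate(servers []GameServer, selectors []Selector) bool {
	i := pick(servers, selectors)
	if i < 0 {
		return false
	}
	servers[i].Count++
	servers[i].State = "Allocated"
	return true
}

func main() {
	servers := []GameServer{
		{Name: "gs-a", State: "Allocated", Count: 120, Capacity: 300},
		{Name: "gs-b", State: "Ready", Count: 0, Capacity: 300},
	}
	selectors := []Selector{
		{State: "Allocated", MinAvailable: 1},
		{State: "Ready", MinAvailable: 1},
	}
	for i := 0; i < 200; i++ {
		allocate(servers, selectors)
	}
	// gs-a fills to capacity before the Ready buffer server is touched.
	fmt.Printf("%s=%d %s=%d\n", servers[0].Name, servers[0].Count, servers[1].Name, servers[1].Count)
	// prints "gs-a=300 gs-b=20"
}
```

With this ordering the Ready server is only used once every Allocated server is full, which is the packing we expected to see.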

How to reproduce it (as minimally and precisely as possible): Game server settings like the following:

Kind: FleetAutoscaler
Spec:
  Policy:
    Type: Buffer
    Buffer:
      Buffer Size: 1

kind: Fleet
spec:
  scheduling: Packed
  template:
    spec:
      counters:
        players:
          capacity: 300

where a game server runs for a period of time and players continually cycle in and out (using the SDK's DecrementCounterAsync(string key, long amount) upon game end);

and an allocation request like the following:

GameServerSelectors = {
    new GameServerSelector {
        GameServerState = GameServerSelector.Types.GameServerState.Allocated,
        MatchLabels = {
            { "version", "1.2.3" },
        },
        Counters = {
            {
                "players", new CounterSelector {
                    MinAvailable = 1,
                }
            },
        },
    },
    new GameServerSelector {
        GameServerState = GameServerSelector.Types.GameServerState.Ready,
        MatchLabels = {
            { "version", "1.2.3" },
        },
        Counters = {
            {
                "players", new CounterSelector {
                    MinAvailable = 1,
                }
            },
        },
    },
},
Counters = {
    {
        "players", new CounterAction {
            Action = "Increment",
            Amount = 1,
        }
    },
},

using https://github.com/googleforgames/agones/blob/release-1.41.0/proto/allocation/allocation.proto and gRPC

agones:
  agones:
    allocator:
      service:
        http:
          enabled: false
        grpc:
          enabled: true
        serviceType: ClusterIP

Anything else we need to know?: agones-allocator game server state does not seem to be shared between replicas (which seems like it should also be fixed, perhaps with Redis or something similar). However, as mentioned, even with only one replica, the packing was as bad or worse.

Environment:

Agones version: 1.41.0
Kubernetes version (use kubectl version): 1.28.7-gke.1026001
Cloud provider or hardware configuration: GCP
Install method (yaml/helm): helm
Troubleshooting guide log(s):
Others:

igooch commented 1 month ago

Could you try testing with only one node? To rule out whether or not it's hitting these lines and returning before it gets to the CountsAndLists logic https://github.com/googleforgames/agones/blob/77face16d95d921ec65b902d7e26ee811598c601/pkg/gameserverallocations/allocation_cache.go#L202-L209

Jensaarai commented 1 month ago

> Could you try testing with only one node? To rule out whether or not it's hitting these lines and returning before it gets to the CountsAndLists logic

We can try, but that will take longer to set up, so it might be a while before I can report results.

markmandel commented 2 weeks ago

I am wondering if this has more to do with the way we cache during allocation.

i.e.

https://github.com/googleforgames/agones/blob/a27aeab7e2405e491f6cb27e43721131ea21c6e6/pkg/gameserverallocations/allocator.go#L563-L567

Where we drop the allocated gameserver from the cache, and it may take a while to come back (but 5s seems way too long).

I wonder what would happen if we took an Allocated GameServer and put the allocated state back into the cache after we got it 🤔

Edit:

Something to try is here: https://github.com/googleforgames/agones/blob/a27aeab7e2405e491f6cb27e43721131ea21c6e6/pkg/gameserverallocations/allocator.go#L595

Putting it back into allocationCache on successful allocation or re-allocation would, I think, make a lot of sense.
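As a rough illustration of why that might help, here is a toy Go simulation. It is not the Agones allocator: `simulate`, `server`, and `returnDelay` are hypothetical names, one request arrives per tick, and a fresh Ready server is assumed to always be available when nothing in the cache matches.

```go
package main

import "fmt"

const capacity = 300 // "players" Counter capacity from the Fleet spec

// server is a hypothetical, stripped-down game server record.
type server struct {
	id        int
	allocated bool
	players   int
}

// simulate runs n allocation requests, one per tick, against a toy cache.
// When putBack is false, an allocated server is dropped from the cache and
// only reappears returnDelay ticks later (standing in for the delayed update).
// When putBack is true, it is written straight back in its Allocated state.
// It returns how many distinct servers ended up receiving players.
func simulate(n, returnDelay int, putBack bool) int {
	type pending struct {
		s  *server
		at int // tick at which the server is visible in the cache again
	}
	cache := map[int]*server{}
	var waiting []pending
	created := 0

	for tick := 0; tick < n; tick++ {
		// Re-add servers whose delayed update has arrived.
		rest := waiting[:0]
		for _, p := range waiting {
			if p.at <= tick {
				cache[p.s.id] = p.s
			} else {
				rest = append(rest, p)
			}
		}
		waiting = rest

		// Prefer the fullest Allocated server that still has Counter room.
		var chosen *server
		for _, s := range cache {
			if s.allocated && s.players < capacity && (chosen == nil || s.players > chosen.players) {
				chosen = s
			}
		}
		if chosen == nil {
			// Nothing suitable is visible: fall back to a fresh Ready server.
			chosen = &server{id: created}
			created++
		}
		chosen.allocated = true
		chosen.players++

		if putBack {
			cache[chosen.id] = chosen
		} else {
			delete(cache, chosen.id)
			waiting = append(waiting, pending{chosen, tick + returnDelay})
		}
	}
	return created
}

func main() {
	fmt.Println("dropped from cache:", simulate(100, 3, false)) // 3 servers used
	fmt.Println("put back in cache: ", simulate(100, 3, true))  // 1 server used
}
```

In this model, 100 players fit on one server when the Allocated state is put back immediately, but spread over as many servers as there are ticks of delay when it is not. Whether the real allocator behaves the same way would need to be confirmed against the code linked above.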

markmandel commented 2 weeks ago

Another thought on a potential workaround (or a way to make things better): try a shorter batch time on allocation. That refreshes the list of potential allocated gameservers that the allocator looks at more often, at the cost of higher CPU usage.
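A hedged sketch of the batch-time effect, again as a toy model rather than the real allocator: the candidate list is refreshed once every `batch` requests (roughly request rate times batch time), each allocation removes the chosen server from that list until the next refresh, and a fresh Ready server is assumed to be available whenever the list has no match. `serversUsed` and `gs` are hypothetical names.

```go
package main

import "fmt"

const capacity = 300 // "players" Counter capacity

// gs is a hypothetical game server that only tracks its player count.
type gs struct{ players int }

// serversUsed runs the given number of allocation requests and returns how
// many servers were needed when the candidate list is refreshed once every
// `batch` requests.
func serversUsed(requests, batch int) int {
	var fleet []*gs    // every server that exists
	var snapshot []*gs // what the allocator can see within the current batch

	for i := 0; i < requests; i++ {
		if i%batch == 0 {
			// Start of a new batch: refresh the list from the fleet.
			snapshot = append(snapshot[:0], fleet...)
		}

		// Pick the fullest visible server that still has room.
		best := -1
		for j, s := range snapshot {
			if s.players < capacity && (best == -1 || s.players > snapshot[best].players) {
				best = j
			}
		}

		var chosen *gs
		if best >= 0 {
			chosen = snapshot[best]
			// The chosen server is not visible again until the next refresh.
			snapshot = append(snapshot[:best], snapshot[best+1:]...)
		} else {
			// Nothing visible: use a fresh Ready server.
			chosen = &gs{}
			fleet = append(fleet, chosen)
		}
		chosen.players++
	}
	return len(fleet)
}

func main() {
	for _, batch := range []int{1, 5, 10} {
		fmt.Printf("batch of %d requests -> %d servers\n", batch, serversUsed(100, batch))
	}
}
```

Under these assumptions the number of servers touched grows with the number of requests that land in one batch, so a shorter batch time (fewer requests per refresh) packs better.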

googleforgames/agones issue #3992