googleforgames / agones

Dedicated Game Server Hosting and Scaling for Multiplayer Games on Kubernetes
https://agones.dev
Apache License 2.0

Contention errors from allocator gRPC service under high load #1856

Open pooneh-m opened 3 years ago

pooneh-m commented 3 years ago

What happened: Performance testing revealed that the agones-allocator service returns contention errors under heavy load when there are multiple replicas of the service.

What you expected to happen: The service should allocate game servers under heavy load without contention errors.

How to reproduce it (as minimally and precisely as possible): Run a performance test with 50 parallel clients and 4000 gameservers.

Anything else we need to know?: The agones-allocator pods cache game servers in memory. Because the state of game servers is changed by different pods, the cache can quickly go out of sync with the actual state of game servers in the cluster. Either the cache should be replaced with a key-value store shared between the pods, or the allocators should watch game server changes and use the k8s API to fetch a ready game server by its labels without caching them.
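
To make the second option concrete, here is a minimal sketch (an assumption, not existing allocator code) that asks the Kubernetes API for Ready GameServers through the Agones clientset; the "default" namespace and the fleet label value are illustrative:

    package main

    import (
        "context"
        "fmt"

        agonesv1 "agones.dev/agones/pkg/apis/agones/v1"
        versioned "agones.dev/agones/pkg/client/clientset/versioned"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/rest"
    )

    func main() {
        cfg, err := rest.InClusterConfig()
        if err != nil {
            panic(err)
        }
        client, err := versioned.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }

        // List GameServers belonging to a fleet, then keep only those whose
        // status is Ready. An informer or label/field index would be used in
        // practice; a plain List keeps the sketch short.
        list, err := client.AgonesV1().GameServers("default").List(context.Background(),
            metav1.ListOptions{LabelSelector: "agones.dev/fleet=simple-game-server"})
        if err != nil {
            panic(err)
        }
        for _, gs := range list.Items {
            if gs.Status.State == agonesv1.GameServerStateReady {
                fmt.Println("ready:", gs.Name)
            }
        }
    }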

The documentation should also provide recommendations on the number of replicas for packed and distributed allocation.

Environment:

markmandel commented 3 years ago

I'm assuming this is primarily a problem when we are using the Packed allocation strategy (it would be good to have performance metrics on Distributed vs Packed error rate and throughput): since we are trying to bin pack the Allocated game servers, the sorted cache on each allocator binary will generally target the same GameServers.
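
To make the contention mechanism concrete, here is a conceptual sketch (not the allocator's actual code): with Packed, every replica sorts its own cache the same way and takes the head, so under load the replicas race to allocate the same GameServer and all but one hit a conflict.

    package main

    import (
        "fmt"
        "sort"
    )

    // cachedGS is a stand-in for an entry in an allocator replica's ready cache.
    type cachedGS struct {
        name      string
        nodeName  string
        gsPerNode int // game servers already running on this node
    }

    // pickPacked mimics bin packing: prefer game servers on the most used node,
    // so emptier nodes can be scaled down sooner.
    func pickPacked(cache []cachedGS) cachedGS {
        sort.Slice(cache, func(i, j int) bool {
            return cache[i].gsPerNode > cache[j].gsPerNode
        })
        // Every replica with a similar cache picks the same head element,
        // which is exactly where the update conflicts come from.
        return cache[0]
    }

    func main() {
        cache := []cachedGS{
            {name: "gs-a", nodeName: "node-1", gsPerNode: 3},
            {name: "gs-b", nodeName: "node-2", gsPerNode: 1},
            {name: "gs-c", nodeName: "node-1", gsPerNode: 3},
        }
        fmt.Println("all replicas will try:", pickPacked(cache).name)
    }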

Some thoughts on this:

I must say - this has taken me on quite a trip down memory lane to remember how allocation works!

markmandel commented 3 years ago

One area of research we should also confirm: how well is the gRPC endpoint actually load balanced?

It's quite possible there is only one pod being used at any given point in time with the current load balancing setup.
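
One quick way to test that hypothesis (a sketch under assumptions, not the documented client setup): dial the allocator through gRPC's dns resolver against a headless Service and enable client-side round_robin, which spreads calls across the resolved pod addresses instead of pinning a single HTTP/2 connection behind an L4 load balancer.

    package main

    import (
        "log"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
    )

    func main() {
        conn, err := grpc.Dial(
            // Illustrative in-cluster target; the real endpoint depends on how
            // agones-allocator is exposed (LoadBalancer, Ingress, etc.).
            "dns:///agones-allocator.agones-system.svc.cluster.local:443",
            // The real allocator service uses mTLS; insecure creds keep the sketch short.
            grpc.WithTransportCredentials(insecure.NewCredentials()),
            // Client-side round_robin over all resolved addresses.
            grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
        )
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
        log.Println("connection state:", conn.GetState())
    }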

markmandel commented 3 years ago

Another thought, at this point: https://github.com/googleforgames/agones/blob/master/pkg/gameserverallocations/allocator.go#L547

Instead have it be:

    gs, err := c.readyGameServerCache.PatchGameServerMetadata(res.request.gsa.Spec.MetaPatch, res.gs)
    if err != nil {
        // since we could not allocate, we should put it back
        if !k8serrors.IsConflict(err) { // this is the new bit
            c.readyGameServerCache.AddToReadyGameServer(gs)
        }
        res.err = errors.Wrap(err, "error updating allocated gameserver")
    } else {
        res.gs = gs
        c.recorder.Event(res.gs, corev1.EventTypeNormal, string(res.gs.Status.State), "Allocated")
    }

Basically, if there is a conflict, let the actual version re-populate the cache from the K8s watch operation, since we know this version of the GameServer is stale because it conflicted on the update.

github-actions[bot] commented 11 months ago

This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale' please add 'awaiting-maintainer' label or add a comment. Thank you for your contributions


github-actions[bot] commented 6 months ago

This issue is marked as obsolete due to inactivity for the last 60 days. To avoid the issue getting closed in the next 30 days, please add a comment or add the 'awaiting-maintainer' label. Thank you for your contributions