pooneh-m opened this issue 3 years ago
I'm assuming this is primarily a problem when using the Packed allocation strategy (it would be good to have performance metrics on error rate and throughput for Distributed vs Packed), since bin packing Allocated game servers means the sorted cache on each allocator binary will generally target the same GameServers.
Some thoughts on this:
I must say - this has taken me on quite a trip down memory lane to remember how allocation works!
One area of research we should also confirm -- how much is the gRPC endpoint actually load-balanced?
It's quite possible that only one pod is being used at any given point in time with the current load balancing setup.
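One way to test that theory: gRPC holds a single long-lived HTTP/2 connection, so a plain ClusterIP/LoadBalancer Service can pin all traffic to one pod. A headless Service lets clients resolve per-pod addresses and round-robin across them. This is only an illustrative sketch - the Service name, namespace, and selector label below are assumptions, not the actual chart values:

```yaml
# Illustrative only - names and labels are assumptions, not the real chart values.
apiVersion: v1
kind: Service
metadata:
  name: agones-allocator-headless
  namespace: agones-system
spec:
  clusterIP: None          # headless: DNS returns each allocator pod IP
  selector:
    multicluster.agones.dev/role: allocator
  ports:
    - name: grpc
      port: 443
```

A gRPC client dialing `dns:///agones-allocator-headless.agones-system.svc.cluster.local:443` with the `round_robin` load balancing policy would then spread requests across pods instead of pinning to one.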
Another thought, at this point: https://github.com/googleforgames/agones/blob/master/pkg/gameserverallocations/allocator.go#L547
Instead have it be:
```go
gs, err := c.readyGameServerCache.PatchGameServerMetadata(res.request.gsa.Spec.MetaPatch, res.gs)
if err != nil {
	// since we could not allocate, we should put it back
	if !k8serrors.IsConflict(err) { // this is the new bit
		c.readyGameServerCache.AddToReadyGameServer(gs)
	}
	res.err = errors.Wrap(err, "error updating allocated gameserver")
} else {
	res.gs = gs
	c.recorder.Event(res.gs, corev1.EventTypeNormal, string(res.gs.Status.State), "Allocated")
}
```
Basically, if there is a conflict, let the actual version re-populate the cache from the K8s watch operation, since we know this version of the GameServer is stale - it conflicted on the update.
This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale', please add the 'awaiting-maintainer' label or add a comment. Thank you for your contributions.
This issue is marked as obsolete due to inactivity for the last 60 days. To avoid the issue being closed in the next 30 days, please add a comment or add the 'awaiting-maintainer' label. Thank you for your contributions.
What happened: Performance testing revealed that the agones-allocator service returns contention errors under heavy load when there are multiple replicas of the service.

What you expected to happen: The service should allocate game servers under heavy load with no contention errors.
How to reproduce it (as minimally and precisely as possible): Run a performance test with 50 parallel clients and 4000 gameservers.
Anything else we need to know?: The agones-allocator pods cache game servers in memory. Because the state of game servers is changed by different pods, the cache can quickly go out of sync with the state of game servers in the cluster. Either the cache should be changed to a key-value store shared between the pods, or the allocators should watch game server changes and use the K8s API to get a Ready game server by its labels without caching them. Documentation should also provide recommendations on the number of replicas for Packed and Distributed allocation.
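The "no cache, select by labels" alternative can be sketched as follows. This is a minimal stand-in, not the real Agones API: the `gameServer` struct and `pickReady` function are hypothetical, and the slice stands in for a K8s List call with a label selector, so every allocator pod would read from the same source of truth instead of a private cache:

```go
package main

import "fmt"

// gameServer is a minimal stand-in for the Agones GameServer object;
// the fields here are illustrative, not the real API.
type gameServer struct {
	Name   string
	State  string // "Ready", "Allocated", ...
	Labels map[string]string
}

// pickReady models the proposed cache-free approach: query for a
// Ready server matching a label selector (here, a plain slice stands
// in for a List call against the K8s API).
func pickReady(servers []gameServer, selector map[string]string) (gameServer, bool) {
	for _, gs := range servers {
		if gs.State != "Ready" {
			continue
		}
		match := true
		for k, v := range selector {
			if gs.Labels[k] != v {
				match = false
				break
			}
		}
		if match {
			return gs, true
		}
	}
	return gameServer{}, false
}

func main() {
	servers := []gameServer{
		{Name: "gs-1", State: "Allocated", Labels: map[string]string{"fleet": "blue"}},
		{Name: "gs-2", State: "Ready", Labels: map[string]string{"fleet": "blue"}},
	}
	if gs, ok := pickReady(servers, map[string]string{"fleet": "blue"}); ok {
		fmt.Println(gs.Name) // gs-2
	}
}
```

The trade-off, of course, is an API round trip per allocation instead of a cache hit, which is why the shared key-value store is listed as the other option.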
Environment:
Kubernetes version (use kubectl version): 1.16