googleforgames / agones

Dedicated Game Server Hosting and Scaling for Multiplayer Games on Kubernetes
https://agones.dev
Apache License 2.0

Make batchWaitTime configurable in the Allocator #2586

Closed valentintorikian closed 2 years ago

valentintorikian commented 2 years ago

**Is your feature request related to a problem? Please describe.**
We'd like to make the batchWaitTime configurable on the Allocator. Under certain allocation patterns, the hard-coded 500ms sleep is too aggressive and degrades allocation latency.

**Describe the solution you'd like**
Make the batchWaitTime configurable through an environment variable, with the default set to 500ms to avoid any breaking change.

**Describe alternatives you've considered**

Those two solutions would change existing behaviour, so I don't think they should be considered.

**Additional context**
From our internal testing the impact on CPU usage is negligible, but the improvement in p95 allocation latency, as well as in allocation throughput (allocations per second), is significant. In any case the default value won't change, so existing behaviour is kept intact.
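As a rough illustration of the request (not the actual Agones implementation), here is a minimal sketch of reading the wait time from a hypothetical `BATCH_WAIT_TIME` environment variable and falling back to the current 500ms default when it is unset or invalid, so existing behaviour is preserved:

```go
package main

import (
	"log"
	"os"
	"time"
)

// defaultBatchWaitTime mirrors the value currently hard-coded in the allocator.
const defaultBatchWaitTime = 500 * time.Millisecond

// batchWaitTimeFromEnv reads a hypothetical BATCH_WAIT_TIME variable
// (e.g. "200ms"). If the variable is unset or cannot be parsed as a
// positive duration, it falls back to the 500ms default, keeping the
// existing behaviour intact.
func batchWaitTimeFromEnv() time.Duration {
	raw := os.Getenv("BATCH_WAIT_TIME")
	if raw == "" {
		return defaultBatchWaitTime
	}
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		log.Printf("invalid BATCH_WAIT_TIME %q, falling back to %s", raw, defaultBatchWaitTime)
		return defaultBatchWaitTime
	}
	return d
}

func main() {
	log.Printf("allocator batchWaitTime: %s", batchWaitTimeFromEnv())
}
```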

roberthbailey commented 2 years ago

Relevant code block: https://github.com/googleforgames/agones/blob/a8993fe73938fc9f924b0732e960a5111c9bd5af/pkg/gameserverallocations/allocator.go#L457-L537

roberthbailey commented 2 years ago

You might run into other issues when decreasing the batching time, e.g. more load on the system from listing and sorting game servers. But if the batch time is configurable, you can tune the balance for your game between allocation latency and the amount of CPU you are willing to churn.
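To make that trade-off concrete, here is a simplified sketch of a batching loop (an assumed structure, not the actual allocator code linked above): requests queue up while the loop sleeps for batchWaitTime, so a shorter wait means each request spends less time queued, but the comparatively expensive per-batch work (such as listing and sorting) runs more often.

```go
package main

import (
	"fmt"
	"time"
)

// request stands in for a pending allocation request; the real allocator
// carries much richer state than this.
type request struct{ id int }

// runBatches drains all pending requests, handles them as one batch, then
// waits batchWaitTime before the next pass. Lowering batchWaitTime reduces
// queueing latency per request at the cost of running the batch pass (and
// whatever listing/sorting it implies) more frequently.
func runBatches(pending <-chan request, batchWaitTime time.Duration, stop <-chan struct{}) {
	for {
		var batch []request
	drain:
		for {
			select {
			case r := <-pending:
				batch = append(batch, r)
			default:
				break drain
			}
		}
		if len(batch) > 0 {
			fmt.Printf("processing batch of %d request(s)\n", len(batch))
		}
		select {
		case <-stop:
			return
		case <-time.After(batchWaitTime):
		}
	}
}

func main() {
	pending := make(chan request, 16)
	stop := make(chan struct{})
	go runBatches(pending, 200*time.Millisecond, stop)

	// Feed a few requests and observe how they are grouped into batches.
	for i := 0; i < 5; i++ {
		pending <- request{id: i}
		time.Sleep(120 * time.Millisecond)
	}
	time.Sleep(300 * time.Millisecond)
	close(stop)
}
```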

markmandel commented 2 years ago

> From our internal testing the impact on CPU usage is negligible, but the improvement in p95 allocation latency, as well as in allocation throughput (allocations per second), is significant.

Curious - what change did you make to the batch time to see this improvement?

valentintorikian commented 2 years ago

We built two images with different hardcoded batchWaitTime values, one with 200ms and one with 100ms, and ran an allocation scenario of around 0.5 allocations/second.

The node hosting the controller is an EC2 t3.xlarge.

The results for the default value and the two test builds:

| batchWaitTime | p50 | p90 | p95 | p99 |
| --- | --- | --- | --- | --- |
| 500ms (default) | ~175ms | ~240ms | ~250ms | ~600ms |
| 200ms | ~65ms | ~115ms | ~185ms | ~235ms |
| 100ms | ~65ms | ~100ms | ~137ms | ~245ms |

CPU/memory usage was comparable across all three test cases.

markmandel commented 2 years ago

How did you see it changing throughput? What sort of scenario did you run it against? (i.e. size of cluster, churn in the cluster, etc.)

valentintorikian commented 2 years ago

The cluster is configured with a fleet of 200 GameServers, which are shut down 60s after allocation. The cluster is composed of 8 nodes.

Coming back to our monitoring data: [monitoring graph]. The annotations are, in order: the 500ms test, the 200ms test, and the 100ms test.

This is with a 5m rate interval.

The allocation rate uplift is minor but present, although our goal is to reduce the latency as much as possible, regardless of CPU churn.

markmandel commented 2 years ago

I guess my thoughts here are:

  1. Is 200 GameServers across a cluster of 8 nodes a usual amount for your workload?
  2. Is the churn rate you currently have in the cluster indicative of what you will have in production?
  3. Once the churn rate increases, will the etcd bottleneck basically remove any performance gains you are likely to get from making this change?
  4. Is this an optimisation that works in isolation for this specific test case, but when facing a real-world test is not actually going to change much? (As I'm sure you are aware, Kubernetes performance metrics change drastically based on cluster size, pod churn rate, etc.)

Not to say I'm against making this a configurable value - just wanting to make sure that the effort to change it is worth it, and we aren't looking at a premature optimisation.

valentintorikian commented 2 years ago

GameServer count per node as well as cluster topology (node size AND node count) can vary greatly based on which game is consuming Agones.

Churn rate is indeed artificially high in those test cases, and the tests are very synthetic in their allocation rate.

With all that taken into account, making that sleep configurable would allow us to better tailor performance on a game-by-game basis.

I understand your concern about premature optimization, but from my point of view the effort to add that additional knob is pretty minimal, and it's backward compatible because we retain the default 500ms value.

markmandel commented 2 years ago

Totally fair enough - just wanted to ask the question to make sure 👍🏻