googleforgames / agones

Dedicated Game Server Hosting and Scaling for Multiplayer Games on Kubernetes
https://agones.dev
Apache License 2.0
5.97k stars 788 forks

Add `podMetadata` to Allocation API #2975

Closed · Reasonably closed 1 year ago

Reasonably commented 1 year ago

Is your feature request related to a problem? Please describe.

I encountered two issues while using Agones that could be resolved with a new feature.

Firstly, in my game, a very large number of sessions are used, and setting the safe-to-evict flag to false in the pod specification causes the cluster autoscaler (CA) to not work properly, resulting in wasted costs. However, setting safe-to-evict to true is also not a viable solution, as session durations vary from 20 minutes to 3 hours after allocation. So, we added a code modification to the Agones allocator to change the safe-to-evict annotation to false at the moment of allocation, which resulted in a branching of the official Agones code.
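For context, the eviction behavior described above is driven by a standard cluster-autoscaler annotation on the Pod. In a GameServer spec it is set on the Pod template at creation time, which is exactly why it cannot be changed later at allocation. A minimal sketch (resource names and image are made up):

```yaml
# GameServer spec fragment: the safe-to-evict annotation is fixed on the
# Pod template when the GameServer is created, not at allocation time.
apiVersion: agones.dev/v1
kind: GameServer
metadata:
  name: example-gameserver
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      containers:
        - name: example-server
          image: example/game-server:latest
```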

Secondly, in my game, various information such as the map and match type is determined at the moment of allocation and not at the creation of the GameServer. I know that it is currently possible to modify the metadata of the GameServer. However, modifying the metadata of the Pod is currently not possible, and this metadata is more useful than the GameServer's metadata from a telemetry perspective.

Describe the solution you'd like

I propose that a new feature be added to Agones that allows modifying the metadata of the Pod at allocation time. This could be achieved by adding the ability to modify the labels and annotations of Pods through the Allocation API, e.g. by adding `podMetadata` to the Allocation API. With this feature, safe-to-evict could be changed dynamically via the annotation, and the various data points determined at allocation time could be reflected in the Pod's metadata.
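A minimal sketch of the proposed API shape. The `podMetadata` field here is hypothetical and does not exist in the Allocation API at the time of writing; `spec.metadata`, which targets the allocated GameServer, already does:

```yaml
apiVersion: allocation.agones.dev/v1
kind: GameServerAllocation
spec:
  required:
    matchLabels:
      agones.dev/fleet: example-fleet
  # Existing field: applied to the allocated GameServer.
  metadata:
    labels:
      mode: competitive
  # Proposed (hypothetical) field: applied to the GameServer's Pod.
  podMetadata:
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    labels:
      map: island
      mode: competitive
```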

Describe alternatives you've considered

An alternative solution could be to add a syncGameServerMetadataWithPods option to the GameServer resource, with supporting logic in the controller. With this option enabled, metadata changes made to the GameServer at allocation time would be propagated to its Pod, achieving the same effect.

Additional context

I have considered developing this feature myself.

markmandel commented 1 year ago

Firstly, in my game, a very large number of sessions are used, and setting the safe-to-evict flag to false in the pod specification causes the cluster autoscaler (CA) to not work properly, resulting in wasted costs.

What does "not work properly" mean to you here?

So, we added a code modification to the Agones allocator to change the safe-to-evict annotation to false at the moment of allocation, which resulted in a branching of the official Agones code.

The reason we never did this in the Agones code is because of the possibility for race conditions. It's entirely possible that a node shutdown could occur around the same time as an allocation -- so this seems very risky to me.

However, modifying the metadata of the Pod is currently not possible, and this metadata is more useful than the GameServer's metadata from a telemetry perspective.

Can you expand on this? Why is this more useful from a telemetry perspective?

Reasonably commented 1 year ago

What does "not work properly" mean to you here?

I'm using a packed scheduling strategy, but it does not work perfectly. Inevitably, there are times when pods remain on a node, and since safe-to-evict is false, the CA does not delete the node. In such cases, the node remains active and costs are incurred.

The first image shows the behavior before I added this feature, and the second shows it after. Previously, scaling did not work well, but after adding the feature it works well. (I ran more game servers in October than in September.)

The reason we never did this in the Agones code is because of the possibility for race conditions. It's entirely possible that a node shutdown could occur around the same time as an allocation -- so this seems very risky to me.

I was aware of this and added logic to avoid allocating resources on nodes tainted by the CA. I didn't mention this feature earlier as it seemed like a separate topic from this discussion, but I should have mentioned it.

Can you expand on this? Why is this more useful from a telemetry perspective?

For example, if using DataDog, various data about the Pod can be collected, such as CPU and memory usage. If metadata is added to the Pod, it can be utilized in the metrics, but if it is added only to the GameServer, it cannot be utilized.

Most telemetry works based on the standard workloads of Kubernetes. Joining the GameServer's metadata with the Pod's metadata is possible, but it would require additional work.

Reasonably commented 1 year ago

Additionally, our sessions are deleted when the game ends.
Therefore, session creation and deletion are frequent, and this may be why the 'packed' scheduling strategy does not work as expected. However, changing this logic would require significant effort, so it is difficult to make changes at this time.

markmandel commented 1 year ago

I'm using a packed scheduling strategy, but it does not work perfectly. Inevitably, there are times when pods remain on a node, and since safe-to-evict is false, the CA does not delete the node. In such cases, the node remains active and costs are incurred.

πŸ€” I am genuinely wondering if what you are seeing is a symptom of #2974 (which just got caught and fixed), and your solution was to allow the CA to re-pack GameServer pods?

The reason we never did this in the Agones code is because of the possibility for race conditions. It's entirely possible that a node shutdown could occur around the same time as an allocation -- so this seems very risky to me.

I was aware of this and added logic to avoid allocating resources on nodes tainted by the CA. I didn't mention this feature earlier as it seemed like a separate topic from this discussion, but I should have mentioned it.

I am wondering how you can ever escape the race condition?

Even if you get the latest node information directly from the k8s api (not through an informer, which would put extra load on the control plane, which would affect throughput and scaling speed), you could still have:

  1. Do check, see node is fine
  2. CA scales down node
  3. Allocation allocates the GameServer
  4. GameServer now gets interrupted.

Unless you have a fancy way of implementing this I'm not thinking of?

@zmerlynn (who is on leave atm, so may take some time to respond), has been doing a lot of work on managing eviction, maybe you have thoughts?

Additionally, our sessions are deleted when the game ends. Therefore, session creation and deletion are frequent, and this may be why the 'packed' scheduling strategy does not work as expected.

I'm assuming you mean each GameServer is deleted as the game ends? If so, packed should work better for you, since there's more opportunity for scale down / reshuffling of GameServers in the cluster.

Can you expand on this? Why is this more useful from a telemetry perspective?

For example, if using DataDog, various data about the Pod can be collected, such as CPU and memory usage. If metadata is added to the Pod, it can be utilized in the metrics, but if it is added only to the GameServer, it cannot be utilized.

πŸ€” Is there another way you could expose this information up to Datadog? Since the game server binary already has the label information, could it append it to the metrics service somehow?

I say this because we have to be cognisant of how much load we put on the control plane, since that directly affects scaling time and allocation throughput. Copying metadata from the GameServer to the Pod just for metrics seems like extra load we should try to avoid if there's another way for you to expose that data to metrics collection agents.

Reasonably commented 1 year ago

I'm using a packed scheduling strategy, but it does not work perfectly. Inevitably, there are times when pods remain on a node, and since safe-to-evict is false, the CA does not delete the node. In such cases, the node remains active and costs are incurred.

πŸ€” I am genuinely wondering if what you are seeing is a symptom of https://github.com/googleforgames/agones/pull/2974 (which just got caught and fixed), and your solution was to allow the CA to re-pack GameServer pods?

The reason we never did this in the Agones code is because of the possibility for race conditions. It's entirely possible that a node shutdown could occur around the same time as an allocation -- so this seems very risky to me.

I was aware of this and added logic to avoid allocating resources on nodes tainted by the CA. I didn't mention this feature earlier as it seemed like a separate topic from this discussion, but I should have mentioned it.

I am wondering how you can ever escape the race condition?

Even if you get the latest node information directly from the k8s api (not through an informer, which would put extra load on the control plane, which would affect throughput and scaling speed), you could still have:

  1. Do check, see node is fine
  2. CA scales down node
  3. Allocation allocates the GameServer
  4. GameServer now gets interrupted.

Unless you have a fancy way of implementing this I'm not thinking of?

@zmerlynn (who is on leave atm, so may take some time to respond), has been doing a lot of work on managing eviction, maybe you have thoughts?

You are right. I raised this issue after reading the document and code my predecessor had written; that document also mentioned avoiding tainted nodes "as much as possible" to prevent race conditions. Currently, there are some GameServers that remain in the Allocated state even though their Pods no longer exist, and I suspect these were affected by the condition you mentioned. I have been using this feature in production so far, and I think problems are rare because only a very small number of sessions can hit the race compared to the total number of sessions in the game. Also, the CA does not delete a node immediately but first marks it as a candidate for deletion, while Agones sets safe-to-evict to false during allocation, which narrows the window for problems.

The pull request you mentioned seems to have just been merged, so I will update, test it, and then report the results. According to my predecessor's document, the reason scaling was unsatisfactory is that even when GameServers are packed as much as possible through affinity, the result is still not optimal.

Can you expand on this? Why is this more useful from a telemetry perspective?

For example, if using DataDog, various data about the Pod can be collected, such as CPU and memory usage. If metadata is added to the Pod, it can be utilized in the metrics, but if it is added only to the GameServer, it cannot be utilized.

πŸ€” Is there another way you could expose this information up to Datadog? Since the game server binary already has the label information, could it append it to the metrics service somehow?

The reason I created this FR is actually not the safe-to-evict issue, but this one. As I mentioned before, a lot of metadata is determined at allocation time for the game I am operating. The reason is a bit complicated, but to put it simply: there is a single dedicated server binary for this game, and the arguments used to launch it are determined at allocation time. I would like to change this structure, but that is not possible. Therefore, it is impossible to tell which map the dedicated server is running, or whether it is in solo/team or normal/competitive mode, just by looking at the Pod. Of course, the solution you suggested is also not possible.

I say this because we have to be cognisant of how much load we put on the control plane, since that directly affects scaling time and allocation throughput. Copying metadata from the GameServer to the Pod just for metrics seems like extra load we should try to avoid if there's another way for you to expose that data to metrics collection agents.

The solution I proposed was to add podMetadata to the Allocation API. This item is optional, so in most cases, there should be no problem.

markmandel commented 1 year ago

Of course, the solution you suggested is also not possible.

I'm not as familiar with the datadog api as you are, but, just looking through their docs: I believe this is what you are wanting to take advantage of: https://docs.datadoghq.com/containers/kubernetes/tag/?tab=containerizedagent#pod-labels-as-tags

Could you instead write a sidecar/small controller for https://docs.datadoghq.com/api/latest/tags/#update-host-tags that looks at GameServer state through the Agones SDK and updates your Pod's tags that way?

I'm just trying to avoid control plane churn.

It sounds like a good third party project though.

zmerlynn commented 1 year ago

Long term, resetting the safe-to-evict flag is not what we want to promote, IMO. The longer term solution in Kubernetes is to allow the terminationGracePeriodSeconds to be configured much higher, but this relies on support for parallel drain support in Cluster Autoscaler. Luckily that was implemented under https://github.com/kubernetes/autoscaler/issues/5079 and is experimental in 1.26.

With higher permitted terminationGracePeriodSeconds, the game server just needs to handle SIGTERM and ignore it for the session length, assuming the session length still fits under the new maximums. This will obviously vary across cloud providers, etc., but is the pattern we really want to support, where SIGTERM will tell the gameserver that CA wants to compact.

Reasonably commented 1 year ago

I applied the change from #2974 that you mentioned and tested it, and the results showed no difference in scaling performance from the control (blue is control, yellow is experimental). So we decided to remove the safe-to-evict=false logic.

Regarding the pod metadata, I will try to add logic elsewhere that accesses the Kubernetes API, as you suggested. Your response was helpful, and I will close this issue.

Thank you.