googleforgames / agones

Dedicated Game Server Hosting and Scaling for Multiplayer Games on Kubernetes
https://agones.dev
Apache License 2.0

Kubernetes-Native Service Load Balancer Support for GameServer #3906

Open stevefan1999-personal opened 4 months ago

stevefan1999-personal commented 4 months ago

While I do appreciate that Quilkin is Rust-based and security-oriented, it is clearly a non-starter for many more general game server use cases: it only works for the niche specifically designed around that proxy, or if you can tolerate low performance. Not only does it have high jitter and high tail latency, it is also designed to handle UDP traffic only, with unnecessary traffic analyzers/filters that add further overhead. Per my testing with CS2, it is clearly not esports-ready. Worst of all, not all game servers use UDP: Minecraft uses TCP port 25565, and other game servers are WebSocket-based, like Q3A in a web browser.

As such, I propose that the Agones controller automatically generate a Service of type LoadBalancer for each GameServer, owned by that GameServer. With an ownerReference, the Service is also removed by garbage collection when the GameServer is deleted. We also need to handle the labels and annotations to attach to the Service, because forced load balancer IP address assignment is often handled through annotations.

I have written an external controller PoC to do this. As I briefly explained in https://github.com/googleforgames/agones/issues/3804#issuecomment-2227585224, I have been using Cilium and its Node IPAM LB to get a high-performing L4 load balancer with DDoS protection (integrated into the proxying server itself via Cloudflare Tunnel), so it works regardless of TCP or UDP, and it can even carry SCTP traffic if I want it to. In fact, this could be made more generic, as we could use other load balancers like MetalLB, LoxiLB, HAProxy, AWS or even GCP's own (luxury-ass) cloud load balancer as well. If you self-host your Kubernetes with k3s, you can even use its integrated Klipper "load balancer", which is just iptables SNAT and still works pretty well. (As a cloud Kubernetes engineer who handles more than just game servers, I really want to rant about how expensive the load balancers provided by the public clouds are, heh)

Although this kind of works, I still find it hard for the controller to keep up with GameServer updates. It usually runs into the "resource version too old" problem, which throws the operator's reconcilers into chaos. It generally works well at low volume, but I'm pessimistic about the performance of the external controller in the long run...

So...what if we had it integrated into the Agones controller and the GameServer itself? It could be as easy as adding a new loadBalancer field to the GameServer spec.

spec:
  loadBalancer:
    enabled: true
    loadBalancerClass: io.cilium/node 
    annotations:
      lbipam.cilium.io/sharing-cross-namespace: "*"
      lbipam.cilium.io/sharing-key: " "

(For reference: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/#servicespec-v1-core)

For the functional requirements, the Service ports should be determined automatically from spec.ports in the GameServer. We also need a label selector so that the Service selects the pod generated by the GameServer; Agones handily provides this via the agones.dev/gameserver label, which pins down the specific pod. To handle the TCPUDP case, generate two entries named after the port with a -tcp and a -udp suffix, respectively. For example, for a GameServer with a TCPUDP port named foo, the Service ports would contain foo-tcp and foo-udp.
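To make the functional requirements above concrete, here is a minimal sketch in Python that builds the Service manifest as a plain dict. It is only an illustration, not Agones code: `build_service` and `expand_ports` are hypothetical helper names, `spec.loadBalancer` is the field proposed in this issue (it does not exist in Agones today), and using `containerPort` for both `port` and `targetPort` is an assumption.

```python
from typing import Any

GAMESERVER_LABEL = "agones.dev/gameserver"  # label Agones puts on the GameServer pod


def expand_ports(gs_ports: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Map GameServer spec.ports entries to Service port entries.

    A TCPUDP port becomes two entries, suffixed -tcp and -udp.
    Assumption: containerPort is used for both port and targetPort.
    """
    out = []
    for p in gs_ports:
        protocol = p.get("protocol", "UDP")  # assumed default when unset
        if protocol == "TCPUDP":
            variants = [(f"{p['name']}-tcp", "TCP"), (f"{p['name']}-udp", "UDP")]
        else:
            variants = [(p["name"], protocol)]
        for name, proto in variants:
            out.append({
                "name": name,
                "protocol": proto,
                "port": p["containerPort"],
                "targetPort": p["containerPort"],
            })
    return out


def build_service(gs: dict[str, Any]) -> dict[str, Any]:
    """Build a LoadBalancer Service owned by the given GameServer."""
    meta = gs["metadata"]
    lb = gs["spec"].get("loadBalancer", {})  # hypothetical field from this proposal
    spec: dict[str, Any] = {
        "type": "LoadBalancer",
        "selector": {GAMESERVER_LABEL: meta["name"]},
        "ports": expand_ports(gs["spec"]["ports"]),
    }
    if lb.get("loadBalancerClass"):
        spec["loadBalancerClass"] = lb["loadBalancerClass"]
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": meta["name"],
            "namespace": meta["namespace"],
            "annotations": dict(lb.get("annotations", {})),
            # The owner reference lets garbage collection delete the
            # Service together with the GameServer.
            "ownerReferences": [{
                "apiVersion": "agones.dev/v1",
                "kind": "GameServer",
                "name": meta["name"],
                "uid": meta["uid"],
                "controller": True,
                "blockOwnerDeletion": True,
            }],
        },
        "spec": spec,
    }
```

Keeping the port expansion in its own function means the naming rule for TCPUDP ports can be checked without a cluster.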

As for the non-functional requirements, we could also backfill the allocated load balancer IP into the GameServer's status.addresses, with type LoadBalancer as the indicator, and emit a "Load Balancer Allocated" event on the GameServer once the load balancer is active.
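The backfill described above could look roughly like the following Python sketch. It assumes that GameServer `status.addresses` entries are `{type, address}` pairs and reads the standard Service `status.loadBalancer.ingress` list; the `LoadBalancer` address type is the one proposed in this issue, and `backfill_addresses` is a hypothetical helper name.

```python
from typing import Any

LB_ADDRESS_TYPE = "LoadBalancer"  # address type proposed in this issue


def lb_addresses(service: dict[str, Any]) -> list[dict[str, str]]:
    """Collect load balancer addresses from a Service's status."""
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    found = []
    for entry in ingress:
        # An ingress entry carries an IP or, on some providers, a hostname.
        address = entry.get("ip") or entry.get("hostname")
        if address:
            found.append({"type": LB_ADDRESS_TYPE, "address": address})
    return found


def backfill_addresses(
    gs_addresses: list[dict[str, str]], service: dict[str, Any]
) -> list[dict[str, str]]:
    """Return GameServer addresses with LoadBalancer entries refreshed.

    Non-LoadBalancer entries are kept untouched; stale LoadBalancer
    entries are replaced by whatever the Service currently reports.
    """
    kept = [a for a in gs_addresses if a["type"] != LB_ADDRESS_TYPE]
    return kept + lb_addresses(service)
```

A controller could compare the result with the previous list and emit the "Load Balancer Allocated" event only when LoadBalancer entries appear for the first time.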

Another useful but optional non-functional requirement: the behavior is currently undefined when the load balancer is deactivated (meaning it had load balancer IPs populated, but then went back to the Pending state). We could have three strategies: unmark the game server as allocated; or just remove the LB IPs and do nothing else but emit an error indicating "Load Balancer IP lost"; or delete the game server as a whole, let the upper deployment regenerate it, and hope that a new IP can be allocated. This behavior would be controlled by a new field, spec.loadBalancer.loadBalancerIpLostPolicy, with one of the values "Deallocate" (mark the game server as deallocated and let it handle the situation itself, for example by evicting all the players and doing an in-place restart to keep a clean game world state, rather than deleting the game server, so as not to put too much pressure on the scheduler), "Repopulate" (the default: don't touch the game server, but delete the lost load balancer addresses from its status and keep watching until the load balancer IPs are regenerated) or "Delete" (shut the game server down completely and let the upper management layer regenerate a new one; this is the least invasive way to quickly reconcile the game server state, but it puts the most pressure on the scheduler). This one would require a lot of expertise to implement, though. To further complicate the case, we also need to define the behavior when only one of the load balancer IPs is gone. Say I have an IPv4 and an IPv6 load balancer address handed to the GameServer and the IPv6 address is lost while the game server is still functional: what about the players who are actually using IPv6 to connect to the game server?
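As a rough illustration of how the three policy values above might be dispatched, here is a small Python sketch. Everything in it is hypothetical: the `Action` names and the `decide` function are invented for illustration, and treating only a complete loss of addresses as "lost" is an assumption, since the partial-loss case (for example IPv6 gone, IPv4 still present) is left open in this proposal.

```python
from enum import Enum


class LostPolicy(str, Enum):
    """Values proposed for spec.loadBalancer.loadBalancerIpLostPolicy."""
    DEALLOCATE = "Deallocate"
    REPOPULATE = "Repopulate"  # proposed default
    DELETE = "Delete"


class Action(str, Enum):
    """Hypothetical controller actions, named for this sketch only."""
    NONE = "none"
    DEALLOCATE = "deallocate-gameserver"
    CLEAR_ADDRESSES = "clear-lb-addresses-and-keep-watching"
    DELETE = "delete-gameserver"


def decide(policy: LostPolicy, previous: list[str], current: list[str]) -> Action:
    """Pick an action after comparing previous and current LB addresses.

    Assumption: only a complete loss (had addresses, now has none) counts
    as lost. Partial loss returns NONE because its behavior is undefined.
    """
    if not previous or current:
        return Action.NONE
    return {
        LostPolicy.DEALLOCATE: Action.DEALLOCATE,
        LostPolicy.REPOPULATE: Action.CLEAR_ADDRESSES,
        LostPolicy.DELETE: Action.DELETE,
    }[policy]
```

For example, `decide(LostPolicy("Repopulate"), ["203.0.113.7"], [])` selects the clear-and-keep-watching action, and in all three cases the controller would also emit the "Load Balancer IP lost" error event.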

But the functional requirements alone would already be good enough.

Doesn't this look like easy enough low-hanging fruit, and much better when integrated, rather than having to watch the GameServers externally and look for my own k8s.stevefan1999.tech/cilium-load-balancer-enabled and k8s.stevefan1999.tech/cilium-egress-gateway-policy-enabled in each GameServer's annotations? Also, if this is integrated, the resource is always going to be up to date and in sync, which also means less performance impact on the Kubernetes API server.

I will open-source the PoC controller soon, but keep in mind that it only supports Cilium right now. Although it could be made more generic, I still have some specific logic that also integrates the game server with Cilium's CiliumEgressGatewayPolicy, to make sure the game server's egress IP matches what the master server sees: Valve doesn't let you customize the IP address in their query server, and the source IP is determined from the egress IP that connects to their master server. Otherwise I would have to inject a custom library written in C++ and natively hook Steam's GetExternalIP function to rewrite it, which is of course not ideal. I also have a PoC for this, but I think it is not worth it. That has to be an external feature on its own, though.

I expect this to be a year-long feature, because obviously no one at Google has the time and resources to do it. I wrote my own external controller in C# using https://github.com/buehler/dotnet-operator-sdk in just 2 days, but I have still retained my Golang knowledge from 2 years ago. If the reception is good, then fine, I'll do it myself.

github-actions[bot] commented 2 weeks ago

This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale' please add 'awaiting-maintainer' label or add a comment. Thank you for your contributions

stevefan1999-personal commented 2 weeks ago

no-stale