Open stevefan1999-personal opened 4 months ago
While I do appreciate that Quilkin is Rust-based and security-oriented, it is clearly a non-starter for many general game server use cases: it only works well for the niche it was specifically designed for, or if you can tolerate low performance. Not only does it have high jitter and high tail latency, it is also designed to handle UDP traffic only, with unnecessary traffic analyzers/filters that add even more overhead. Per my testing with CS2, it is clearly not eSports-ready. Worst of all, not all game servers use UDP: Minecraft uses TCP port 25565, and there are WebSocket-based game servers such as Q3A in the web browser.
As such, I would propose a way to expose the `GameServer` in Agones: have the Agones controller automatically generate a `Service` of type `LoadBalancer` that is owned by the `GameServer`. Using an ownerReference means the `Service` is also automatically deleted when the `GameServer` is deleted, as part of garbage collection. We also need to handle labels and annotations to attach to the load balancer, because forcing a specific load balancer IP address is often done through annotations.

I have written an external controller PoC to do this. As I briefly explained in https://github.com/googleforgames/agones/issues/3804#issuecomment-2227585224, I have been using Cilium and its Node IPAM LB to get a high-performing L4 load balancer with DDoS protection (integrated into the proxying server itself via Cloudflare Tunnel), so it works regardless of TCP or UDP; it can even carry SCTP traffic if I want it to. In fact, this could be made more generic, as we could use other load balancers like MetalLB, LoxiLB, HAProxy, AWS or even GCP's own (luxury-ass) cloud load balancer as well. If you self-host your Kubernetes with k3s, you can even use its integrated Klipper "load balancer", which is just iptables SNAT and still works pretty well.
(As a cloud Kubernetes engineer who handles more than just game servers, I really want to rant about the sheer expense of load balancers provided by the public clouds, heh)

Although this kind of works, I still find it hard for the controller to keep up with `GameServer` updates. It usually runs into the "resource version too old" problem, which throws the operator's reconcilers into chaos. It generally works well at low volume, but I'm pessimistic about the performance of the external controller in the long run...
So... what if we integrate it into the Agones controller and the `GameServer` itself? It could be as easy as adding a new `loadBalancer` field to the `GameServer` spec.

(For reference: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/#servicespec-v1-core)
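A minimal sketch of the ownership part of this idea, working on plain manifest dictionaries. The `spec.loadBalancer` field with its `labels` and `annotations` sub-fields is the hypothetical field proposed here, not an existing Agones API, and the annotation key and UID in the example are placeholders. The `ownerReferences` entries are the standard Kubernetes ones.

```python
def service_metadata(gameserver):
    """Build metadata for a Service that is owned by the given GameServer manifest."""
    meta = gameserver["metadata"]
    # Hypothetical field from this proposal; not part of the current Agones API.
    lb = gameserver["spec"].get("loadBalancer", {})
    return {
        "name": meta["name"],
        "namespace": meta.get("namespace", "default"),
        # Pass labels/annotations through, since many load balancers are
        # configured (for example, pinned to an IP) via annotations.
        "labels": dict(lb.get("labels", {})),
        "annotations": dict(lb.get("annotations", {})),
        # Standard owner reference: when the GameServer is deleted, the
        # Kubernetes garbage collector removes the Service as well.
        "ownerReferences": [{
            "apiVersion": gameserver["apiVersion"],
            "kind": gameserver["kind"],
            "name": meta["name"],
            "uid": meta["uid"],
            "controller": True,
            "blockOwnerDeletion": True,
        }],
    }


# Example GameServer manifest; the UID and annotation key are placeholders.
gameserver = {
    "apiVersion": "agones.dev/v1",
    "kind": "GameServer",
    "metadata": {"name": "cs2-abcde", "namespace": "default", "uid": "1234-example"},
    "spec": {
        "loadBalancer": {"annotations": {"example.com/lb-ip": "203.0.113.10"}},
    },
}

metadata = service_metadata(gameserver)
print(metadata["ownerReferences"][0]["kind"], metadata["annotations"])
```

An in-tree implementation would do the same thing with typed objects; the point is only that the owner reference and the pass-through metadata are all the Service needs from the `GameServer` at this stage.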
For the functional requirements, the `Service` `ports` should be automatically determined by `spec.ports` in the `GameServer`. We also need a label selector so that the service selects the pod generated by the `GameServer`; Agones handily provides this via the `agones.dev/gameserver` label, which pins down the specific pod as well. To handle the `TCPUDP` case, generate two entries, with the port name plus a `-tcp` and a `-udp` suffix, respectively. For example, for a `GameServer` with a `TCPUDP` port named `foo`, we would have `foo-tcp` and `foo-udp` in the `Service` `ports`.

As for the non-functional requirements, we could also backfill the allocated load balancer IP to the
`GameServer` itself in `status.addresses`, with type `LoadBalancer` as an indication, and emit the event `Load Balancer Allocated` on the `GameServer` once it is active.

Another useful but optional non-functional requirement: it is currently undefined what happens if the load balancer is deactivated (that is, it had load balancer IPs populated but then went back to the
`Pending` state). We could offer three strategies: mark the game server as deallocated; just remove the LB IPs and do nothing else except emit an error indicating `Load Balancer IP lost`; or delete the game server as a whole, let the upper-level deployment regenerate it, and hope that a new IP can be allocated. This behavior would be controlled by a new field, `spec.loadBalancer.loadBalancerIpLostPolicy`, with one of the following values. "Deallocate" sets the game server to be deallocated and lets the game server handle it, for example by evicting all the players and doing an in-place restart to keep a clean game world state, rather than deleting the game server, so as not to put too much pressure on the scheduler. "Repopulate" (the default) doesn't touch the game server but deletes the lost load balancer addresses in status, and keeps watching until the load balancer IPs are regenerated. "Delete" totally shuts down the game server and lets the upper management layer regenerate a new one; this is the least invasive way to quickly reconcile the game server state, but it puts the most pressure on the scheduler. But this one would require a lot of expertise to implement. To further complicate the case, we also need to define the behavior if only one of the load balancer IPs is lost. Say I have an IPv4 and an IPv6 load balancer address handed to the `GameServer`, and the IPv6 address is lost while the game server is still functional: what about the players who are actually using IPv6 to connect to the game server?

But the functional requirements alone are good enough to start with.
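The functional requirements alone can be sketched roughly as follows, again on plain manifest dictionaries. Two things here are assumptions rather than settled design: the Service `port` and `targetPort` are both taken from the GameServer port's `containerPort`, and a port without an explicit `protocol` is treated as UDP, which is the Agones default. The example name and port numbers are placeholders.

```python
SELECTOR_LABEL = "agones.dev/gameserver"  # label Agones puts on the GameServer's pod


def service_ports(gs_ports):
    """Translate GameServer spec.ports entries into Service port entries."""
    ports = []
    for p in gs_ports:
        protocol = p.get("protocol", "UDP")
        if protocol == "TCPUDP":
            # One GameServer port becomes two Service ports with suffixed names.
            variants = [(p["name"] + "-tcp", "TCP"), (p["name"] + "-udp", "UDP")]
        else:
            variants = [(p["name"], protocol)]
        for name, proto in variants:
            ports.append({
                "name": name,
                "protocol": proto,
                # Assumption: expose the container port unchanged.
                "port": p["containerPort"],
                "targetPort": p["containerPort"],
            })
    return ports


def service_spec(gameserver):
    """Build a LoadBalancer Service spec that selects the GameServer's pod."""
    return {
        "type": "LoadBalancer",
        "selector": {SELECTOR_LABEL: gameserver["metadata"]["name"]},
        "ports": service_ports(gameserver["spec"].get("ports", [])),
    }


# Example GameServer with one TCPUDP port and one TCP port (placeholder values).
gameserver = {
    "metadata": {"name": "cs2-abcde"},
    "spec": {
        "ports": [
            {"name": "foo", "containerPort": 27015, "protocol": "TCPUDP"},
            {"name": "rcon", "containerPort": 27020, "protocol": "TCP"},
        ],
    },
}

spec = service_spec(gameserver)
for port in spec["ports"]:
    print(port["name"], port["protocol"], port["port"])
```

Keeping the port translation in its own function makes the `TCPUDP` split easy to test in isolation, independent of how the rest of the Service is assembled.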
Doesn't this look like easy enough low-hanging fruit, and much better when integrated, rather than having to watch the `GameServer` externally and look for my own `k8s.stevefan1999.tech/cilium-load-balancer-enabled` and `k8s.stevefan1999.tech/cilium-egress-gateway-policy-enabled` in each GameServer's annotations? Also, if this is integrated, the resource is always going to be up to date and in sync, which also means less load on the Kubernetes API server.

I will open-source the PoC controller soon, but keep in mind that it only supports Cilium right now. Although it could be made more generic, I still have some specific logic that integrates the game server with Cilium's `CiliumEgressGatewayPolicy` to make sure the game server's egress IP matches what the master server sees, because Valve doesn't let you customize the IP address in their query server, and the source IP is determined from the egress IP that connects to their master server... Otherwise I would have to inject a custom library and natively hook Steam's `GetExternalIP` function in C++ to rewrite it, which is of course not ideal. I also have a PoC for this, but I think it is not worth it. That has to be an external feature on its own, though.

This is expected to be a year-long feature, because obviously no one at Google has the time and resources to do it. I wrote my own external controller in C# using https://github.com/buehler/dotnet-operator-sdk in just 2 days, but I still retain my Golang knowledge from 2 years ago. If the reception is good, then fine, I'll do it myself.