googleforgames / quilkin

Quilkin is a non-transparent UDP proxy specifically designed for use with large scale multiplayer dedicated game server deployments, to ensure security, access control, telemetry data, metrics and more.
Apache License 2.0

Incorrect Routing of Upstream Packets (Two clients and one game server) #988

Open zezhehh opened 2 months ago

zezhehh commented 2 months ago

What happened:

Hi all,

We've observed an issue where a proxy pod uses the same socket for traffic from two different clients to the same game server, resulting in one of the clients not receiving any responses from the game server. Have any specific edge cases been identified that could cause this?

[Our architecture setup] In the game server kubernetes cluster, we have a Load Balancer that routes to multiple proxy pods (not as sidecars) and control planes with the Agones provider. We’re using the same token for both clients.

[image: architecture diagram]

What you expected to happen:

We expect two clients can receive corresponding responses.

[image: expected request/response flow]

How to reproduce it (as minimally and precisely as possible):

Unknown. Once it started occurring at some point, it kept happening intermittently throughout the day. We suspect there may be a buggy state in a specific pod instance.

Anything else we need to know?:

Environment:

Containers:
  quilkin:
    Container ID:  containerd://9975e4e955b7102297415506422e3c1ebd5b4c39a61bd5039656807e5ae4a1a7
    Image:         us-docker.pkg.dev/quilkin/release/quilkin:0.8.0
    Image ID:      us-docker.pkg.dev/quilkin/release/quilkin@sha256:3f0abe1af9bc378b16af84f7497f273caaf79308fd04ca302974d070fe68b8e2

version: v1alpha1
filters:
  - name: quilkin.filters.capture.v1alpha1.Capture
    config:
      suffix:
        size: 7
        remove: true
  - name: quilkin.filters.token_router.v1alpha1.TokenRouter
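For context, the config above chains two filters: Capture strips a 7-byte suffix token from each packet, and TokenRouter picks the endpoint associated with that token. A rough Python sketch of that capture-then-route flow (the `ROUTES` table, token values, and function names are hypothetical illustrations, not Quilkin's actual code):

```python
# Illustrative sketch of a suffix-capture + token-router flow.
# This is NOT Quilkin's implementation; names and values are hypothetical.

TOKEN_SIZE = 7  # matches the `suffix.size: 7` config above

# Hypothetical token -> endpoint table, as a control plane might populate it.
ROUTES = {
    b"tokenAA": ("10.64.8.46", 8884),
    b"tokenBB": ("10.64.8.46", 7262),
}

def capture_suffix(packet: bytes, size: int = TOKEN_SIZE, remove: bool = True):
    """Split off the last `size` bytes of the packet as the routing token."""
    if len(packet) < size:
        return None, packet  # too short to carry a token
    token = packet[-size:]
    payload = packet[:-size] if remove else packet
    return token, payload

def route(packet: bytes):
    """Return (endpoint, payload) for a packet, or None for an unknown token."""
    token, payload = capture_suffix(packet)
    endpoint = ROUTES.get(token)
    return (endpoint, payload) if endpoint else None

# Two clients sending the same token reach the same endpoint, which is why
# return traffic must be demultiplexed by the proxy's upstream port instead.
assert route(b"hello" + b"tokenBB") == (("10.64.8.46", 7262), b"hello")
```

Note that the token only selects the upstream endpoint; nothing in the token distinguishes the two clients on the return path.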
XAMPPRocky commented 2 months ago

Thank you for your issue! Would you mind testing one of the latest images and see if you can reproduce? Just to eliminate the possibility that it has already been fixed.

zezhehh commented 2 months ago

Thank you for your issue! Would you mind testing one of the latest images and see if you can reproduce? Just to eliminate the possibility that it has already been fixed.

@XAMPPRocky Thanks for your reply. We would appreciate it if there is a latest image available. Is us-docker.pkg.dev/quilkin/release/quilkin:0.9.0-dev-50d91e4 the one to use? Or is there a guidance document for building a custom image?


Edited: `make build-image` works.

markmandel commented 1 month ago

You can also grab one from our PR builds, e.g. https://github.com/googleforgames/quilkin/pull/987#issuecomment-2209254047

zezhehh commented 1 month ago

Thank you for your issue! Would you mind testing one of the latest images and see if you can reproduce? Just to eliminate the possibility that it has already been fixed.

@XAMPPRocky We have tried using the latest image, but unfortunately, the issue persists. Additionally, the CPU and memory usage are much higher than in version v0.8.0. 🥲

XAMPPRocky commented 1 month ago

@zezhehh That is odd, because we have used and tested this setup (one token per gameserver with multiple clients on a single proxy) and haven't had an issue at all. Would you be able to check the load balancer setup you're running in front of the proxy? My first suspicion is there, as we don't put any load balancers in front of the proxies, so that's one difference I see from the setup we have tested.

markmandel commented 1 month ago

@zezhehh can you share what kind of LB it is? (i.e. is it a Google Cloud / AWS LoadBalancer? How is it configured etc?) Maybe there is something in there.

zezhehh commented 1 month ago

@XAMPPRocky @markmandel

We have a GCP Kubernetes LB configured as follows, which should preserve the original source IP:port.

Name:                     quilkin-proxy
Namespace:                quilkin
Labels:                   app.kubernetes.io/managed-by=Terraform
Annotations:              cloud.google.com/l4-rbs: enabled
                          cloud.google.com/neg: {"exposed_ports": {"7337":{},"7338":{}}}
                          cloud.google.com/neg-status:
                            {"network_endpoint_groups":{"7337":"k8s1-c8523907-quilkin-quilkin-proxy-7337-c77ae3d6","7338":"k8s1-c8523907-quilkin-quilkin-proxy-7338-c0...
                          service.kubernetes.io/backend-service: k8s2-e2o2llay-quilkin-quilkin-proxy-6p9r6aaw
                          service.kubernetes.io/firewall-rule: k8s2-e2o2llay-quilkin-quilkin-proxy-6p9r6aaw
                          service.kubernetes.io/firewall-rule-for-hc: k8s2-e2o2llay-quilkin-quilkin-proxy-6p9r6aaw-fw
                          service.kubernetes.io/healthcheck: k8s2-e2o2llay-quilkin-quilkin-proxy-6p9r6aaw
                          service.kubernetes.io/udp-forwarding-rule: a6675171c0d494944ac00781b235adf0
Selector:                 role=proxy
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.64.10.69
IPs:                      10.64.10.69
IP:                       34.34.148.137
LoadBalancer Ingress:     34.34.148.137
Port:                     proxy-udp  7337/UDP
TargetPort:               proxy-udp/UDP
NodePort:                 proxy-udp  30118/UDP
Endpoints:                10.64.64.185:7777,10.64.64.187:7777,10.64.64.74:7777
Port:                     ping-udp  7338/UDP
TargetPort:               ping-udp/UDP
NodePort:                 ping-udp  31821/UDP
Endpoints:                10.64.64.185:7600,10.64.64.187:7600,10.64.64.74:7600
Session Affinity:         None
External Traffic Policy:  Local
HealthCheck NodePort:     30737

The issue has been confirmed by examining dumped UDP traffic in the pcap file, which can be viewed using Wireshark.

Quilkin Proxy Capture.pcap.zip

The file contains data from two ongoing games.

The Quilkin proxy is identified as 10.64.65.99.

In the first game, involving clients 35.189.221.32:38400 and 35.189.221.32:38402, communicating with the game server 10.64.8.46:8884, everything is functioning correctly.

However, in the second game, involving clients 35.189.221.32:38404 and 35.189.221.32:38405, communicating with the game server 10.64.8.46:7262, the client 35.189.221.32:38404 did not receive any response.

An example request can be identified by correlation_id: 044bbeb (starting from packet No. 1835).

[image: Wireshark capture]


For visual reference:

We have a normal game: [image]

And a problematic one: [image]

markmandel commented 1 month ago

To clarify the point a little further before I start going over unit tests and seeing if I can replicate this in one.

  1. Is it intermittent where you get mis-routed packets, or once it starts, it doesn't stop?
  2. Is there any chance the game server could be sending data to the wrong port in the proxy by accident?
zezhehh commented 1 month ago

To clarify the point a little further before I start going over unit tests and seeing if I can replicate this in one.

  1. Is it intermittent where you get mis-routed packets, or once it starts, it doesn't stop?
  2. Is there any chance the game server could be sending data to the wrong port in the proxy by accident?
  1. It's intermittent, yes. Eventually, it occurs more frequently once it "starts."
  2. From the pcap file, we can see that the socket Quilkin used to communicate with the game server at 10.64.8.46:7262 was the same for both clients (35.189.221.32:38404 and 35.189.221.32:38405), both going through 10.64.65.99:34118. The game server did send responses to that same port, but that is the "client" port the game server observed. Therefore, the answer is no (at least in this example).

Note: we also observed that the problematic port tends to be the same one (34118). Not sure if that info is helpful.

markmandel commented 1 month ago

Hmm, not 100% sure I followed that. I need to double check the code, because I know this got optimised a while back (not by me, so I'm not as familiar anymore) so we could handle way more endpoints per proxy, but I'm fairly sure it should be:

[image: expected per-client port mapping]

I.e. for each client connecting, there should be a different port that the gameserver connects to in order to send packets back.

If it's the same port, I'm not sure how we differentiate which packet should go where 🤔 are you saying there is only one quilkin proxy port being used by the gameserver process?
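The demultiplexing expectation described above can be demonstrated with plain UDP sockets: if the proxy opens a distinct upstream socket per client, the server's replies come back on distinct ports and can be mapped to the right client. A small self-contained sketch (an assumption-level illustration, not Quilkin code):

```python
import socket

# A stand-in "game server" that echoes each datagram back to the source
# address it observed (just like the real game server described above).
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
server.settimeout(5)
server_addr = server.getsockname()

# One upstream socket per client session: the invariant under discussion.
# Each client should get its own proxy -> gameserver port.
upstream_a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
upstream_b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for sock in (upstream_a, upstream_b):
    sock.bind(("127.0.0.1", 0))
    sock.settimeout(5)

# Two distinct sockets are guaranteed two distinct local ports.
assert upstream_a.getsockname()[1] != upstream_b.getsockname()[1]

upstream_a.sendto(b"from-client-a", server_addr)
upstream_b.sendto(b"from-client-b", server_addr)

# The server replies to the source port it saw, so each echo lands back
# on the socket that sent the corresponding request.
for _ in range(2):
    data, addr = server.recvfrom(1024)
    server.sendto(data, addr)

reply_a = upstream_a.recv(1024)
reply_b = upstream_b.recv(1024)
assert (reply_a, reply_b) == (b"from-client-a", b"from-client-b")
```

If both clients shared one upstream socket instead, both replies would arrive on the same port and there would be no way to tell which client each belongs to.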

XAMPPRocky commented 1 month ago

If it's one port, my guess is that the load balancer is not preserving the IP:port of the client when sending traffic to the proxy, so to the proxy it looks like a single client.

zezhehh commented 1 month ago

@markmandel @XAMPPRocky Yes, we understand that the issue lies in the fact that the proxy uses the same port for two clients communicating with the game server. However, it is a symptom and not something we intentionally did (we didn't change the source code to remap the socket usage). What we can confirm is:

  1. We are running multiple proxies, but the problem arises when one specific proxy is used to route the traffic for those two clients.
  2. The proxy listens to the two clients from two different source ports (same IPs) after passing through the load balancer.
  3. The proxy forwards the packets from the clients to the game server using the same port.

(Points 2 and 3 can be observed in the dumped UDP traffic.)

XAMPPRocky commented 1 month ago

@zezhehh To clarify, I mean that I'm not sure the load balancer is always providing a unique ip:port pair. Not that you've made a change, but however the load balancer works / is configured, it is not always sending unique addresses.

Would you be able to test this with a NodePort for proxy traffic instead? I think cutting out the load balancer will help us determine if you can replicate it with direct traffic to the proxy.
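The failure mode being suggested here can be illustrated with a toy session table keyed on the downstream source address (purely illustrative; the names and port range are made up, and this is not how Quilkin's internals are structured): if the load balancer rewrites two clients onto the same source ip:port, a proxy keyed that way sees one session and allocates one upstream socket.

```python
# Toy model of a proxy session table keyed by downstream (ip, port).
# Hypothetical structure; this only illustrates the suspected failure
# mode (LB collapsing source addresses), not Quilkin's actual code.

sessions = {}  # (client_ip, client_port) -> upstream port
next_upstream_port = iter(range(34118, 34200))  # pretend ephemeral ports

def session_for(source):
    """Allocate (or reuse) an upstream port for a downstream source address."""
    if source not in sessions:
        sessions[source] = next(next_upstream_port)
    return sessions[source]

# Healthy case: LB preserves ip:port -> two sessions, two upstream ports.
assert session_for(("35.189.221.32", 38404)) != session_for(("35.189.221.32", 38405))

# Failure case: LB collapses both clients onto one source address -> one
# session, so replies for both clients flow back through a single upstream
# port and only one client can receive them.
sessions.clear()
p1 = session_for(("10.64.0.1", 50000))
p2 = session_for(("10.64.0.1", 50000))
assert p1 == p2
```

This is why testing with a NodePort (no LB rewriting in the path) would help isolate where the collapse happens.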

markmandel commented 1 month ago

Yes, we understand that the issue lies in the fact that the proxy uses the same port for two clients communicating with the game server.

That wasn't what I was getting at. What I meant is that the traffic from the proxy to the game server should go over 2 different ports at 10.64.65.99, since there should be a separate port for each backing client (and since there are 2 clients, there should be two ports).

Is that what you are seeing?

The proxy forwards the packets from the clients to the game server using the same port.

It seems like you are... but that leaves me extremely confused, because then ALL traffic back to clients would only go to one client. Without a different port on the proxy for each connection to the game server and back -- there's no way to differentiate where the traffic should head back to.

zezhehh commented 1 month ago

@zezhehh to clarify I mean that I'm not sure that the load balancer is always providing a unique ip:port pair. Not that you've made a change but however the load balancer works / is configured it is not always sending unique addresses.

Would you be able to test this with a NodePort for proxy traffic instead? I think cutting out the load balancer will help us determine if you can replicate it with direct traffic to the proxy.

Hmm, I don't think the Load Balancer is the issue here. We have .spec.externalTrafficPolicy set to Local, as per the official doc:

Local preserves the client source IP and avoids a second hop for LoadBalancer and NodePort type Services, but risks potentially imbalanced traffic spreading.
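For reference, that setting lives on the Service spec. A trimmed fragment (reconstructed from the kubectl describe output earlier in the thread, not the actual manifest) would look like:

```yaml
# Minimal Kubernetes Service fragment showing the setting under discussion.
apiVersion: v1
kind: Service
metadata:
  name: quilkin-proxy
  namespace: quilkin
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local  # preserve client source IP, avoid a second hop
  selector:
    role: proxy
  ports:
    - name: proxy-udp
      protocol: UDP
      port: 7337
```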

zezhehh commented 1 month ago

@markmandel

Sorry for any confusion. I'll try to make it clear.

The proxy forwards the packets from the clients to the game server using the same port.

Yes, that's what we're seeing. The game server records the client socket (which is actually the proxy socket) for each client, identified by user ID, and responds to that specific socket. In other words, the game server doesn't check for conflicts with other sockets; it simply sends the response back to the originating socket.

This situation only occurs when the error arises. In most cases, everything functions as expected: two clients from two sockets...

markmandel commented 1 month ago

Got it - thanks.

Also, I assume there is more than one endpoint in play at this point as well? (Just to replicate as closely as we can in a unit test.)

zezhehh commented 1 month ago

Also, I assume there is more than one endpoint in play at this point as well? (Just to replicate as closely as we can in a unit test.)

@markmandel Yes, the automated test matches occur within the same cluster as the real matches.

markmandel commented 2 weeks ago

I finally got some time to look into this - check out the test I wrote in #1010 -- unfortunately I could not replicate any of your reported issues

Would love you to look at the test though, see if there is something else to the scenario that I didn't manage to capture in the integration test. Let me know if you see anything.

zezhehh commented 1 week ago

I finally got some time to look into this - check out the test I wrote in #1010 -- unfortunately I could not replicate any of your reported issues

Would love you to look at the test though, see if there is something else to the scenario that I didn't manage to capture in the integration test. Let me know if you see anything.

Hmm.. could you try allocating the clients with the same IP but different ports?

markmandel commented 1 week ago

Hmm.. could you try allocating the clients with the same IP but different ports?

Unless you mean something else, the unit test has two sockets on the same IP (localhost) but different ports -- so I believe this tests this scenario, unless I am misunderstanding?

zezhehh commented 1 week ago

Hmm.. could you try allocating the clients with the same IP but different ports?

Unless you mean something else, the unit test has two sockets on the same IP (localhost) but different ports -- so I believe this tests this scenario, unless I am misunderstanding?

Okay, then all good. Thanks! We have some other differences in our setup (same tokens from clients, etc.), but let's talk tomorrow! :)

markmandel commented 1 week ago

Just for easy discovery, assuming there's an issue in Quilkin, it's likely one of these spots:

So weird.