googleforgames / quilkin

Quilkin is a non-transparent UDP proxy specifically designed for use with large scale multiplayer dedicated game server deployments, to ensure security, access control, telemetry data, metrics and more.
Apache License 2.0

High P99 latency #718

Closed luckyswede closed 1 year ago

luckyswede commented 1 year ago

Hi,

I'm doing some simple latency tests of ping messages via Quilkin and I'm seeing high p99 response times (histograms below). I'm using Fortio to run the load tests, from a pod in the same k8s cluster as the proxy and the gameserver. When I run directly against the gameserver I get a p99 of 1ms, and this number is stable over multiple runs. When I run the same test via Quilkin, p90 is good but p99 is around 100ms, also consistently over multiple runs.

The CPU usage is low, and there are no reported network errors or anything else that indicates any issues.

Test run via proxy:

# fortio load -c 3 -qps 360 -n 3000 -payload abc0123abc0123 udp://10.4.130.12:7000
17:49:21 I scli.go:88> Starting Φορτίο 1.53.1 h1:aeAWrnkXIBEdhueQLRwvaM5INulybDP+ZgOChgKz/NQ= go1.19.7 amd64 linux
Fortio 1.53.1 running at 360 queries per second, 16->16 procs, for 3000 calls: udp://10.4.130.12:7000
17:49:21 I udprunner.go:226> Starting udp test for udp://10.4.130.12:7000 with 3 threads at 360.0 qps
Starting at 360 qps with 3 thread(s) [gomax 16] : exactly 3000, 1000 calls each (total 3000 + 0)
17:49:29 I periodic.go:829> T000 ended after 8.334127252s : 1000 calls. qps=119.98856866026648
17:49:29 I periodic.go:829> T002 ended after 8.334184582s : 1000 calls. qps=119.98774327122287
17:49:29 I periodic.go:829> T001 ended after 8.334190754s : 1000 calls. qps=119.98765441264341
Ended after 8.334224693s : 3000 calls. qps=359.96
Aggregated Sleep Time : count 2997 avg -0.0031391737 +/- 0.02786 min -0.182174958 max 0.008007254 sum -9.40810346
# range, mid point, percentile, count
>= -0.182175 <= -0.001 , -0.0915875 , 16.92, 507
> -0.001 <= 0 , -0.0005 , 17.02, 3
> 0 <= 0.001 , 0.0005 , 17.12, 3
> 0.002 <= 0.003 , 0.0025 , 17.28, 5
> 0.003 <= 0.004 , 0.0035 , 17.52, 7
> 0.004 <= 0.005 , 0.0045 , 17.82, 9
> 0.005 <= 0.006 , 0.0055 , 18.42, 18
> 0.006 <= 0.007 , 0.0065 , 27.43, 270
> 0.007 <= 0.008 , 0.0075 , 99.90, 2172
> 0.008 <= 0.00800725 , 0.00800363 , 100.00, 3
# target 50% 0.00731146
WARNING 16.92% of sleep were falling behind
Aggregated Function Time : count 3000 avg 0.0017685162 +/- 0.01253 min 0.000181624 max 0.189931249 sum 5.30554874
# range, mid point, percentile, count
>= 0.000181624 <= 0.001 , 0.000590812 , 98.00, 2940
> 0.001 <= 0.002 , 0.0015 , 98.30, 9
> 0.002 <= 0.003 , 0.0025 , 98.43, 4
> 0.003 <= 0.004 , 0.0035 , 98.50, 2
> 0.004 <= 0.005 , 0.0045 , 98.60, 3
> 0.02 <= 0.025 , 0.0225 , 98.67, 2
> 0.025 <= 0.03 , 0.0275 , 98.70, 1
> 0.08 <= 0.09 , 0.085 , 98.90, 6
> 0.09 <= 0.1 , 0.095 , 99.40, 15
> 0.1 <= 0.12 , 0.11 , 99.80, 12
> 0.12 <= 0.14 , 0.13 , 99.90, 3
> 0.18 <= 0.189931 , 0.184966 , 100.00, 3
# target 50% 0.000599026
# target 75% 0.000807867
# target 90% 0.000933171
# target 99% 0.092
# target 99.9% 0.14
Error cases : no data
Sockets used: 3 (for perfect no error run, would be 3)
Total Bytes sent: 42000, received: 42000
udp OK : 3000 (100.0 %)
All done 3000 calls (plus 0 warmup) 1.769 ms avg, 360.0 qps

Test run directly towards the gameserver:

# fortio load -c 3 -qps 360 -n 3000 -payload abc0123abc0123 udp://10.4.146.138:7654 
17:53:45 I scli.go:88> Starting Φορτίο 1.53.1 h1:aeAWrnkXIBEdhueQLRwvaM5INulybDP+ZgOChgKz/NQ= go1.19.7 amd64 linux
Fortio 1.53.1 running at 360 queries per second, 16->16 procs, for 3000 calls: udp://10.4.146.138:7654
17:53:45 I udprunner.go:226> Starting udp test for udp://10.4.146.138:7654 with 3 threads at 360.0 qps
Starting at 360 qps with 3 thread(s) [gomax 16] : exactly 3000, 1000 calls each (total 3000 + 0)
17:53:53 I periodic.go:829> T002 ended after 8.334296336s : 1000 calls. qps=119.98613436391734
17:53:53 I periodic.go:829> T000 ended after 8.334358876s : 1000 calls. qps=119.98523400277922
17:53:53 I periodic.go:829> T001 ended after 8.334295561s : 1000 calls. qps=119.98614552133952
Ended after 8.334387482s : 3000 calls. qps=359.95
Sleep times : count 2997 avg 0.0074122955 +/- 0.0004388 min 0.000255187 max 0.008153068 sum 22.2146496
Aggregated Function Time : count 3000 avg 0.00020268997 +/- 4.417e-05 min 0.000148956 max 0.001296863 sum 0.60806992
# range, mid point, percentile, count
>= 0.000148956 <= 0.001 , 0.000574478 , 99.90, 2997
> 0.001 <= 0.00129686 , 0.00114843 , 100.00, 3
# target 50% 0.000574762
# target 75% 0.000787807
# target 90% 0.000915634
# target 99% 0.00099233
# target 99.9% 0.001
Error cases : no data
Sockets used: 3 (for perfect no error run, would be 3)
Total Bytes sent: 42000, received: 42000
udp OK : 3000 (100.0 %)
All done 3000 calls (plus 0 warmup) 0.203 ms avg, 360.0 qps
luckyswede commented 1 year ago

Environment: K8s: 1.25.6-gke.1000 image: us-docker.pkg.dev/quilkin/ci/quilkin:0.6.0-dev-6b24e2e

markmandel commented 1 year ago

🤔 that seems odd. How are you deploying Quilkin in front of your GameServer? Can you show what the topology is?

luckyswede commented 1 year ago

The topology is routing-token based, i.e. a shared set of proxies in front of a pool of gameservers. But they are all running in the same cluster.

XAMPPRocky commented 1 year ago

If we fix up https://github.com/googleforgames/quilkin/pull/657 that would also help, as then you would be able to see how much time is being spent in the actual program.

markmandel commented 1 year ago

The topology is routing-token based, i.e. a shared set of proxies in front of a pool of gameservers. But they are all running in the same cluster.

Got it - so it's going from Service to GameServer when routed through Quilkin? (Vs direct through a hostPort I assume when connecting directly)?

When looking at the metrics for Quilkin, what sort of processing time are you seeing there?

I'm also wondering if the initial UDP connection via the Service accounts for the 100ms? I assume you're running multiple clients?

luckyswede commented 1 year ago

Got it - so it's going from Service to GameServer when routed through Quilkin? (Vs direct through a hostPort I assume when connecting directly)?

I'm not sure I follow. In these tests I have been going directly to the Quilkin pod IP or the GameServer pod IP.

Here is a graph for when I run 3 clients in parallel, 120 pings/second each, 10000 messages each. The latency numbers were (in seconds):

# target 50% 0.000620712
# target 75% 0.000816828
# target 90% 0.000934497
# target 99% 0.120323
# target 99.9% 0.172

Here is a graph of processing time for this run:

Screenshot 2023-03-16 at 21 09 06

Btw, what does the quilkin_packets_processing_duration_seconds_bucket metric measure? How do I best display it with PromQL?

luckyswede commented 1 year ago

Another run I did looks like below, where I had a peak in the middle of the run:

Screenshot 2023-03-16 at 21 54 31
luckyswede commented 1 year ago

Oh, it's the read label that is significantly higher than write.

markmandel commented 1 year ago

quilkin_packets_processing_duration_seconds is a histogram.

https://googleforgames.github.io/quilkin/main/book/services/proxy/metrics.html#:~:text=quilkin_packets_processing_duration_seconds

This is how we show it in our example grafana dashboards:

https://github.com/googleforgames/quilkin/blob/df9ffe11f24f7588d1b188e6c7c0132c7974d491/examples/grafana-dashboards/dashboards/dashboard-gameservers.yaml#L576-L615

markmandel commented 1 year ago

You can see a sample dashboard here: https://github.com/googleforgames/quilkin/blob/main/examples/grafana-dashboards/quilkin-general.png

The bottom right is the graph of processing time, which matches my usual experience: around 500µs at p99.

luckyswede commented 1 year ago

The bottom right is the graph of processing time, which matches my usual experience: around 500µs at p99.

Could what I experience have to do with the token routing? Or are there any config parameters regarding pool sizes, buffers, etc that I should tweak?

markmandel commented 1 year ago

Can you run histogram_quantile(0.99, sum(rate(quilkin_session_duration_secs_bucket[$__rate_interval])) by (le)) so we can see the histogram breakdown for the 99th percentile, please?

luckyswede commented 1 year ago

Well, for the time period around the last graph, this is all I have. It doesn't make much sense to me :)

Screenshot 2023-03-16 at 23 39 31
luckyswede commented 1 year ago

An interesting observation I made today is that since I started running more tests via Quilkin, I regularly get liveness probe failures in the kubelet logs: Liveness probe failed: Get "http://10.4.135.2:9091/live": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

markmandel commented 1 year ago

Can you do me a favour, since pasting PromQL back and forth isn't quite working out - can you install our example dashboard on a Grafana instance? It has a performance graph baked in.

https://github.com/googleforgames/quilkin/tree/main/examples/grafana-dashboards

Also, do you see any errors in the Quilkin logs, any restarts in the pods, anything like that?

luckyswede commented 1 year ago

I'll do it first thing next week and get back to you.

Regarding errors, the only one I see is the liveness probe error I wrote about in my last message.

luckyswede commented 1 year ago

@markmandel here are a couple of screenshots from that dashboard. The first one covers the times when I was running the tests before reporting this issue; the second one shows an overview of the day after, when I was running other tests.

Screenshot 2023-03-19 at 19 24 53
Screenshot 2023-03-19 at 19 23 52
XAMPPRocky commented 1 year ago

FWIW, I do also occasionally see similar spikes of latency. However, they're not of the same order of magnitude: most of the time processing takes 250 microseconds, with occasional spikes to 500 microseconds. Can you share your filter configuration? That would be helpful in diagnosing what's causing the spike.

luckyswede commented 1 year ago

@XAMPPRocky here is my filter:

version: v1alpha1
filters:
  - name: quilkin.filters.capture.v1alpha1.Capture
    config:
      suffix:
        size: 7
        remove: true
  - name: quilkin.filters.token_router.v1alpha1.TokenRouter
luckyswede commented 1 year ago

And proxy:9091/config typically returns

{
  "admin": {
    "address": "0.0.0.0:9091"
  },
  "clusters": {
    "default": {
      "localities": [
        {
          "locality": null,
          "endpoints": [
            {
              "address": "xyz.europe-west1-d.c.goals-gameservers.internal:9542",
              "metadata": {
                "quilkin.dev": {
                  "tokens": [
                    "TUVhVnpoNw=="
                  ]
                }
              }
            },
            {
              "address": "xyz.europe-west1-d.c.goals-gameservers.internal:10488",
              "metadata": {
                "quilkin.dev": {
                  "tokens": [
                    "SEJYM2RwQQ=="
                  ]
                }
              }
            },
            {
              "address": "xyz.europe-west1-d.c.goals-gameservers.internal:22651",
              "metadata": {
                "quilkin.dev": {
                  "tokens": [
                    "NzNva0pQbg=="
                  ]
                }
              }
            },
            {
              "address": "xyz.europe-west1-c.c.goals-gameservers.internal:7151",
              "metadata": {
                "quilkin.dev": {
                  "tokens": [
                    "S2pOOVg3OA=="
                  ]
                }
              }
            },
            {
              "address": "xyz.europe-west1-c.c.goals-gameservers.internal:7183",
              "metadata": {
                "quilkin.dev": {
                  "tokens": [
                    "SlR5WnYzQw=="
                  ]
                }
              }
            },
            {
              "address": "xyz.europe-west1-c.c.goals-gameservers.internal:8491",
              "metadata": {
                "quilkin.dev": {
                  "tokens": [
                    "YWJjMDEyNA=="
                  ]
                }
              }
            },
            {
              "address": "xyz.europe-west1-c.c.goals-gameservers.internal:8696",
              "metadata": {
                "quilkin.dev": {
                  "tokens": [
                    "WktyVFgxaA=="
                  ]
                }
              }
            },
            {
              "address": "xyz.europe-west1-c.c.goals-gameservers.internal:8799",
              "metadata": {
                "quilkin.dev": {
                  "tokens": [
                    "eHphYjNDeg=="
                  ]
                }
              }
            },
            {
              "address": "xyz.europe-west1-c.c.goals-gameservers.internal:9118",
              "metadata": {
                "quilkin.dev": {
                  "tokens": [
                    "WG9zUHZWZA=="
                  ]
                }
              }
            },
            {
              "address": "xyz.europe-west1-c.c.goals-gameservers.internal:12717",
              "metadata": {
                "quilkin.dev": {
                  "tokens": [
                    "WU5pd29yag=="
                  ]
                }
              }
            },
            {
              "address": "xyz.europe-west1-c.c.goals-gameservers.internal:15287",
              "metadata": {
                "quilkin.dev": {
                  "tokens": [
                    "NE1uaG11ZA=="
                  ]
                }
              }
            },
            {
              "address": "xyz.europe-west1-c.c.goals-gameservers.internal:17967",
              "metadata": {
                "quilkin.dev": {
                  "tokens": [
                    "YWJjMDEyMw=="
                  ]
                }
              }
            },
            {
              "address": "xyz.europe-west1-c.c.goals-gameservers.internal:19126",
              "metadata": {
                "quilkin.dev": {
                  "tokens": [
                    "YWJjMDEyNQ=="
                  ]
                }
              }
            },
            {
              "address": "xyz.europe-west1-b.c.goals-gameservers.internal:10418",
              "metadata": {
                "quilkin.dev": {
                  "tokens": [
                    "aWtSdzVFdA=="
                  ]
                }
              }
            },
            {
              "address": "xyz.europe-west1-b.c.goals-gameservers.internal:12666",
              "metadata": {
                "quilkin.dev": {
                  "tokens": [
                    "aDFGdzlNTg=="
                  ]
                }
              }
            }
          ]
        }
      ]
    }
  },
  "filters": [
    {
      "name": "quilkin.filters.capture.v1alpha1.Capture",
      "config": {
        "metadataKey": "quilkin.dev/capture",
        "suffix": {
          "size": 7,
          "remove": true
        }
      }
    },
    {
      "name": "quilkin.filters.token_router.v1alpha1.TokenRouter",
      "config": null
    }
  ],
  "id": "quilkin-proxies-84844c979b-wzrdg",
  "management_servers": [
    {
      "address": "http://quilkin-manage-agones:80"
    }
  ],
  "port": 7000,
  "version": "v1alpha1",
  "maxmind_db": null
}
luckyswede commented 1 year ago

I have tested two other setups in the same cluster, towards the same gameservers as before.

With quilkin configured to simply forward directly to the gameserver: proxy --to 10.4.146.136:7654, I get the following latencies (seconds):

# target 50% 0.000594626
# target 75% 0.000749476
# target 90% 0.000842386
# target 99% 0.000898132
# target 99.9% 0.000903707

And with quilkin configured with token routing, but using a static config without the xDS server:

        filters:
        - name: quilkin.filters.capture.v1alpha1.Capture
          config:
              suffix:
                  size: 7
                  remove: true
        - name: quilkin.filters.token_router.v1alpha1.TokenRouter
        clusters:
          default:
            localities:
              - endpoints:
                  - address: 10.4.146.136:7654
                    metadata: 
                        quilkin.dev:
                            tokens: 
                            - YWJjMDEyMw==

I get the following latencies:

# target 50% 0.000735039
# target 75% 0.00088105
# target 90% 0.000968656
# target 99% 0.00286538
# target 99.9% 0.00347547

I.e., from my perspective the occasional high latencies seem to be related to the management server integration.

luckyswede commented 1 year ago

@markmandel @XAMPPRocky please note that I am running the build from https://github.com/googleforgames/quilkin/pull/683, if that makes a difference

markmandel commented 1 year ago

Thanks for that! That's really comprehensive! Appreciate the efforts.

I'm at GDC this week, but will be keen to look into this on my return (unless @XAMPPRocky beats me to it).

Do you have a test suite you are running? I would like to be able to replicate this locally.

As @XAMPPRocky mentioned, #657 may need to be finished to see what is causing those latency spikes. I'm guessing it's a lock somewhere being annoying.

luckyswede commented 1 year ago

My latest findings:

Running on a setup where the proxy gets its config from the xds server, and the xds server uses a static config from a file, gives these results:

$> fortio load -c 3 -qps 360 -n 9000 -payload abc0123abc0123 udp://10.4.128.9:7000
# target 50% 0.00062651
# target 75% 0.000813421
# target 90% 0.000925568
# target 99% 0.000992856
# target 99.9% 0.000999585

When using the same setup except I configure the xds server to read from Agones, I get these numbers:

$> fortio load -c 3 -qps 360 -n 9000 -payload abc0123abc0123 udp://10.4.128.9:7000
# target 50% 0.000617083
# target 75% 0.000815371
# target 90% 0.000934344
# target 99% 0.0378431
# target 99.9% 0.058

So, something fishy is going on when the management server gets its config from Agones. It's a bit weird that it leaks into the proxy, almost as if a blocking call to the mgmt server is sometimes made during a read/write operation in the proxy. Are endpoints cached in a read-through cache, maybe?

I use https://github.com/fortio/fortio for testing.

luckyswede commented 1 year ago

@markmandel if I change 500->5000 here: https://github.com/googleforgames/quilkin/blob/main/src/xds/client.rs#L288 I get really good p99 latency:

# target 50% 0.000617077
# target 75% 0.000810083
# target 90% 0.000925886
# target 99% 0.000995368
# target 99.9% 0.045

so the problem is somewhere in that loop :)

luckyswede commented 1 year ago

Found the issue @markmandel and @XAMPPRocky :) The problem occurs when the xds server returns hostnames as endpoint addresses, which causes the proxy to do an IP lookup on the hostname, and that takes time... We've verified this with a local build where the xds server converts hostnames to IP addresses immediately in the Agones integration, so the proxy always gets IP addresses.

A better fix would be to make the name->IP lookup async in the proxy somehow.
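
For illustration only (this is not Quilkin's code, and the endpoint name is just an example): resolving a hostname with the standard library goes through the OS resolver synchronously, so a resolver cache miss on the packet path shows up directly as proxy latency.

use std::net::ToSocketAddrs;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    // In the setup above this would be an internal GCE hostname handed out by
    // the xDS server; "localhost" keeps the example runnable anywhere.
    let endpoint = "localhost:7654";

    let start = Instant::now();
    // `to_socket_addrs` performs a blocking DNS lookup via the OS resolver.
    // On a cache miss this can take tens of milliseconds, the same order of
    // magnitude as the p99 spikes reported above.
    let resolved: Vec<_> = endpoint.to_socket_addrs()?.collect();
    println!("resolved {:?} in {:?}", resolved, start.elapsed());
    Ok(())
}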

markmandel commented 1 year ago

Oh wow! That's some serious investigative work!

Everything I do always uses IPs, so that would also explain why I never see this.

I wonder if it would work to convert the hostname to an IP in the xDS server before sending it to the proxy? 🤔

XAMPPRocky commented 1 year ago

I wonder if it would work to convert the hostname to an IP in the xDS server before sending it to the proxy?

Well, that would eliminate some of the benefit of using DNS for your upstream if we only check once. Part of the reason to use DNS is so you can have a single name that can resolve to multiple addresses, and maybe you can't rely on it always being the same address because of the cloud or ISP provider.

I think the solution is to have a registry of hostnames to IPs that is TTL'd, similar to the session map, so that the resolution is cached for frequent use, but the value is refreshed once enough time has passed.
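
A minimal sketch of that idea (names are hypothetical, and the resolution is still blocking for brevity; in the proxy it would need to happen off the packet path):

use std::collections::HashMap;
use std::net::{SocketAddr, ToSocketAddrs};
use std::time::{Duration, Instant};

// Hypothetical TTL'd hostname -> IP registry, analogous to the session map idea.
struct DnsCache {
    ttl: Duration,
    entries: HashMap<String, (SocketAddr, Instant)>,
}

impl DnsCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    // Return the cached address while it is fresh, otherwise re-resolve and
    // refresh the entry so subsequent packets skip the resolver entirely.
    fn resolve(&mut self, host: &str) -> std::io::Result<SocketAddr> {
        if let Some((addr, resolved_at)) = self.entries.get(host) {
            if resolved_at.elapsed() < self.ttl {
                return Ok(*addr);
            }
        }
        let addr = host
            .to_socket_addrs()?
            .next()
            .ok_or_else(|| std::io::Error::new(std::io::ErrorKind::NotFound, "no address"))?;
        self.entries.insert(host.to_owned(), (addr, Instant::now()));
        Ok(addr)
    }
}

fn main() -> std::io::Result<()> {
    let mut cache = DnsCache::new(Duration::from_secs(30));
    // Only the first call within the TTL pays the resolver cost.
    println!("{}", cache.resolve("localhost:7654")?);
    Ok(())
}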

markmandel commented 1 year ago

Yeah, I agree - also this should all happen in the proxy, since a hostname could come from anywhere - static configuration, xDS, wherever. So it makes sense to do it where it lands, not at a level above.

So I disagree with my original idea as well 😁

andersroos commented 1 year ago

Hi, this is the workaround we are using for now: https://github.com/goalsgame/quilkin/pull/1/files

luckyswede commented 1 year ago

@XAMPPRocky

I think the solution is to have a registry of hostnames to IPs that is TTL'd, similar to the session map, so that the resolution is cached for frequent use, but the value is refreshed once enough time has passed.

How come this code path has an impact on the proxy latency at all? If we have 1000 or more gameserver instances, that will impact the proxy latency as well, right?

XAMPPRocky commented 1 year ago

How come this code path has an impact on the proxy latency at all?

Well, you need to resolve a DNS name to an IP address to figure out where you're sending the packets. Currently it's having quite an impact because we're relying on the OS to resolve and cache the DNS on each worker. We just need to change it so that the resolution is cached in our program. However, even with that optimisation (which I think will help a lot), using IP addresses will always be faster than DNS because it's simply one less step to do.

If you want optimal routing, ideally your upstream is a set of stable IPv6 addresses, one for each of the servers; that would provide the lowest overhead to time-to-first-packet. Of course that's not always possible, which is why DNS is available.

andersroos commented 1 year ago

Currently it's having quite an impact because we're relying on the OS to resolve and cache the DNS on each worker.

I think the question is why this prevents the proxying of requests from happening. Shouldn't this be done in parallel with proxying, before saving the new mapping from token to IP in a map? Or doesn't that work for some reason?

XAMPPRocky commented 1 year ago

I think the question is why this prevents the proxying of requests from happening. Shouldn't this be done in parallel with proxying, before saving the new mapping from token to IP in a map? Or doesn't that work for some reason?

The issue is that the tokio API for resolving DNS entries is actually a blocking API, so another improvement we can add here is to move the resolution to be handled asynchronously.
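
A rough sketch of what that could look like (made-up types, assuming tokio): resolve endpoints when a cluster update arrives, rather than on the packet path, so the workers only ever see already-resolved socket addresses.

use std::net::SocketAddr;
use tokio::net::lookup_host;

// Hypothetical shape of a cluster update from the management server; the
// endpoint addresses may be hostnames rather than IPs.
struct ClusterUpdate {
    endpoints: Vec<String>,
}

// Resolve every endpoint up front when the update arrives. tokio's
// `lookup_host` pushes the underlying blocking OS resolution onto a
// background thread, so awaiting it here does not stall packet processing.
async fn resolve_update(update: ClusterUpdate) -> Vec<SocketAddr> {
    let mut resolved = Vec::with_capacity(update.endpoints.len());
    for endpoint in &update.endpoints {
        match lookup_host(endpoint.as_str()).await {
            Ok(mut addrs) => {
                if let Some(addr) = addrs.next() {
                    resolved.push(addr);
                }
            }
            Err(err) => eprintln!("failed to resolve {endpoint}: {err}"),
        }
    }
    resolved
}

#[tokio::main]
async fn main() {
    let update = ClusterUpdate {
        endpoints: vec!["localhost:7654".to_owned()],
    };
    println!("resolved endpoints: {:?}", resolve_update(update).await);
}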

andersroos commented 1 year ago

so another improvement we can add here is to move the resolution to be handled asynchronously

Yeah, sounds like a good plan. So could a huge list of game servers in an update (with IPs already resolved) have a negative impact on proxy performance too? We can test this ofc.

XAMPPRocky commented 1 year ago

@andersroos Would you be able to try the image from #721? It should no longer result in high latency.

us-docker.pkg.dev/quilkin/ci/quilkin:0.6.0-dev-d95ab70
andersroos commented 1 year ago

@XAMPPRocky Yes, will get back to you with the result.

andersroos commented 1 year ago

@XAMPPRocky We get much lower p99.9 latency (2.5ms) with this PR compared to main (~100ms) with 5 game servers, so all good.

Our patched code with the DNS lookup in the manager has even lower p99.9 latency (around 1ms), but I think 2.5ms will be fine for gameplay.

XAMPPRocky commented 1 year ago

@andersroos I've added a TTL cache on DNS names; try the new image from that same PR once it builds and let me know if that improves or worsens the latency.


us-docker.pkg.dev/quilkin/ci/quilkin:0.6.0-dev-f62668c
luckyswede commented 1 year ago

@XAMPPRocky I've done a quick smoke test, and the TTL cache improved things.

I was running 3 concurrent clients, each running 100 ping/s for 3000 messages (fortio load -c 3 -qps 300 -n 9000)

direct to gs

# target 99% 0.000992007
# target 99.9% 0.000999807

via quilkin, xds provisioned 1 endpoint in cluster

# target 99% 0.000992776
# target 99.9% 0.000999912

via quilkin, xds provisioned 5 endpoints in cluster

# target 99% 0.000993136
# target 99.9% 0.001

via quilkin, xds provisioned 10 endpoints in cluster

# target 99% 0.000994109
# target 99.9% 0.00207945

via quilkin, xds provisioned 20 endpoints in cluster

# target 99% 0.000993159
# target 99.9% 0.002