Environment: K8s 1.25.6-gke.1000, image: `us-docker.pkg.dev/quilkin/ci/quilkin:0.6.0-dev-6b24e2e`
🤔 that seems odd. How are you deploying Quilkin in front of your GameServer? Can you show what the topology is?
The topology is routing-token based, i.e. a shared set of proxies in front of a pool of gameservers. But they are all running in the same cluster.
If we fix up https://github.com/googleforgames/quilkin/pull/657 that would also help, as then you would be able to see how much time is being spent in the actual program.
> The topology is routing-token based, i.e. a shared set of proxies in front of a pool of gameservers. But they are all running in the same cluster.
Got it - so it's going from Service to GameServer when routed through Quilkin? (Vs direct through a hostPort I assume when connecting directly)?
When looking at the metrics for Quilkin, what sort of processing time are you seeing there?
I'm also wondering if the initial UDP connection via the Service is the 100ms? I assume you're running multiple clients?
> Got it - so it's going from Service to GameServer when routed through Quilkin? (Vs direct through a hostPort I assume when connecting directly)?
I'm not sure I follow. In these tests I have been going directly to the Quilkin pod IP or the GameServer pod IP.
Here is a graph for when I run 3 clients in parallel, 120 pings/second each, 10000 messages each. The latency numbers were (in seconds):

```
# target 50% 0.000620712
# target 75% 0.000816828
# target 90% 0.000934497
# target 99% 0.120323
# target 99.9% 0.172
```
Here is a graph of processing time for this run:
Btw, what does the `quilkin_packets_processing_duration_seconds_bucket` metric measure? How do I best display it with PromQL?
Another run I did looks like below, where I had a peak in the middle of the run:
Oh, it's the label `read` that is significantly higher than `write`.

`quilkin_packets_processing_duration_seconds` is a histogram.
This is how we show it in our example grafana dashboards:
You can see a sample dashboard here: https://github.com/googleforgames/quilkin/blob/main/examples/grafana-dashboards/quilkin-general.png
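For what it's worth, the query behind that kind of panel is typically of this shape (a generic sketch on my part, not necessarily the exact query baked into the dashboard):

```
histogram_quantile(
  0.99,
  sum by (le) (rate(quilkin_packets_processing_duration_seconds_bucket[$__rate_interval]))
)
```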
The bottom-right graph is processing time, which matches my usual experience - around 500µs at p99.
> The bottom-right graph is processing time, which matches my usual experience - around 500µs at p99.
Could what I experience have to do with the token routing? Or are there any config parameters regarding pool sizes, buffers, etc that I should tweak?
Can you run

```
histogram_quantile(0.99, sum(rate(quilkin_session_duration_secs_bucket[$__rate_interval])) by (le))
```

so we can see the histogram breakdown for the 99th percentile, please?
Well, for the time period around the last graph, this is all I have. It doesn't make much sense to me :)
An interesting observation I made today is that since I started running more tests via Quilkin, I regularly get liveness probe failures in the logs:

```
Liveness probe failed: Get "http://10.4.135.2:9091/live": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
```

These are `kubelet` logs.
Can you do me a favour - since pasting PromQL back and forth isn't quite working out, can you install our example dashboard on a Grafana instance? It has a performance graph baked in.
https://github.com/googleforgames/quilkin/tree/main/examples/grafana-dashboards
Also, do you see any errors in the Quilkin logs, any restarts in the pods, anything like that?
I'll do it first thing next week, and get back to you.
Regarding errors, the only one I see is the liveness probe error I wrote about in my last message.
@markmandel here are a couple of screenshots from that dashboard. The first one covers the times when I was running the tests before reporting this issue, the second one shows an overview of the day after, when I was running other tests.
FWIW, I do also occasionally see similar latency spikes. However, they're not of the same order of magnitude: most of the time processing takes 250 microseconds, with occasional spikes to 500 microseconds. Can you share your filter configuration? That would be helpful in diagnosing what's causing the spikes.
@XAMPPRocky here is my filter:
```yaml
version: v1alpha1
filters:
  - name: quilkin.filters.capture.v1alpha1.Capture
    config:
      suffix:
        size: 7
        remove: true
  - name: quilkin.filters.token_router.v1alpha1.TokenRouter
```
And `proxy:9091/config` typically returns:
```json
{
  "admin": {
    "address": "0.0.0.0:9091"
  },
  "clusters": {
    "default": {
      "localities": [
        {
          "locality": null,
          "endpoints": [
            {
              "address": "xyz.europe-west1-d.c.goals-gameservers.internal:9542",
              "metadata": { "quilkin.dev": { "tokens": ["TUVhVnpoNw=="] } }
            },
            {
              "address": "xyz.europe-west1-d.c.goals-gameservers.internal:10488",
              "metadata": { "quilkin.dev": { "tokens": ["SEJYM2RwQQ=="] } }
            },
            {
              "address": "xyz.europe-west1-d.c.goals-gameservers.internal:22651",
              "metadata": { "quilkin.dev": { "tokens": ["NzNva0pQbg=="] } }
            },
            {
              "address": "xyz.europe-west1-c.c.goals-gameservers.internal:7151",
              "metadata": { "quilkin.dev": { "tokens": ["S2pOOVg3OA=="] } }
            },
            {
              "address": "xyz.europe-west1-c.c.goals-gameservers.internal:7183",
              "metadata": { "quilkin.dev": { "tokens": ["SlR5WnYzQw=="] } }
            },
            {
              "address": "xyz.europe-west1-c.c.goals-gameservers.internal:8491",
              "metadata": { "quilkin.dev": { "tokens": ["YWJjMDEyNA=="] } }
            },
            {
              "address": "xyz.europe-west1-c.c.goals-gameservers.internal:8696",
              "metadata": { "quilkin.dev": { "tokens": ["WktyVFgxaA=="] } }
            },
            {
              "address": "xyz.europe-west1-c.c.goals-gameservers.internal:8799",
              "metadata": { "quilkin.dev": { "tokens": ["eHphYjNDeg=="] } }
            },
            {
              "address": "xyz.europe-west1-c.c.goals-gameservers.internal:9118",
              "metadata": { "quilkin.dev": { "tokens": ["WG9zUHZWZA=="] } }
            },
            {
              "address": "xyz.europe-west1-c.c.goals-gameservers.internal:12717",
              "metadata": { "quilkin.dev": { "tokens": ["WU5pd29yag=="] } }
            },
            {
              "address": "xyz.europe-west1-c.c.goals-gameservers.internal:15287",
              "metadata": { "quilkin.dev": { "tokens": ["NE1uaG11ZA=="] } }
            },
            {
              "address": "xyz.europe-west1-c.c.goals-gameservers.internal:17967",
              "metadata": { "quilkin.dev": { "tokens": ["YWJjMDEyMw=="] } }
            },
            {
              "address": "xyz.europe-west1-c.c.goals-gameservers.internal:19126",
              "metadata": { "quilkin.dev": { "tokens": ["YWJjMDEyNQ=="] } }
            },
            {
              "address": "xyz.europe-west1-b.c.goals-gameservers.internal:10418",
              "metadata": { "quilkin.dev": { "tokens": ["aWtSdzVFdA=="] } }
            },
            {
              "address": "xyz.europe-west1-b.c.goals-gameservers.internal:12666",
              "metadata": { "quilkin.dev": { "tokens": ["aDFGdzlNTg=="] } }
            }
          ]
        }
      ]
    }
  },
  "filters": [
    {
      "name": "quilkin.filters.capture.v1alpha1.Capture",
      "config": {
        "metadataKey": "quilkin.dev/capture",
        "suffix": { "size": 7, "remove": true }
      }
    },
    {
      "name": "quilkin.filters.token_router.v1alpha1.TokenRouter",
      "config": null
    }
  ],
  "id": "quilkin-proxies-84844c979b-wzrdg",
  "management_servers": [
    { "address": "http://quilkin-manage-agones:80" }
  ],
  "port": 7000,
  "version": "v1alpha1",
  "maxmind_db": null
}
```
I have tested two other setups in the same cluster, towards the same gameservers as before.
With Quilkin configured to simply forward directly to the gameserver (`proxy --to 10.4.146.136:7654`), I get the following latencies (seconds):
```
# target 50% 0.000594626
# target 75% 0.000749476
# target 90% 0.000842386
# target 99% 0.000898132
# target 99.9% 0.000903707
```
And with quilkin configured with token routing, but using a static config without the xDS server:
```yaml
filters:
  - name: quilkin.filters.capture.v1alpha1.Capture
    config:
      suffix:
        size: 7
        remove: true
  - name: quilkin.filters.token_router.v1alpha1.TokenRouter
clusters:
  default:
    localities:
      - endpoints:
          - address: 10.4.146.136:7654
            metadata:
              quilkin.dev:
                tokens:
                  - YWJjMDEyMw==
```
I get the following latencies:
```
# target 50% 0.000735039
# target 75% 0.00088105
# target 90% 0.000968656
# target 99% 0.00286538
# target 99.9% 0.00347547
```
I.e., from my perspective the occasional high latencies seem to be related to the management server integration.
@markmandel @XAMPPRocky please note that I am running the build from https://github.com/googleforgames/quilkin/pull/683, if that makes a difference
Thanks for that! That's really comprehensive! Appreciate the efforts.
I'm at GDC this week, but will be keen to look into this on my return (unless @XAMPPRocky beats me to it).
Do you have a test suite you are running? I would like to be able to replicate this locally.
As @XAMPPRocky mentioned, #657 may need to be finished before we can see what is causing those latency spikes. I'm guessing it's a lock somewhere being annoying.
My latest findings:
Running on a setup where the proxy gets its config from the xDS server, and the xDS server uses a static config from file, gives these results:
```
$> fortio load -c 3 -qps 360 -n 9000 -payload abc0123abc0123 udp://10.4.128.9:7000
# target 50% 0.00062651
# target 75% 0.000813421
# target 90% 0.000925568
# target 99% 0.000992856
# target 99.9% 0.000999585
```
When using the same setup except I configure the xds server to read from Agones, I get these numbers:
```
$> fortio load -c 3 -qps 360 -n 9000 -payload abc0123abc0123 udp://10.4.128.9:7000
# target 50% 0.000617083
# target 75% 0.000815371
# target 90% 0.000934344
# target 99% 0.0378431
# target 99.9% 0.058
```
So, something fishy is going on when the management server gets its config from Agones. It's a bit weird that it leaks into the proxy, almost as if a blocking call to the management server is sometimes made during a read/write operation in the proxy. Are endpoints cached in a read-through cache, maybe?
I use https://github.com/fortio/fortio for testing.
@markmandel if I change 500 -> 5000 here: https://github.com/googleforgames/quilkin/blob/main/src/xds/client.rs#L288 I get really good p99 latency:
```
# target 50% 0.000617077
# target 75% 0.000810083
# target 90% 0.000925886
# target 99% 0.000995368
# target 99.9% 0.045
```
So the problem is somewhere in that loop :)
Found the issue @markmandel and @XAMPPRocky :) The problem occurs when the xDS server returns hostnames as endpoint addresses, which causes the proxy to do an IP lookup on the hostname. Which takes time... We've verified with a local build where the xDS server converts to IP addresses immediately in the Agones integration, so the proxy always gets IP addresses.
A better fix would be to make the name->IP lookup async in the proxy somehow.
Oh wow! That's some serious investigative work!
Everything I do always uses IPs, so that would also explain why I never see this.
I wonder if it would work to convert the hostname to an IP in the xDS server before sending it to the proxy? 🤔
> I wonder if it would work to convert the hostname to an IP in the xDS server before sending it to the proxy?
Well, that would eliminate some of the benefit of using DNS for your upstream if we only check once, because part of the reason to use DNS is so you can have a single name that can be shared amongst multiple addresses, and maybe you can't rely on it always being the same address due to the cloud or ISP provider.
I think the solution is to have a registry of hostnames to IPs that is TTL'd, similar to the session map, so that the resolution is cached for frequent use, but if enough time has passed the value is refreshed.
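A rough sketch of that idea (illustrative only - not actual Quilkin code, and all the names here are made up):

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// Illustrative hostname -> IP registry with a TTL, so lookups are
/// cheap on the hot path but entries are re-resolved once they go stale.
struct DnsCache {
    ttl: Duration,
    entries: HashMap<String, (IpAddr, Instant)>,
}

impl DnsCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Returns the cached IP only if it is still within the TTL.
    fn get(&self, host: &str) -> Option<IpAddr> {
        self.entries
            .get(host)
            .filter(|(_, resolved_at)| resolved_at.elapsed() < self.ttl)
            .map(|(ip, _)| *ip)
    }

    /// Stores a freshly resolved IP, stamped with the current time.
    fn insert(&mut self, host: String, ip: IpAddr) {
        self.entries.insert(host, (ip, Instant::now()));
    }
}
```

The session-map analogy would hold: check `get` on the hot path, and only fall back to a real DNS resolution (followed by `insert`) on a miss or expiry.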
Yeah, I agree - also this should all happen in the proxy, since a hostname could come from anywhere - static configuration, xDS, wherever. So it makes sense to do it where it lands, not at a level above.
So I disagree with my original idea as well 😁
Hi, this is the workaround we are using for now: https://github.com/goalsgame/quilkin/pull/1/files
@XAMPPRocky

> I think the solution is to have a registry of hostnames to IPs that is TTL'd, similar to the session map, so that the resolution is cached for frequent use, but if enough time has passed the value is refreshed.
How come this code path has an impact on the proxy latency at all? And if we have >1000 gameserver instances or more, that will impact the proxy latency as well, right?
> How come this code path has an impact on the proxy latency at all?
Well, you need to resolve a DNS address to an IP address to figure out where you're sending the packets. Currently it's impacting quite a bit because we're relying on the OS to resolve and cache the DNS on each worker. We just need to change it so that the resolution is cached in our program. However, even with that optimisation (which I think will help a lot), using IP addresses will always be faster than DNS, because it's simply one less step to do.
If you want optimal routing, ideally your upstream is going to be a set of stable IPv6 addresses, one for each of the servers; that would provide the lowest overhead for time to first packet. Of course that's not always possible, which is why DNS is available.
> Currently it's impacting quite a bit because we're relying on the OS to resolve and cache the DNS on each worker.
I think the question is why this is preventing the proxying of requests from happening. Shouldn't it be done in parallel with proxying, before saving the new token-to-IP mapping in a map? Or doesn't that work for some reason?
> I think the question is why this is preventing the proxying of requests from happening. Shouldn't it be done in parallel with proxying, before saving the new token-to-IP mapping in a map? Or doesn't that work for some reason?
The issue is that the `tokio` API for resolving DNS entries is actually a blocking API, so another improvement we can add here is to move the resolution to be handled asynchronously.
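As a minimal sketch of what "handled asynchronously" could look like, assuming a tokio runtime (illustrative names only, not the actual patch):

```rust
use std::net::{SocketAddr, ToSocketAddrs};

/// Illustrative sketch: run the blocking std resolver on tokio's
/// dedicated blocking thread pool, so a slow DNS query stalls a
/// blocking-pool thread instead of the async workers proxying packets.
async fn resolve_async(host_port: String) -> std::io::Result<Option<SocketAddr>> {
    tokio::task::spawn_blocking(move || {
        host_port.as_str().to_socket_addrs().map(|mut addrs| addrs.next())
    })
    .await
    .expect("DNS resolver task panicked")
}
```

Combined with a TTL'd cache like the registry sketched above, the blocking-pool hop would only be paid on a cache miss.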
> so another improvement we can add here is to move the resolution to be handled asynchronously
Yeah, sounds like a good plan. So could a huge list of game servers in an update (with IPs already resolved) have a negative impact on proxy performance too? We can test this ofc.
@andersroos Would you be able to try the image from #721? It should no longer result in high latency.
`us-docker.pkg.dev/quilkin/ci/quilkin:0.6.0-dev-d95ab70`
@XAMPPRocky Yes, will get back to you with the result.
@XAMPPRocky We get much lower p99.9 latency (2.5ms) with this PR compared to main (~100ms) with 5 game servers, so all good.
Our patched code with the DNS lookup in the manager has even lower p99.9 latency (around 1ms), but I think 2.5 will be fine for gameplay.
@andersroos I've added a TTL cache on DNS names, try the new image from that same PR once it builds and let me know if that improves or worsens the latency.
`us-docker.pkg.dev/quilkin/ci/quilkin:0.6.0-dev-f62668c`
@XAMPPRocky I've done a quick smoke test, and the TTL cache improved things.
I was running 3 concurrent clients, each running 100 ping/s for 3000 messages (`fortio load -c 3 -qps 300 -n 9000`).
direct to gs

```
# target 99% 0.000992007
# target 99.9% 0.000999807
```

via quilkin, xds provisioned 1 endpoint in cluster

```
# target 99% 0.000992776
# target 99.9% 0.000999912
```

via quilkin, xds provisioned 5 endpoints in cluster

```
# target 99% 0.000993136
# target 99.9% 0.001
```

via quilkin, xds provisioned 10 endpoints in cluster

```
# target 99% 0.000994109
# target 99.9% 0.00207945
```

via quilkin, xds provisioned 20 endpoints in cluster

```
# target 99% 0.000993159
# target 99.9% 0.002
```
Hi,
I'm doing some simple latency tests of ping messages via Quilkin and I see high p99 response times (histograms below). I'm using Fortio to run the load tests, from a pod in the same K8s cluster as the proxy and the gameserver. When I go directly to the gameserver I get a p99 of 1ms, and this number is stable over multiple runs. When I run the same test via Quilkin, p90 is good but p99 is around 100ms; this also shows over multiple runs.
The CPU usage is low, and there are no reported network errors or anything else that indicates any issues.
Test run via proxy:
Test run directly towards the gameserver: