Open markmandel opened 9 months ago
On a failure, we do see the following:
❯ kubectl get pods -n 1701124569
NAME READY STATUS RESTARTS AGE
quilkin-agones-agent-6fddc959b5-mklss 0/1 Running 0 3m38s
quilkin-relay-agones-5bb9df5765-kp4ks 1/1 Running 0 3m42s
Checking the logs for quilkin-agones-agent-6fddc959b5-mklss
we see the following:
❯ kubectl logs -n 1701124569 quilkin-agones-agent-6fddc959b5-mklss
{"timestamp":"2023-11-27T22:36:14.427641Z","level":"INFO","fields":{"message":"Starting Quilkin","version":"0.8.0-dev"},"target":"quilkin::cli","filename":"src/cli.rs"}
{"timestamp":"2023-11-27T22:36:14.427716Z","level":"DEBUG","fields":{"message":"config parameters","cli":"Cli { no_admin: false, config: \"quilkin.yaml\", admin_address: None, quiet: false, command: Agent(Agent { qcmp_port: 7600, relay: [Endpoint], region: None, zone: None, sub_zone: None, provider: Some(Agones { config_namespace: \"1701124569\", gameservers_namespace: \"1701124569\" }), idle_request_interval_secs: 30 }), log_format: Auto }"},"target":"quilkin::cli","filename":"src/cli.rs"}
{"timestamp":"2023-11-27T22:36:14.427807Z","level":"DEBUG","fields":{"message":"provided path not found","path":"quilkin.yaml"},"target":"quilkin::cli","filename":"src/cli.rs"}
{"timestamp":"2023-11-27T22:36:14.427832Z","level":"DEBUG","fields":{"message":"/etc path not found","path":"quilkin.yaml"},"target":"quilkin::cli","filename":"src/cli.rs"}
{"timestamp":"2023-11-27T22:36:14.428180Z","level":"INFO","fields":{"message":"Starting admin endpoint","address":"[::]:8000"},"target":"quilkin::cli::admin","filename":"src/cli/admin.rs"}
{"timestamp":"2023-11-27T22:36:14.428566Z","level":"INFO","fields":{"message":"attempting to connect to `http://quilkin-relay-agones:7900/`"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs","span":{"name":"client_connect"},"spans":[{"servers":"[Endpoint]","name":"connect"},{"name":"client_connect"}]}
{"timestamp":"2023-11-27T22:36:14.439625Z","level":"DEBUG","fields":{"message":"received restart event from k8s"},"target":"quilkin::config::providers::k8s","filename":"src/config/providers/k8s.rs","span":{"name":"task"},"spans":[{"name":"task"}]}
{"timestamp":"2023-11-27T22:36:14.439683Z","level":"TRACE","fields":{"message":"Restarting with endpoints","endpoints":"[]"},"target":"quilkin::config::providers::k8s","filename":"src/config/providers/k8s.rs","span":{"name":"task"},"spans":[{"name":"task"}]}
{"timestamp":"2023-11-27T22:36:14.439701Z","level":"DEBUG","fields":{"message":"changed detected","watchers":0},"target":"quilkin::config::watch","filename":"src/config/watch.rs","span":{"name":"task"},"spans":[{"name":"task"}]}
{"timestamp":"2023-11-27T22:36:14.443570Z","level":"TRACE","fields":{"message":"new configmap event"},"target":"quilkin::config::providers::k8s","filename":"src/config/providers/k8s.rs","span":{"name":"task"},"spans":[{"name":"task"}]}
{"timestamp":"2023-11-27T22:36:14.444053Z","level":"TRACE","fields":{"message":"storing new value"},"target":"quilkin::config::slot","filename":"src/config/slot.rs","span":{"name":"task"},"spans":[{"name":"task"}]}
{"timestamp":"2023-11-27T22:36:19.429223Z","level":"INFO","fields":{"message":"Retrying to connect","attempt":1},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs","span":{"name":"client_connect"},"spans":[{"servers":"[Endpoint]","name":"connect"},{"name":"client_connect"}]}
{"timestamp":"2023-11-27T22:36:19.429304Z","level":"WARN","fields":{"message":"Unable to connect to the XDS server","error":"tonic::transport::Error(Transport, hyper::Error(Connect, Custom { kind: TimedOut, error: Elapsed(()) }))"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs","span":{"name":"client_connect"},"spans":[{"servers":"[Endpoint]","name":"connect"},{"name":"client_connect"}]}
{"timestamp":"2023-11-27T22:36:21.476827Z","level":"INFO","fields":{"message":"attempting to connect to `http://quilkin-relay-agones:7900/`"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs","span":{"name":"client_connect"},"spans":[{"servers":"[Endpoint]","name":"connect"},{"name":"client_connect"}]}
{"timestamp":"2023-11-27T22:36:21.479668Z","level":"INFO","fields":{"message":"Connected to management server"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs","span":{"servers":"[Endpoint]","name":"connect"},"spans":[{"servers":"[Endpoint]","name":"connect"}]}
{"timestamp":"2023-11-27T22:36:21.479730Z","level":"TRACE","fields":{"message":"spawning stream background task"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs"}
{"timestamp":"2023-11-27T22:36:21.479808Z","level":"TRACE","fields":{"message":"starting relay client stream task"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs","span":{"name":"handle_discovery_response"},"spans":[{"name":"handle_discovery_response"}]}
{"timestamp":"2023-11-27T22:36:21.480924Z","level":"TRACE","fields":{"message":"Adding new watcher"},"target":"quilkin::config::slot","filename":"src/config/slot.rs","span":{"name":"handle_discovery_response"},"spans":[{"name":"handle_discovery_response"}]}
{"timestamp":"2023-11-27T22:36:21.480957Z","level":"TRACE","fields":{"message":"starting stream"},"target":"quilkin::net::xds::server","filename":"src/net/xds/server.rs","span":{"name":"handle_discovery_response"},"spans":[{"name":"handle_discovery_response"}]}
{"timestamp":"2023-11-27T22:36:21.480984Z","level":"DEBUG","fields":{"message":"waiting for changes","watcher":"Receiver { shared: Shared { value: RwLock(PhantomData<std::sync::rwlock::RwLock<alloc::sync::Arc<quilkin::net::cluster::ClusterMap>>>, RwLock { data: ClusterMap { map: {None: EndpointSet { endpoints: {}, version: 1 }}, num_endpoints: 0, version: 1 } }), version: Version(2), is_closed: false, ref_count_rx: 1 }, version: Version(2) }"},"target":"quilkin::net::xds::server","filename":"src/net/xds/server.rs"}
Which shows that while it did fail to connect to the relay service, it eventually connected, which means it should have kicked over to being ready 🤔
Also, just in case I need it again, here's my test rig. Insert your own dev image.
#!/bin/bash
set -eo pipefail
export RUST_BACKTRACE=1
export IMAGE_TAG=us-docker.pkg.dev/quilkin-mark-dev/release/quilkin:0.8.0-dev-2be3696
for i in {1..100}
do
echo "🔥 Test Run: $i"
cargo test --color=always --lib relay::tests::agones_token_router -- --nocapture --exact
done
Put some extra debugging in agent.rs so we could see which readiness health check is failing:
❯ kubectl logs -n 1701128813 quilkin-agones-agent-846b858654-bwvk6
{"timestamp":"2023-11-27T23:46:58.251392Z","level":"INFO","fields":{"message":"Starting Quilkin","version":"0.8.0-dev"},"target":"quilkin::cli","filename":"src/cli.rs"}
{"timestamp":"2023-11-27T23:46:58.251463Z","level":"DEBUG","fields":{"message":"config parameters","cli":"Cli { no_admin: false, config: \"quilkin.yaml\", admin_address: None, quiet: false, command: Agent(Agent { qcmp_port: 7600, relay: [Endpoint], region: None, zone: None, sub_zone: None, provider: Some(Agones { config_namespace: \"1701128813\", gameservers_namespace: \"1701128813\" }), idle_request_interval_secs: 30 }), log_format: Auto }"},"target":"quilkin::cli","filename":"src/cli.rs"}
{"timestamp":"2023-11-27T23:46:58.251548Z","level":"DEBUG","fields":{"message":"provided path not found","path":"quilkin.yaml"},"target":"quilkin::cli","filename":"src/cli.rs"}
{"timestamp":"2023-11-27T23:46:58.251584Z","level":"DEBUG","fields":{"message":"/etc path not found","path":"quilkin.yaml"},"target":"quilkin::cli","filename":"src/cli.rs"}
{"timestamp":"2023-11-27T23:46:58.252020Z","level":"INFO","fields":{"message":"Starting admin endpoint","address":"[::]:8000"},"target":"quilkin::cli::admin","filename":"src/cli/admin.rs"}
{"timestamp":"2023-11-27T23:46:58.252455Z","level":"INFO","fields":{"message":"attempting to connect to `http://quilkin-relay-agones:7900/`"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs","span":{"name":"client_connect"},"spans":[{"servers":"[Endpoint]","name":"connect"},{"name":"client_connect"}]}
{"timestamp":"2023-11-27T23:46:58.256881Z","level":"INFO","fields":{"message":"Connected to management server"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs","span":{"servers":"[Endpoint]","name":"connect"},"spans":[{"servers":"[Endpoint]","name":"connect"}]}
{"timestamp":"2023-11-27T23:46:58.256948Z","level":"TRACE","fields":{"message":"spawning stream background task"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs"}
{"timestamp":"2023-11-27T23:46:58.257040Z","level":"TRACE","fields":{"message":"starting relay client stream task"},"target":"quilkin::net::xds::client","filename":"src/net/xds/client.rs","span":{"name":"handle_discovery_response"},"spans":[{"name":"handle_discovery_response"}]}
{"timestamp":"2023-11-27T23:46:58.258228Z","level":"TRACE","fields":{"message":"Adding new watcher"},"target":"quilkin::config::slot","filename":"src/config/slot.rs","span":{"name":"handle_discovery_response"},"spans":[{"name":"handle_discovery_response"}]}
{"timestamp":"2023-11-27T23:46:58.258276Z","level":"TRACE","fields":{"message":"starting stream"},"target":"quilkin::net::xds::server","filename":"src/net/xds/server.rs","span":{"name":"handle_discovery_response"},"spans":[{"name":"handle_discovery_response"}]}
{"timestamp":"2023-11-27T23:46:58.258318Z","level":"DEBUG","fields":{"message":"waiting for changes","watcher":"Receiver { shared: Shared { value: RwLock(PhantomData<std::sync::rwlock::RwLock<alloc::sync::Arc<quilkin::net::cluster::ClusterMap>>>, RwLock { data: ClusterMap { map: {}, num_endpoints: 0, version: 0 } }), version: Version(0), is_closed: false, ref_count_rx: 1 }, version: Version(0) }"},"target":"quilkin::net::xds::server","filename":"src/net/xds/server.rs"}
{"timestamp":"2023-11-27T23:46:58.260993Z","level":"DEBUG","fields":{"message":"received restart event from k8s"},"target":"quilkin::config::providers::k8s","filename":"src/config/providers/k8s.rs","span":{"name":"task"},"spans":[{"name":"task"}]}
{"timestamp":"2023-11-27T23:46:58.261049Z","level":"TRACE","fields":{"message":"Restarting with endpoints","endpoints":"[]"},"target":"quilkin::config::providers::k8s","filename":"src/config/providers/k8s.rs","span":{"name":"task"},"spans":[{"name":"task"}]}
{"timestamp":"2023-11-27T23:46:58.261066Z","level":"DEBUG","fields":{"message":"changed detected","watchers":1},"target":"quilkin::config::watch","filename":"src/config/watch.rs","span":{"name":"task"},"spans":[{"name":"task"}]}
{"timestamp":"2023-11-27T23:46:58.261210Z","level":"DEBUG","fields":{"message":"pushing update","resource_type":"type.googleapis.com/quilkin.config.v1alpha1.Cluster","watchers":1},"target":"quilkin::net::xds::server","filename":"src/net/xds/server.rs"}
{"timestamp":"2023-11-27T23:46:58.261232Z","level":"DEBUG","fields":{"message":"waiting for changes","watcher":"Receiver { shared: Shared { value: RwLock(PhantomData<std::sync::rwlock::RwLock<alloc::sync::Arc<quilkin::net::cluster::ClusterMap>>>, RwLock { data: ClusterMap { map: {None: EndpointSet { endpoints: {}, version: 1 }}, num_endpoints: 0, version: 1 } }), version: Version(2), is_closed: false, ref_count_rx: 1 }, version: Version(2) }"},"target":"quilkin::net::xds::server","filename":"src/net/xds/server.rs"}
{"timestamp":"2023-11-27T23:46:58.261856Z","level":"TRACE","fields":{"message":"new configmap event"},"target":"quilkin::config::providers::k8s","filename":"src/config/providers/k8s.rs","span":{"name":"task"},"spans":[{"name":"task"}]}
{"timestamp":"2023-11-27T23:46:58.262169Z","level":"TRACE","fields":{"message":"storing new value"},"target":"quilkin::config::slot","filename":"src/config/slot.rs","span":{"name":"task"},"spans":[{"name":"task"}]}
{"timestamp":"2023-11-27T23:46:58.262185Z","level":"TRACE","fields":{"message":"calling watcher"},"target":"quilkin::config::slot","filename":"src/config/slot.rs","span":{"name":"task"},"spans":[{"name":"task"}]}
{"timestamp":"2023-11-27T23:46:58.262192Z","level":"DEBUG","fields":{"message":"pushing update","resource_type":"type.googleapis.com/envoy.config.listener.v3.Listener","watchers":1},"target":"quilkin::net::xds::server","filename":"src/net/xds/server.rs","span":{"name":"task"},"spans":[{"name":"task"}]}
{"timestamp":"2023-11-27T23:47:01.636281Z","level":"TRACE","fields":{"message":"Agent health check","provider_healthy":"true","relay_healthy":"false"},"target":"quilkin::cli::agent","filename":"src/cli/agent.rs"}
{"timestamp":"2023-11-27T23:47:02.635294Z","level":"TRACE","fields":{"message":"Agent health check","provider_healthy":"true","relay_healthy":"false"},"target":"quilkin::cli::agent","filename":"src/cli/agent.rs"}
And it's RuntimeConfig.relay_is_healthy
is false!
But interestingly, this failure has no xDS connection issue on this failure, it something else.
Side note: Might be useful for health/readiness endpoints to return a json packet as to why they are unhealthy.
More debugging logging, current theory is that something is blocking in this block, before the code gets to here:
Next step is to narrow down what blocks before we can move to ready.
End of day, but can confirm, this line here blocks and never returns in some circumstances.
Will next dig into why that is, but @XAMPPRocky if you have suggestions also happy to take them.
May be related:
Apparently is flaky as well.
Will next dig into why that is, but @XAMPPRocky if you have suggestions also happy to take them.
I don't unfortunately, I can't replicate in prod, I think it might just be a timing thing with the threads.
No worries - digging in. At least I can reproduce it pretty reliably now.
May be related:
Apparently is flaky as well.
Just ran this unit test 300 times off main
and it passes... so I think this might be a red herring, and maybe only related to #873's implementation. Redirecting efforts back to quilkin/src/net/xds/client.rs
Okay, this is tricky.
Here's what a successful log looks like for an agent (with enhanced debugging from my branch): https://github.com/markmandel/quilkin/tree/flaky/agones_token_router
Seeing if I can identify what should be happening when it's not.
What I've seen in the the logs is usually stuff attempting to connect before a service is ready and then getting stuck in that.
Tracking bug for flakiness in the Agones integration test.
Test seems to consistently fail here, when it does fail: https://github.com/googleforgames/quilkin/blob/6e0adc902194155ec867a3fadd5bc984331f053b/agones/src/relay.rs#L312
Something about the relay agent not coming up as ready for some reason.