googleforgames / quilkin

Quilkin is a non-transparent UDP proxy specifically designed for use with large scale multiplayer dedicated game server deployments, to ensure security, access control, telemetry data, metrics and more.
Apache License 2.0

Quilkin Agent restarts if an Agones Gameserver CRD is incorrect/corrupted #1005

Open daniellee opened 3 weeks ago

daniellee commented 3 weeks ago

What happened:

The Quilkin Agent failed to parse the list of gameserver specs from Agones:

{"timestamp":"2024-xxx","level":"WARN","fields":{"message":"provider task error, retrying","attempt":"3","error":"failed to perform initial object list: Error deserializing response: missing field `addresses` at line 1 column 165491"},"target":"quilkin::config::providers","filename":"src/config/providers.rs","threadId":"ThreadId(2)"}

It then retries 10 times:

{"timestamp":"2024-xxx","level":"WARN","fields":{"message":"provider task error, retrying","attempt":"9","error":"failed to perform initial object list: Error deserializing response: missing field `addresses` at line 1 column 83075"},"target":"quilkin::config::providers","filename":"src/config/providers.rs","threadId":"ThreadId(2)"}

After that, the agent restarts, which is quite disruptive for the gameserver cluster because new games cannot connect to a Quilkin proxy during the downtime.

What you expected to happen:

The Quilkin Agent should handle Agones gameserver specs that are missing fields, whether because they come from an older version or because they were corrupted (for example, by a service directly changing the spec in a way that makes it invalid). It should not restart repeatedly because of one invalid spec out of hundreds.

How to reproduce it (as minimally and precisely as possible):

kubectl patch -v=8 gameserver mygame-111aa-abc1a --type json -p '[{"op": "remove", "path": "/status/addresses"}]'

Anything else we need to know?:

Environment:

markmandel commented 3 weeks ago

So I've been poking at this for a couple of days... trying to work out how we could do this.

And I believe the issue is actually here, in kube-runtime:

https://github.com/kube-rs/kube/blob/3d2471bf674fd0c0bcb148dcdfa59aa79e4cb63b/kube-runtime/src/watcher.rs#L604-L621

Which I'm reading as: if you get an invalid GameServer on initiation of the watch operation, the watch operation won't start at all. Does that align with what you are seeing, or do you see other GameServer changes still being observed?
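To make that reading concrete, here is a minimal sketch of all-or-nothing list decoding next to a tolerant variant. It is not kube-runtime's code: `RawGameServer`, `GameServer`, `decode_strict` and `decode_tolerant` are hypothetical names, and the `Option` field is a toy stand-in for a JSON object that may or may not carry `addresses`.

```rust
/// Toy stand-in for one raw object in the list response (hypothetical type).
pub struct RawGameServer {
    pub name: String,
    /// `None` models an object whose `addresses` field was removed.
    pub addresses: Option<Vec<String>>,
}

/// Strict model that requires `addresses` (hypothetical type).
pub struct GameServer {
    pub name: String,
    pub addresses: Vec<String>,
}

/// Decode one object, failing if the required field is absent.
fn decode(raw: &RawGameServer) -> Result<GameServer, String> {
    let addresses = raw
        .addresses
        .clone()
        .ok_or_else(|| format!("{}: missing field `addresses`", raw.name))?;
    Ok(GameServer { name: raw.name.clone(), addresses })
}

/// All-or-nothing: the first invalid object fails the whole list.
pub fn decode_strict(items: &[RawGameServer]) -> Result<Vec<GameServer>, String> {
    items.iter().map(decode).collect()
}

/// Tolerant: keep every valid object and report the invalid ones separately.
pub fn decode_tolerant(items: &[RawGameServer]) -> (Vec<GameServer>, Vec<String>) {
    let mut valid = Vec::new();
    let mut errors = Vec::new();
    for item in items {
        match decode(item) {
            Ok(gs) => valid.push(gs),
            Err(e) => errors.push(e),
        }
    }
    (valid, errors)
}
```

With one broken object among many, `decode_strict` returns an error for the entire list, while `decode_tolerant` still yields the healthy GameServers plus a list of errors that could be logged.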

We could look out for error type WatchStartFailed and ignore it, but that might be a problem, since there are valid reasons we'd want to restart when the initial watch operation falls over (the control plane being down, bad auth, etc.).

GameServers do have set models, and I think it's valid to expect those models, but maybe the answer here is if we have fields that are optional in the CRD that we have as required in our Rust model, we should account for that better.
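A rough sketch of what "accounting for that better" could look like, assuming `addresses` is one of the fields that may legitimately be absent. `GameServerStatus` here is a hypothetical, trimmed-down type, not Quilkin's actual model. With serde, the equivalent would typically be an `Option<...>` field or `#[serde(default)]` on the field.

```rust
/// Hypothetical, trimmed-down status model (not Quilkin's real struct).
#[derive(Debug, Clone, Default, PartialEq)]
pub struct GameServerStatus {
    /// `None` when the object carries no `addresses` field at all.
    pub addresses: Option<Vec<String>>,
}

impl GameServerStatus {
    /// Treat a missing field the same as an empty list, so callers
    /// never have to fail just because the field was absent.
    pub fn addresses(&self) -> &[String] {
        self.addresses.as_deref().unwrap_or_default()
    }
}
```

Callers then iterate over `status.addresses()` and simply see no entries for an object that lacks the field, instead of the whole list failing to deserialize.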

Or the other answer might be to go up to kube-runtime and have a special error type for deserialisation / model translation issues? I'm not entirely sure 🤔
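As an illustration only, this is roughly how a caller could behave if deserialisation failures were distinguishable from other start-up failures. `StartFailure`, `Action` and `action_for` are hypothetical names invented for this sketch; kube-runtime does not expose this enum.

```rust
/// Hypothetical classification of why a watch failed to start.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum StartFailure {
    /// A returned object did not match the Rust model.
    Deserialize(String),
    /// The API server rejected our credentials.
    Unauthorized,
    /// The control plane could not be reached.
    Unreachable,
}

/// What the provider task does in response (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    /// Log the bad object and keep serving the rest.
    LogAndKeepRunning,
    /// Retry, and restart once the retries are exhausted.
    RetryThenRestart,
}

/// Only model mismatches are tolerated; infrastructure failures
/// keep today's retry-and-restart behaviour.
pub fn action_for(failure: &StartFailure) -> Action {
    match failure {
        StartFailure::Deserialize(_) => Action::LogAndKeepRunning,
        StartFailure::Unauthorized | StartFailure::Unreachable => Action::RetryThenRestart,
    }
}
```

The point of the split is that a single malformed GameServer no longer looks the same as the control plane being down, so each case can get its own policy.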

Thoughts?