Quilkin Agent restarts if an Agones Gameserver CRD is incorrect/corrupted

What happened:

The Quilkin Agent failed to parse the list of gameserver specs from Agones:

{"timestamp":"2024-xxx","level":"WARN","fields":{"message":"provider task error, retrying","attempt":"3","error":"failed to perform initial object list: Error deserializing response: missing field `addresses` at line 1 column 165491"},"target":"quilkin::config::providers","filename":"src/config/providers.rs","threadId":"ThreadId(2)"}

and then it retries 10 times:

{"timestamp":"2024-xxx","level":"WARN","fields":{"message":"provider task error, retrying","attempt":"9","error":"failed to perform initial object list: Error deserializing response: missing field `addresses` at line 1 column 83075"},"target":"quilkin::config::providers","filename":"src/config/providers.rs","threadId":"ThreadId(2)"}

and after that the agent restarts, which is quite disruptive for the gameserver cluster because new games fail to connect to a Quilkin proxy during the downtime.

What you expected to happen:

The Quilkin Agent should handle Agones gameserver specs that are missing fields, whether because they come from an older version or because they have been corrupted (for example, a service changing the spec directly in a way that makes it invalid). It should not restart repeatedly because of one invalid spec out of hundreds.

How to reproduce it (as minimally and precisely as possible):

You need a running Kubernetes cluster with ready or allocated Agones gameservers and a Quilkin Agent.
Use kubectl to access the cluster and switch to the gameservers namespace where the gameserver pods are running.
Corrupt a gameserver spec by picking a running gameserver and running the following patch command to remove a field:

kubectl patch -v=8 gameserver mygame-111aa-abc1a --type json -p '[{"op": "remove", "path": "/status/addresses"}]'

Anything else we need to know?:

Environment:

Quilkin version: 0.9.0
Execution environment (binary, container, etc): container
Operating system: docker image
Custom filters? (Yes/No - if so, what do they do?):
Log(s):
Others:

So I've been poking at this for a couple of days... trying to work out how we could do this.

And I believe the issue is actually here, in kube-runtime:

https://github.com/kube-rs/kube/blob/3d2471bf674fd0c0bcb148dcdfa59aa79e4cb63b/kube-runtime/src/watcher.rs#L604-L621

Which I'm reading as - if you get an invalid GameServer on initiation of the watch operation, the watch operation won't start at all. Does that align with what you are seeing, or do you still see other GameServer changes being observed?

We could look out for the error type WatchStartFailed and ignore it, but that might be a problem, because there are valid reasons we'd want to restart when the initial watch operation falls over (the control plane being down, bad auth, etc).

GameServers do have set models, and I think it's valid to expect those models, but maybe the answer here is that where fields are optional in the CRD but required in our Rust model, we should account for that better.
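As a rough illustration of that first option, here is a minimal std-only sketch of a model that treats `addresses` as optional and skips incomplete objects instead of failing the whole list. `RawGameServer`, `Endpoint` and `to_endpoints` are hypothetical names made up for this example, not Quilkin's actual types, and the struct is heavily simplified compared to a real GameServer.

```rust
/// Hypothetical, simplified view of a GameServer as returned by the API.
/// A field that may be absent is modelled as `Option` rather than required.
#[derive(Debug, Clone)]
struct RawGameServer {
    name: String,
    // Absent on the object that triggers the error in the logs above.
    addresses: Option<Vec<String>>,
}

/// Hypothetical type for what the proxy side needs: a name and at least one address.
#[derive(Debug, Clone, PartialEq)]
struct Endpoint {
    name: String,
    addresses: Vec<String>,
}

/// Convert every usable GameServer and return the names of the ones skipped,
/// rather than rejecting the whole list because of one incomplete object.
fn to_endpoints(servers: &[RawGameServer]) -> (Vec<Endpoint>, Vec<String>) {
    let mut endpoints = Vec::new();
    let mut skipped = Vec::new();
    for gs in servers {
        match &gs.addresses {
            // Usable: the field is present and non-empty.
            Some(list) if !list.is_empty() => endpoints.push(Endpoint {
                name: gs.name.clone(),
                addresses: list.clone(),
            }),
            // Missing or empty: remember the name so it can be logged.
            _ => skipped.push(gs.name.clone()),
        }
    }
    (endpoints, skipped)
}
```

With serde, the comparable change would be declaring the field as `Option<...>` or marking it `#[serde(default)]`, so that a missing `addresses` no longer fails deserialization of the whole response. Whether skipping a GameServer without addresses is the right behaviour for the proxies is a separate design question.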

Or the other answer might be to go up to kube-runtime and have a special error type for deserialisation / model translation issues? I'm not entirely sure 🤔
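To make the second option a bit more concrete, this is a sketch of what a dedicated error category could let us do: treat model translation failures differently from failures where a restart is still warranted. `ProviderError`, `Action` and `classify` are hypothetical names for illustration only. They are not kube-runtime's or Quilkin's real types, and kube-runtime does not necessarily expose errors in this shape today.

```rust
/// Hypothetical error categories for the initial list / watch start.
#[derive(Debug)]
enum ProviderError {
    /// An object in the response could not be mapped onto the Rust model.
    Deserialize { detail: String },
    /// The API server could not be reached (e.g. control plane down).
    Transport { detail: String },
    /// Credentials were rejected.
    Unauthorized,
}

/// Hypothetical reactions the agent could take.
#[derive(Debug, PartialEq)]
enum Action {
    /// Log the bad object and keep serving the rest.
    SkipAndContinue,
    /// Keep the current behaviour: retry, then restart.
    RetryThenRestart,
}

/// Decide how to react based on the category of the error.
fn classify(err: &ProviderError) -> Action {
    match err {
        // A single malformed object should not take the agent down.
        ProviderError::Deserialize { .. } => Action::SkipAndContinue,
        // Infrastructure and auth problems still justify a restart.
        ProviderError::Transport { .. } | ProviderError::Unauthorized => {
            Action::RetryThenRestart
        }
    }
}
```

The point of the separate `Deserialize` variant is that the caller no longer has to guess from a generic "watch failed to start" error whether the cause was one bad object or a real outage.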

Thoughts?

googleforgames / quilkin

Quilkin Agent restarts if an Agones Gameserver CRD is incorrect/corrupted #1005