daniellee opened 3 weeks ago
So I've been poking at this for a couple of days, trying to work out how we could do this. And I believe the issue is actually here, in `kube-runtime`:
Which I'm reading as: if you get an invalid `GameServer` on initiation of the watch operation, the watch operation won't start at all. Does that align with what you are seeing, or do you see changes to other `GameServer`s still being observed?
We could look out for the error type `WatchStartFailed` and ignore it, but that might be a problem: there are valid reasons we'd want to restart when the initial watch operation falls over (the control plane being down might be one of them, or bad auth, etc.).
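To make the trade-off concrete, here is a minimal sketch of that "classify and skip" idea. The `WatchError` enum and `drain` function are hypothetical stand-ins, not kube-runtime's actual types: the point is only that per-object deserialisation failures get logged and skipped, while errors that genuinely warrant a restart (control plane down, bad auth) still propagate.

```rust
// Hypothetical error type standing in for kube-runtime's watch errors.
// `Deserialize` models a per-object model-translation failure;
// `WatchStartFailed` models the watch not starting at all.
#[derive(Debug)]
enum WatchError {
    Deserialize(String),
    WatchStartFailed(String),
}

// Only per-object deserialisation issues are safe to skip; anything
// else should still abort the watch so it can be restarted.
fn is_skippable(err: &WatchError) -> bool {
    matches!(err, WatchError::Deserialize(_))
}

// Drain a batch of watch events, keeping valid GameServer names and
// skipping only the errors we consider safe to ignore.
fn drain(events: Vec<Result<String, WatchError>>) -> Result<Vec<String>, WatchError> {
    let mut valid = Vec::new();
    for event in events {
        match event {
            Ok(gs) => valid.push(gs),
            Err(e) if is_skippable(&e) => eprintln!("skipping invalid object: {e:?}"),
            Err(e) => return Err(e), // control plane down, bad auth, etc.
        }
    }
    Ok(valid)
}

fn main() {
    let events = vec![
        Ok("gs-1".to_string()),
        Err(WatchError::Deserialize("missing field `ports`".to_string())),
        Ok("gs-2".to_string()),
    ];
    let valid = drain(events).expect("non-skippable error");
    println!("{valid:?}"); // both valid GameServers survive the bad one
}
```

The awkward part, as noted above, is that today both failure modes can surface through the same error path, so there is no clean place to put the `is_skippable` distinction.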
GameServers do have set models, and I think it's valid to expect those models, but maybe the answer here is: if there are fields that are optional in the CRD but required in our Rust model, we should account for that better.
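A sketch of what "accounting for that better" could look like in the model (the field names here are illustrative, not Agones' actual schema): fields that are optional in the CRD map to `Option<T>` rather than being required, so an absent field no longer fails the whole object. In the real serde-derived structs this would be `Option<T>` fields or `#[serde(default)]`; a dependency-free illustration of the same shape:

```rust
use std::collections::HashMap;

// Illustrative stand-in for a GameServer spec. In the real model these
// would be serde-derived structs, with CRD-optional fields as Option<T>
// or #[serde(default)].
#[derive(Debug, PartialEq)]
struct GameServerSpec {
    container: String,          // required in the CRD
    scheduling: Option<String>, // optional in the CRD
}

fn parse_spec(fields: &HashMap<&str, &str>) -> Result<GameServerSpec, String> {
    Ok(GameServerSpec {
        // A truly required field is still an error when absent...
        container: fields
            .get("container")
            .ok_or_else(|| "missing required field `container`".to_string())?
            .to_string(),
        // ...but a CRD-optional field simply becomes None.
        scheduling: fields.get("scheduling").map(|s| s.to_string()),
    })
}

fn main() {
    let mut fields = HashMap::new();
    fields.insert("container", "game");
    // `scheduling` deliberately missing: older Agones versions may omit it.
    let spec = parse_spec(&fields).unwrap();
    assert_eq!(spec.scheduling, None);
    println!("{spec:?}");
}
```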
Or the other answer might be to go up to `kube-runtime` and have a special error type for deserialisation / model translation issues? I'm not entirely sure 🤔
Thoughts?
What happened:
The Quilkin Agent failed to parse the list of gameserver specs from Agones:
and then it retries 10 times:
and after that the agent restarts, which is quite disruptive for the gameserver cluster, as new games fail to get connected to a Quilkin proxy during the downtime.
What you expected to happen:
That the Quilkin Agent can handle Agones gameserver specs that are missing fields, whether because they come from an older version or have been corrupted (e.g. some service directly changing the spec in a way that makes it invalid). It should not constantly restart due to one invalid spec out of hundreds.
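One way to meet that expectation (a sketch under the assumption that the agent sees the gameservers as a list, not Quilkin's actual code): parse the list item-by-item, so a single bad spec is dropped with a warning and surfaced as a count, while the hundreds of valid ones keep flowing. `parse_item` below is a hypothetical stand-in for the real per-object deserialisation.

```rust
// Hypothetical per-item parser standing in for the real GameServer
// deserialisation; here anything containing "invalid" fails to parse.
fn parse_item(raw: &str) -> Result<String, String> {
    if raw.contains("invalid") {
        Err(format!("bad spec: {raw}"))
    } else {
        Ok(raw.to_string())
    }
}

// Parse each item independently so one invalid spec doesn't fail
// (and restart) the whole agent; return the dropped count so it can
// feed a metric instead of a crash loop.
fn parse_list(raw_items: &[&str]) -> (Vec<String>, usize) {
    let mut parsed = Vec::new();
    let mut dropped = 0;
    for raw in raw_items {
        match parse_item(raw) {
            Ok(gs) => parsed.push(gs),
            Err(e) => {
                eprintln!("warning: {e}");
                dropped += 1;
            }
        }
    }
    (parsed, dropped)
}

fn main() {
    let (parsed, dropped) = parse_list(&["gs-1", "invalid-gs", "gs-2"]);
    println!("{} parsed, {} dropped", parsed.len(), dropped);
}
```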
How to reproduce it (as minimally and precisely as possible):
`gameservers` namespace where the gameserver pods are running.

Anything else we need to know?:
Environment: