mstrukci opened this issue 2 years ago
Looks related to https://github.com/nolar/kopf/issues/955. We are seeing the same issue on gcloud v1.22.12 and kopf 1.35.4.
The first time we saw it was after kopf had been running for a few days.
The second time, today, coincided with a new node pool being added to our GKE cluster.
@mstrukci have you also set settings.watching.client_timeout (as https://github.com/nolar/kopf/issues/585 suggests)?
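For anyone landing here, a minimal sketch (not from this thread) of where these knobs live, i.e. setting the watch-stream timeouts from a kopf startup handler; the values are only illustrative:

```python
import kopf


@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    # Timeouts suggested in #585 for the watch-streams:
    settings.watching.connect_timeout = 60   # TCP connect to the API server
    settings.watching.server_timeout = 600   # ask the server to end each watch after 10 min
    # The setting asked about above: a client-side cap per watch request,
    # so a silently hung connection is dropped even if the server never answers.
    settings.watching.client_timeout = 610
```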
Hey @PidgeyBE, we hadn't set it initially. I've updated our operator with this setting and so far it hasn't frozen; however, the issue is not consistent, so it's hard to say whether this isn't just a coincidence. I'll update this thread after some more time has passed.
Hi @StrukcinskasMatas, my hypothesis is that it goes wrong when the Kubernetes API is unavailable for a few seconds (for example while a new node is added in Google Cloud). Unfortunately I'm on a long holiday now and haven't had the chance to try to reproduce it. If you have any way to simulate some downtime of the Kubernetes API, that would probably allow you to reproduce the issue...
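One rough way to simulate that (a sketch, not something tested in this thread): run a plain TCP pass-through proxy in front of the API server, point the operator's kubeconfig at it, and pause the proxy to fake an outage. The host, ports and SIGUSR1 toggle below are all hypothetical, and you will likely also need insecure-skip-tls-verify: true in the kubeconfig, since the API server certificate won't cover the proxy address.

```python
import asyncio
import signal

API_HOST = "203.0.113.10"   # hypothetical: the real API server endpoint
API_PORT = 443
LISTEN_PORT = 6443          # point the kubeconfig at https://127.0.0.1:6443

PAUSED = False              # while True, the "API" is down


def toggle_pause(*_):
    global PAUSED
    PAUSED = not PAUSED


async def pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    try:
        while True:
            data = await reader.read(65536)
            if not data or PAUSED:
                break                      # drop the stream mid-watch
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()


async def handle(client_r, client_w):
    if PAUSED:
        client_w.close()                   # refuse new connections while "down"
        return
    upstream_r, upstream_w = await asyncio.open_connection(API_HOST, API_PORT)
    await asyncio.gather(pipe(client_r, upstream_w), pipe(upstream_r, client_w))


async def main():
    server = await asyncio.start_server(handle, "127.0.0.1", LISTEN_PORT)
    async with server:
        await server.serve_forever()


if __name__ == "__main__":
    signal.signal(signal.SIGUSR1, toggle_pause)   # `kill -USR1 <pid>` toggles the outage
    asyncio.run(main())
```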
Looks like even with the client_timeout attribute set, the operator froze with the following event: event='Stopping the watch-stream for X cluster-wide.'
We will probably work around this by running a side service that sends events for the operator to process and restarts the operator if they are not picked up. In theory, this should cover us until the issue is fixed in kopf.
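A rough sketch of that watchdog idea, using the official kubernetes Python client. Everything here (the canary CRD, the label selector, the pokedAt convention) is hypothetical; it assumes the operator has a handler that copies spec.pokedAt into status.pokedAt whenever it processes the update.

```python
import time

from kubernetes import client, config

GROUP, VERSION, PLURAL = "example.com", "v1", "canaries"   # hypothetical canary CRD
NAMESPACE, NAME = "operators", "kopf-canary"
OPERATOR_POD_SELECTOR = "app=my-kopf-operator"             # hypothetical label
DEADLINE = 120                                             # seconds the operator gets to react


def main():
    config.load_incluster_config()      # or config.load_kube_config() outside the cluster
    crd = client.CustomObjectsApi()
    core = client.CoreV1Api()

    while True:
        stamp = str(int(time.time()))
        # Poke: change the spec so the operator's update handler fires.
        crd.patch_namespaced_custom_object(
            GROUP, VERSION, NAMESPACE, PLURAL, NAME,
            {"spec": {"pokedAt": stamp}},
        )
        time.sleep(DEADLINE)

        obj = crd.get_namespaced_custom_object(GROUP, VERSION, NAMESPACE, PLURAL, NAME)
        if obj.get("status", {}).get("pokedAt") != stamp:
            # The operator never reacted: assume frozen watchers and restart it.
            pods = core.list_namespaced_pod(NAMESPACE, label_selector=OPERATOR_POD_SELECTOR)
            for pod in pods.items:
                core.delete_namespaced_pod(pod.metadata.name, NAMESPACE)


if __name__ == "__main__":
    main()
```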
Any updates on this issue?
Saw this issue happening again just now (in gcloud, I assume while scaling up a node), but because we have:
settings.watching.connect_timeout = 60
settings.watching.server_timeout = 600
settings.watching.client_timeout = 610
the operator started working again after 10 minutes...
I've seen this occur as well, and in my case, the watcher tasks died. Therefore, setting a client timeout had no impact as the httpx client wasn't even being utilized. The operator was basically a zombie with no active watcher tasks.
Long story short
A handler of our custom resources stops responding after some time, despite having connect_timeout and server_timeout settings configured (referring to #585). Is there any way to debug this issue? Can we set up a liveness probe for such handlers to get an indicator when they are 'frozen'?
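On the liveness-probe question: kopf does expose a health endpoint when started with kopf run --liveness=http://0.0.0.0:8080/healthz ..., and its response is assembled from @kopf.on.probe() handlers, so a Deployment can point an httpGet livenessProbe at that port. A minimal sketch follows; whether this endpoint actually stops responding in the frozen-watcher state described in this issue is not confirmed anywhere in the thread.

```python
import datetime

import kopf


@kopf.on.probe(id="now")
def get_current_timestamp(**_):
    # Included in the /healthz response body; useful mostly as a heartbeat marker.
    return datetime.datetime.utcnow().isoformat()
```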
Kopf version
1.35.4
Kubernetes version
v1.22.8
Python version
3.7.3
Code
No response
Logs
No response
Additional information
No response