nolar / kopf

A Python framework to write Kubernetes operators in just a few lines of code
https://kopf.readthedocs.io/
MIT License

Handler stops responding after some time #957

Open mstrukci opened 2 years ago

mstrukci commented 2 years ago

Long story short

The handler for our custom resources stops responding after some time, despite having the connect_timeout and server_timeout settings configured (referring to #585).

Is there any way to debug this issue? Can we set up a liveness probe for such handlers to get an indicator when they are 'frozen'?
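(For context, not part of the original report: kopf does ship a liveness endpoint plus probe handlers, which can at least signal when the operator process as a whole is unhealthy. A minimal sketch, assuming the operator is launched with the --liveness option, e.g. kopf run --liveness=http://0.0.0.0:8080/healthz handlers.py:

    import datetime
    import kopf

    @kopf.on.probe(id='now')
    def get_current_timestamp(**kwargs):
        # Values returned by probe handlers are exposed as JSON
        # at the liveness endpoint configured via --liveness.
        return datetime.datetime.utcnow().isoformat()

The pod's livenessProbe can then do an httpGet against /healthz. As noted later in this thread, a bare HTTP 200 check may not catch the case where only the watcher tasks have died, so the probe values themselves are worth checking.)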

Kopf version

1.35.4

Kubernetes version

v1.22.8

Python version

3.7.3

Code

No response

Logs

No response

Additional information

No response

PidgeyBE commented 2 years ago

Looks related to https://github.com/nolar/kopf/issues/955

We are seeing the same issue on GKE v1.22.12 and kopf 1.35.4. The first time we saw it was after kopf had been running for a few days. The second time, today, happened, by coincidence, at the same time a new node pool was added to our GKE cluster.

PidgeyBE commented 2 years ago

@mstrukci have you set settings.watching.client_timeout as well (as https://github.com/nolar/kopf/issues/585 suggests)?
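For reference, a minimal sketch of where these settings live; the values mirror the ones quoted later in this thread and in #585, not a recommendation:

    import kopf

    @kopf.on.startup()
    def configure(settings: kopf.OperatorSettings, **kwargs):
        # Fail fast if a TCP connection to the API server cannot be established.
        settings.watching.connect_timeout = 60
        # Ask the API server to close each watch request after this many seconds.
        settings.watching.server_timeout = 600
        # Client-side cap on the whole watch request; keep it slightly above the
        # server timeout so a stalled stream is eventually aborted and retried.
        settings.watching.client_timeout = 610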

StrukcinskasMatas commented 2 years ago

Hey @PidgeyBE, we hadn't set it initially. I've updated our operator with this setting and so far it has not frozen; however, the issue is not consistent, so it's hard to say whether this isn't just a coincidence. I'll update this thread after some more time passes.

PidgeyBE commented 2 years ago

Hi @StrukcinskasMatas, my hypothesis is that it goes wrong when the Kubernetes API is unavailable for a few seconds (for example, when a new node is added in Google Cloud). Unfortunately, I'm on (a long) holiday now and haven't had the chance to try to reproduce it. If you have any way to simulate some downtime of the Kubernetes API, that would probably allow you to reproduce the issue...

StrukcinskasMatas commented 2 years ago

It looks like even with the client_timeout attribute set, the operator froze after the following log event: event='Stopping the watch-stream for X cluster-wide.'

We will probably work around this issue with a side service that sends events for the operator to process and restarts the operator if they are not picked up. In theory, this should solve our issue until it gets fixed in kopf.
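A rough sketch of such a watchdog sidecar; the group/version/plural, namespace, names, and labels are all illustrative, it assumes the official kubernetes Python client with in-cluster credentials, and it assumes the operator has a handler that copies the ping annotation into the sentinel's status:

    import datetime
    import time

    import kubernetes

    GROUP, VERSION, PLURAL = "example.com", "v1", "sentinels"
    NAMESPACE, NAME = "operators", "watchdog-sentinel"
    OPERATOR_LABELS = "app=my-kopf-operator"
    DEADLINE_SECONDS = 120

    def main() -> None:
        kubernetes.config.load_incluster_config()
        crd_api = kubernetes.client.CustomObjectsApi()
        core_api = kubernetes.client.CoreV1Api()

        while True:
            # Touch the sentinel; the operator is expected to react to this change
            # (e.g. an update handler that copies the ping into .status.pong).
            ping = datetime.datetime.utcnow().isoformat()
            crd_api.patch_namespaced_custom_object(
                GROUP, VERSION, NAMESPACE, PLURAL, NAME,
                body={"metadata": {"annotations": {"example.com/watchdog-ping": ping}}},
            )

            time.sleep(DEADLINE_SECONDS)

            obj = crd_api.get_namespaced_custom_object(
                GROUP, VERSION, NAMESPACE, PLURAL, NAME)
            pong = (obj.get("status") or {}).get("pong")

            if pong != ping:
                # The operator did not pick up the change in time: delete its pod(s)
                # and let the Deployment recreate them.
                pods = core_api.list_namespaced_pod(
                    NAMESPACE, label_selector=OPERATOR_LABELS)
                for pod in pods.items:
                    core_api.delete_namespaced_pod(pod.metadata.name, NAMESPACE)

    if __name__ == "__main__":
        main()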

bernardio commented 1 year ago

Any updates on this issue?

PidgeyBE commented 1 year ago

I've seen this issue happening again now (in gcloud, I assume when a node was being scaled up), but because we have:

    settings.watching.connect_timeout = 60
    settings.watching.server_timeout = 600
    settings.watching.client_timeout = 610

the operator started working again after about 10 minutes, presumably because the 610-second client_timeout eventually aborts the stalled watch request and the watcher reconnects...

james-mchugh commented 2 months ago

I've seen this occur as well, and in my case, the watcher tasks died. Therefore, setting a client timeout had no impact as the httpx client wasn't even being utilized. The operator was basically a zombie with no active watcher tasks.
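One way to make that zombie state observable from the outside is to expose a "seconds since the last watch event" value through a kopf probe and have an external checker act on it. A sketch, assuming the operator runs with --liveness and that the watched resources produce a reasonably steady stream of events (for example, the sentinel pings from the workaround above); the resource name is illustrative:

    import datetime
    import kopf

    # Updated on every watch event for the resource; if the watcher task dies,
    # this timestamp stops moving.
    _last_event = datetime.datetime.utcnow()

    @kopf.on.event('example.com', 'v1', 'sentinels')
    def note_activity(**kwargs):
        global _last_event
        _last_event = datetime.datetime.utcnow()

    @kopf.on.probe(id='seconds_since_last_event')
    def seconds_since_last_event(**kwargs):
        # Exposed in the JSON served by the liveness endpoint. A plain HTTP 200
        # check may not catch dead watchers, since the endpoint is served by a
        # separate task; the checker should look at this value and restart the
        # operator when it grows beyond a threshold.
        return (datetime.datetime.utcnow() - _last_event).total_seconds()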