nolar / kopf

A Python framework to write Kubernetes operators in just a few lines of code
https://kopf.readthedocs.io/
MIT License

controller built with kopf is stuck after being idle for 5-10 min #847

Open shivamverma182 opened 3 years ago

shivamverma182 commented 3 years ago

Long story short

I have created a controller which creates a custom object. It works fine as long as new custom objects are being created, but if it stays idle for more than 5 minutes, the operator hangs and no longer responds to newly created custom objects.

I have checked that it is able to connect to the Kubernetes API server, and I also tried the setting below.

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    settings.watching.server_timeout = 300

Kopf version

1.35.1

Kubernetes version

1.20.9

Python version

3.7

Code

import kopf
from kubernetes import client
import os
import yaml
import time

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    settings.watching.server_timeout = 300

@kopf.on.create("subnamespacerequests")
def create_fn(spec, name, namespace, logger, **kwargs):

    labels = spec.get("labels")
    logger.debug(f"labels recieved: {labels}")
    annotations = spec.get("annotations")
    logger.debug(f"annotations receibed: {annotations}")

    path = os.path.join(os.path.dirname(__file__), "subnamespace.yaml")
    tmpl = open(path, "rt").read()
    text = tmpl.format(subnamespace=name, parentnamespace=namespace)
    data = yaml.safe_load(text)

    kopf.adopt(data)

    api = client.CustomObjectsApi()
    subns = api.create_namespaced_custom_object(
        group="hnc.x-k8s.io",
        version="v1alpha2",
        namespace=namespace,
        plural="subnamespaceanchors",
        body=data,
    )

    logger.debug(f"subnamespace {name} is created under namespace {namespace}")
    time.sleep(1)

    v1 = client.CoreV1Api()

    for item in v1.list_namespace().items:
        if item.metadata.name == name:
            logger.debug(f"Namespace {name} is created")
            ns = item

    logger.debug(f"existing labels: {ns.metadata.labels}")
    logger.debug(f"existing annotations: {ns.metadata.annotations}")
    ns.metadata.labels.update(labels)
    ns.metadata.annotations.update(annotations)
    patched_labels = ns.metadata.labels
    patched_annotations = ns.metadata.annotations
    logger.debug(f"patched labels: {patched_labels}")
    logger.debug(f"patched annotations: {patched_annotations}")
    patched_body = {
        "metadata": {"labels": patched_labels, "annotations": patched_annotations}
    }
    logger.debug(f"Patch body: {patched_body} ")

    patched_ns = v1.patch_namespace(name=name, body=patched_body)
    logger.debug(f"Namepace after adding labels: {patched_ns}")

    logger.info(
        f"Namespace {name} has been patched with requested labels and annotations"
    )

    name = subns.get("metadata").get("name")

    return {"subnamespace": name}

Logs

[2021-10-12 13:43:21,624] asyncio              [DEBUG   ] <_SelectorSocketTransport fd=7 read=polling write=<idle, bufsize=0>>: Fatal read error on socket transport
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/asyncio/selector_events.py", line 814, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer
[2021-10-12 13:43:21,626] asyncio              [DEBUG   ] connect <socket.socket fd=7, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('0.0.0.0', 0)> to ('10.0.0.1', 443)
[2021-10-12 13:43:21,628] asyncio              [DEBUG   ] poll 300378.628 ms took 1.245 ms: 1 events
[2021-10-12 13:43:21,629] asyncio              [DEBUG   ] <asyncio.sslproto.SSLProtocol object at 0x7f659f5dce90> starts SSL handshake
[2021-10-12 13:43:21,653] asyncio              [DEBUG   ] poll 59999.448 ms took 23.555 ms: 1 events
[2021-10-12 13:43:21,655] asyncio              [DEBUG   ] <asyncio.sslproto.SSLProtocol object at 0x7f659f5dce90>: SSL handshake took 25.8 ms
[2021-10-12 13:43:21,656] asyncio              [DEBUG   ] <socket.socket fd=7, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.15.232.86', 56862), raddr=('10.0.0.1', 443)> connected to 10.0.0.1:443: (<asyncio.sslproto._SSLProtocolTransport object at 0x7f659f4ff790>, <aiohttp.client_proto.ResponseHandler object at 0x7f659f4e7d70>)

Additional information

No response

shivamverma182 commented 3 years ago

Hello @nolar ,

Just FYI, the Kubernetes cluster we are using is AKS (Azure Kubernetes Service) 1.20.9.

Rybue commented 3 years ago

Hello!

I'm having the same issue with Azure Kubernetes Service (version 1.21).

My controller works fine for some time, but then it starts logging "Connection reset by peer" errors, and after a while it gets stuck and stops reacting to new events.

The polls take quite a long time:

[2021-10-26 13:44:01,569] kopf.objects         [DEBUG   ] [default/python37-test8] Handling cycle is finished, waiting for new changes.
[2021-10-26 13:44:06,574] asyncio              [INFO    ] poll 4999.486 ms took 5004.219 ms: timeout
[2021-10-26 13:48:33,326] asyncio              [INFO    ] poll took 266750.484 ms: 1 events
[2021-10-26 13:48:33,327] asyncio              [DEBUG   ] <_SelectorSocketTransport fd=9 read=polling write=<idle, bufsize=0>>: Fatal read error on socket transport
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/asyncio/selector_events.py", line 814, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer
[2021-10-26 13:48:33,335] asyncio              [DEBUG   ] poll took 7.351 ms: 1 events
[2021-10-26 13:48:33,335] asyncio              [DEBUG   ] <_SelectorSocketTransport fd=6 read=polling write=<idle, bufsize=0>>: Fatal read error on socket transport
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/asyncio/selector_events.py", line 814, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer
[2021-10-26 14:16:24,389] asyncio              [INFO    ] poll took 1671053.525 ms: 1 events
[2021-10-26 14:16:24,393] asyncio              [DEBUG   ] poll took 0.569 ms: 1 events

After that, if I add a new custom resource, the controller does not log anything and does not react.

shivamverma182 commented 3 years ago

@Rybue,

Try the settings below. They worked in our case.

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    settings.watching.server_timeout = 60
    settings.watching.connect_timeout = 60

Rybue commented 3 years ago

Thanks @shivamverma182, I will try it this week

mickybart commented 3 years ago

I have the same issue on AKS 1.21.2 (kopf 1.35.2, Python 3.9). I'm trying @shivamverma182's settings right now and will report back soon.

I also got a similar short-timeout issue on AKS with a "watch-only" operator that did not involve kopf.

Thanks for this report!

mickybart commented 3 years ago

It works with a 60-second timeout. I tried 180 seconds (3 min) and it works too.

The reason for this has been discussed here: https://github.com/Azure/AKS/issues/1755

It seems that after around 4 min 10 sec of idleness, the connection is dropped by a network component (SLB/BLB). It sends a TCP reset, but this is not detected by kopf, which is then not able to reconnect. https://github.com/Azure/AKS/issues/1755#issuecomment-712804832

It seems that TCP keepalive needs to be set: https://github.com/Azure/AKS/issues/1755#issuecomment-719329135

But for now the following settings seem to work (something around 225 seconds should do it too):

settings.watching.server_timeout = 180
settings.watching.connect_timeout = 180
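
For completeness, the same workaround as a self-contained startup handler (a sketch using the 180-second values from above; any value safely below the ~4-minute idle cutoff should behave the same):

import kopf

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    # Make the watch-streams reconnect well before the Azure SLB/BLB
    # silently drops idle connections (after roughly 4 min 10 s of inactivity).
    settings.watching.server_timeout = 180
    settings.watching.connect_timeout = 180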

nolar commented 3 years ago

I have looked into how SO_KEEPALIVE can be activated client-side in aiohttp or asyncio. It seems not so straightforward, and the complications come from the asyncio level:

aiohttp seems to use a little trick that works on the default event loop of asyncio, but is not guaranteed to work on alternative event loops (e.g. uvloop; to be checked):

But that flag — tcp_keepalive=True — is only available for aiohttp servers, not clients. And it is "on" by default, by the way.

Worth noting: aiohttp also mentions some "keepalive connections" — but this looks more like socket reuse with possible cooldown times between uses (if I understood it properly). This is not the same as "keepalive sockets".
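
For illustration, this is roughly what "keepalive sockets" means at the plain-socket level (a minimal sketch assuming Linux; the TCP_KEEP* knobs and the chosen values are illustrative, not something kopf or aiohttp sets today):

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)     # enable keepalive probes on this socket
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # seconds of idleness before the first probe
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)  # seconds between subsequent probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # failed probes before the connection is declared dead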


Yet it seems possible to create a client-side TCPConnector that overrides that socket flag for Kopf's own connections (primarily the streaming ones, but also all of them for simplicity). I have made a PR — can you please try it?

To install it from a git branch:

pip install git+https://github.com/nolar/kopf.git@so-keepalive

The diff from 1.35.2 to this branch (for review):

When trying this, please disable the client_timeout/server_timeout settings — they were used as a workaround for this problem before, so they might affect the experiment.

mickybart commented 3 years ago

Sorry for the delay.

So I did the test, and unfortunately it is not working with so-keepalive. I have the same behavior with or without it.

nolar commented 3 years ago

That’s sad. Thank you for trying so fast.

I hoped that it would be an easy fix for this known problem, and that there would be no need for Plan B. However, Plan B is the way, it seems.

For Plan B, I wanted to introduce a new setting: “inactivity_timeout”. Unlike the existing client/server per-request timeouts, this one can have a meaningful default out of the box (e.g. 5 mins).

The nuance with the existing client/server timeouts is that they are for the good old HTTP request — they ignore the long-living nature of Kubernetes streams. So, they become the time cap of the whole stream.

Too low timeouts will cause frequent reconnects which will cause too much noise and unnecessary processing on large clusters. Too high timeouts are useless. This is why there are no defaults.

With the inactivity timeout, 5 mins (or even 2-3 mins) sounds good enough: on large clusters, inactivity will barely ever be seen, so it is a good sign of the connection loss; on small clusters, inactivity is normal, but reconnections will barely be noticeable, as there is not much to process.
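
To sketch the idea (a hypothetical illustration only: “inactivity_timeout” is not an existing Kopf setting at this point, and open_stream/handle_event are assumed placeholders for opening a watch-stream and processing one event):

import asyncio

INACTIVITY_TIMEOUT = 300  # an assumed default of 5 minutes

async def watch_with_inactivity_timeout(open_stream, handle_event):
    # `open_stream` is assumed to (re)open a watch-stream and return an async
    # iterator of events; `handle_event` is assumed to process a single event.
    while True:
        stream = await open_stream()
        iterator = stream.__aiter__()
        try:
            while True:
                # The timer is re-armed for every event, so only true silence triggers it.
                event = await asyncio.wait_for(
                    iterator.__anext__(), timeout=INACTIVITY_TIMEOUT)
                await handle_event(event)
        except StopAsyncIteration:
            continue  # the server ended the watch normally; reconnect
        except asyncio.TimeoutError:
            continue  # no events for too long; assume a silently dropped connection and reconnect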

mickybart commented 2 years ago

FYI, it is possible to configure the TCP reset / idle timeout on the AKS Standard Load Balancer. Unfortunately, I can NOT test it, as we have an old setup with a Basic Load Balancer and no way to migrate to SLB (except destroy/recreate).

https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-tcp-reset#configurable-tcp-idle-timeout

By default it is set to 4 min, as already discussed, but it can go up to 30/100 min for inbound/outbound.

That doesn't fix the issue, but it can provide some flexibility to work around it.