shivamverma182 opened 3 years ago
Hello @nolar,
Just FYI, the Kubernetes we are using is AKS (Azure Kubernetes Service) 1.20.9.
Hello!
I'm having the same issue with Azure Kubernetes Service (version 1.21).
My controller works fine for some time, but then it starts logging "Connection reset by peer"
errors, and after a while it gets stuck and stops reacting to new events.
The polls take quite a long time:
[2021-10-26 13:44:01,569] kopf.objects [DEBUG ] [default/python37-test8] Handling cycle is finished, waiting for new changes.
[2021-10-26 13:44:06,574] asyncio [INFO ] poll 4999.486 ms took 5004.219 ms: timeout
[2021-10-26 13:48:33,326] asyncio [INFO ] poll took 266750.484 ms: 1 events
[2021-10-26 13:48:33,327] asyncio [DEBUG ] <_SelectorSocketTransport fd=9 read=polling write=<idle, bufsize=0>>: Fatal read error on socket transport
Traceback (most recent call last):
File "/usr/local/lib/python3.7/asyncio/selector_events.py", line 814, in _read_ready__data_received
data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer
[2021-10-26 13:48:33,335] asyncio [DEBUG ] poll took 7.351 ms: 1 events
[2021-10-26 13:48:33,335] asyncio [DEBUG ] <_SelectorSocketTransport fd=6 read=polling write=<idle, bufsize=0>>: Fatal read error on socket transport
Traceback (most recent call last):
File "/usr/local/lib/python3.7/asyncio/selector_events.py", line 814, in _read_ready__data_received
data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer
[2021-10-26 14:16:24,389] asyncio [INFO ] poll took 1671053.525 ms: 1 events
[2021-10-26 14:16:24,393] asyncio [DEBUG ] poll took 0.569 ms: 1 events
After that, if I add a new CRD resource, the controller does not log anything and does not react.
@Rybue,
Try the settings below. They worked in our case:
@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    settings.watching.server_timeout = 60
    settings.watching.connect_timeout = 60
Thanks @shivamverma182, I will try it this week
I have the same issue on AKS 1.21.2 (kopf 1.35.2, Python 3.9). I'm trying @shivamverma182's settings right now and will report back soon.
I got a similar short-timeout issue on AKS with a "watch-only" operator that doesn't involve kopf at all.
Thanks for this report!
It works with the 60-second timeout. I tried 180 seconds (3 min) and it works too.
The reason for this has been discussed here: https://github.com/Azure/AKS/issues/1755
It seems that after around 4 min 10 sec of idleness, the connection is dropped by a network component (SLB/BLB). A TCP reset is sent, but it is not detected by kopf, which is then unable to reconnect. https://github.com/Azure/AKS/issues/1755#issuecomment-712804832
It seems that TCP Keepalive needs to be set: https://github.com/Azure/AKS/issues/1755#issuecomment-719329135
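For reference, "setting TCP keepalive" boils down to a few socket options. A minimal sketch in plain Python (the TCP_KEEP* constants are Linux-specific, and the intervals here are purely illustrative; they just need to stay below the ~4-minute idle cutoff):

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)     # enable keepalive probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # first probe after 60s of idleness
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # then probe every 10s
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 6)     # declare the peer dead after 6 failed probes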
But for now, these settings seem to work (something around 225 should do it too):

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    settings.watching.server_timeout = 180
    settings.watching.connect_timeout = 180
I have looked into how SO_KEEPALIVE can be activated client-side in aiohttp or asyncio. It is not so straightforward, and the complications come from the asyncio level:
aiohttp seems to use a little trick that works on the default asyncio event loop, but is not guaranteed to work on alternative event loops (e.g. uvloop; to be checked):
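Roughly, the trick is to pull the raw socket out of the asyncio transport and set the flag on it, which only works if the transport exposes the socket at all. A paraphrased sketch (not aiohttp's verbatim code):

import asyncio
import socket

def enable_tcp_keepalive(transport: asyncio.BaseTransport) -> None:
    sock = transport.get_extra_info("socket")
    if sock is not None:  # alternative event loops/transports may return None here
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)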
But that flag, tcp_keepalive=True, is only available for aiohttp servers, not clients. And it is "on" by default, by the way.
Worth noting: aiohttp also mentions some "keepalive connections", but this looks more like socket reuse with possible cooldown times between uses (if I understood it properly). That is not the same as keepalive sockets.
Yet it seems possible to create a client-side TCPConnector that overrides that socket flag for Kopf's own connections (primarily the streaming ones, but also all of them for simplicity). I have made a PR; can you please try it?
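The idea is roughly as follows. This is a sketch of the approach, not necessarily the PR's exact code, and it overrides a private aiohttp method, so it may break across aiohttp versions:

import socket
import aiohttp

class KeepAliveConnector(aiohttp.TCPConnector):
    # Enable SO_KEEPALIVE on every new connection created by this connector.
    async def _wrap_create_connection(self, *args, **kwargs):
        transport, protocol = await super()._wrap_create_connection(*args, **kwargs)
        sock = transport.get_extra_info("socket")
        if sock is not None:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        return transport, protocol

# The session used for the Kubernetes API calls would then be created as:
# session = aiohttp.ClientSession(connector=KeepAliveConnector())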
To install it from a git branch:
pip install git+https://github.com/nolar/kopf.git@so-keepalive
The diff from 1.35.2 to this branch (for review):
When trying this, please disable the client_timeout/server_timeout settings: they were used as a workaround for this problem before, so they might affect the experiment.
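In other words, the startup handler for the experiment should leave the watching timeouts at their defaults, roughly like this:

import kopf

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    # settings.watching.server_timeout = 180   # disabled for the experiment
    # settings.watching.connect_timeout = 180  # disabled for the experiment
    pass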
Sorry for the delay
So I did the test, and unfortunately it is not working with so-keepalive. I have the same behavior with or without it.
That’s sad. Thank you for trying so fast.
I hoped that it would be an easy fix for this known problem and that there would be no need for a Plan B. However, Plan B is the way, it seems.
For Plan B, I wanted to introduce a new setting: "inactivity_timeout". Unlike the client/server per-request timeouts, this one can have a meaningful default out of the box (e.g. 5 minutes).
The nuance with the existing client/server timeouts is that they are meant for good old one-shot HTTP requests and ignore the long-lived nature of Kubernetes watch streams. As a result, they become a time cap on the whole stream.
Timeouts that are too low cause frequent reconnects, which means too much noise and unnecessary processing on large clusters. Timeouts that are too high are useless. This is why there are no defaults.
With the inactivity timeout, 5 mins (or even 2-3 mins) sounds good enough: on large clusters, inactivity will barely ever be seen, so it is a good sign of the connection loss; on small clusters, inactivity is normal, but reconnections will barely be noticeable, as there is not much to process.
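A rough sketch of the idea (the names here are illustrative, not Kopf's actual internals): wrap every read from the watch stream into a timeout, and treat its expiration as a lost connection:

import asyncio

INACTIVITY_TIMEOUT = 300.0  # e.g. 5 minutes

async def get_watch_event(queue: asyncio.Queue):
    # If nothing arrives within the inactivity timeout, assume the connection
    # is silently dead and let the caller reconnect the watch stream.
    try:
        return await asyncio.wait_for(queue.get(), timeout=INACTIVITY_TIMEOUT)
    except asyncio.TimeoutError:
        raise ConnectionError("No watch events within the inactivity timeout; reconnecting.")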
FYI, it is possible to configure the TCP reset/idle timeout on the AKS Standard Load Balancer. Unfortunately, I can NOT test it, as we have an old setup with a Basic Load Balancer and no way to migrate to SLB (except destroy/recreate).
By default, it is set to 4 minutes, as already discussed, but it can go up to 30/100 minutes inbound/outbound.
That doesn't fix the issue, but it can provide flexibility to work around it.
Long story short
I have created a controller which creates a custom object. It works fine as long as new custom objects are being created. But if it stays idle for more than 5 minutes, the operator hangs and does not respond to newly created custom objects.
I have checked that it is able to connect to the Kubernetes API server, and I also tried setting the below:
@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    settings.watching.server_timeout = 300
Kopf version
1.35.1
Kubernetes version
1.20.9
Python version
3.7
Code
Logs
Additional information
No response