kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

HTTP2_PING_TIMEOUT_SECONDS+HTTP2_READ_IDLE_TIMEOUT_SECONDS > 40s will cause all node NotReady #121793

Open chenk008 opened 7 months ago

chenk008 commented 7 months ago

What happened?

The controller-manager's node-lifecycle-controller watches node Leases and updates the probeTimestamp in its nodeHealthMap, then checks whether each node's probeTimestamp is before now - node-monitor-grace-period. The node-monitor-grace-period default value is 40s.

When the client connection is lost, the lease watcher cannot receive new events. With the HTTP/2 health check, the HTTP2_PING_TIMEOUT_SECONDS default is 30s and the HTTP2_READ_IDLE_TIMEOUT_SECONDS default is 15s, so it can take up to 45s to close the dead HTTP/2 connection.

During this period, the node-lifecycle-controller will mark all nodes NotReady.
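
For context, a minimal Go sketch (not the controller-manager's actual wiring) of how these two knobs map onto the golang.org/x/net/http2 transport that client-go configures, and why their sum is the worst-case time to detect a dead connection. The per-knob values below are simply the ones quoted above and may differ by client-go version; their sum (~45s) exceeding the 40s grace period is the point.

```go
package main

import (
	"fmt"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	// Values as quoted in this issue; their sum is what matters here.
	readIdle := 15 * time.Second    // HTTP2_READ_IDLE_TIMEOUT_SECONDS: idle time before a health-check PING is sent
	pingTimeout := 30 * time.Second // HTTP2_PING_TIMEOUT_SECONDS: how long to wait for the PING ack

	tr := &http.Transport{}
	h2, err := http2.ConfigureTransports(tr)
	if err != nil {
		panic(err)
	}
	h2.ReadIdleTimeout = readIdle
	h2.PingTimeout = pingTimeout

	grace := 40 * time.Second // node-monitor-grace-period default
	detection := readIdle + pingTimeout
	fmt.Printf("worst-case dead-connection detection: %v, grace period: %v\n", detection, grace)
	if detection > grace {
		fmt.Println("the lease watch can stall longer than the grace period, so nodes may be marked NotReady")
	}
}
```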

Here are some logs

I1103 23:00:28.464015       1 node_lifecycle_controller.go:1097] node 10.0.10.240 hasn't been updated for 40.010894415s. Last Ready is: &NodeCondition{Type:Ready,Status:True,LastHeartbeatTime:2023-11-03 22:57:18 +0800 CST,LastTransitionTime:2023-04-21 14:39:17 +0800 CST,Reason:KubeletReady,Message:kubelet is posting ready status,}

W1103 23:00:41.408493       1 reflector.go:441] k8s.io/client-go/informers/factory.go:134: watch of *v1.Lease ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding

What did you expect to happen?

When the HTTP/2 connection is lost, the watch should stop as soon as possible, the reflector should re-list resources, and the node-lifecycle-controller should not mark all nodes NotReady.

HTTP2_PING_TIMEOUT_SECONDS+HTTP2_READ_IDLE_TIMEOUT_SECONDS should be less than 40s.

How can we reproduce it (as minimally and precisely as possible)?

  1. Find the controller-manager node-lifecycle-controller watch client srcIP:srcPort, then drop all TCP packets sent to srcIP:srcPort.
  2. Wait for more than 40 seconds; all nodes will be marked NotReady.
  3. All nodes will become Ready again after a few seconds.

Anything else we need to know?

No response

Kubernetes version

```console
$ kubectl version
# Server Version: version.Info{Major:"1", Minor:"28+", GitVersion:"v1.28.3", GitCommit:"", GitTreeState:"clean", BuildDate:"2023-10-19T07:10:40Z", GoVersion:"go1.20.10", Compiler:"gc", Platform:"linux/amd64"}
```

Cloud provider

OS version

```console
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
```

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

chenk008 commented 7 months ago

How about double-checking the node Lease before marking a node NotReady? The node-lifecycle-controller could fetch a reference Lease directly from the apiserver with resourceVersion="" (a quorum read).
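
A minimal sketch of what that double-check could look like with client-go; the helper name and wiring are hypothetical, and a GET with ResourceVersion "" asks the apiserver for the most recent data rather than allowing a cached response:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// isLeaseFresh does a direct GET of the node's Lease (ResourceVersion "" requests
// the most recent data) and reports whether it was renewed within the grace period.
// Hypothetical helper, not the controller's actual code.
func isLeaseFresh(ctx context.Context, client kubernetes.Interface, nodeName string, grace time.Duration) (bool, error) {
	lease, err := client.CoordinationV1().Leases("kube-node-lease").Get(ctx, nodeName, metav1.GetOptions{ResourceVersion: ""})
	if err != nil {
		return false, err
	}
	if lease.Spec.RenewTime == nil {
		return false, nil
	}
	return time.Since(lease.Spec.RenewTime.Time) < grace, nil
}

func main() {
	// Assumption: kubeconfig at the default location; node name taken from the logs above.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	fresh, err := isLeaseFresh(context.TODO(), client, "10.0.10.240", 40*time.Second)
	fmt.Println("lease fresh:", fresh, "err:", err)
}
```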

neolit123 commented 7 months ago

/sig node

chenk008 commented 7 months ago

/sig api-machinery

aojea commented 7 months ago

/cc @aojea @liggitt

We have discussed these timeout values before (@andrewsykim, @linxiulei, and myself); I also think @linxiulei got some metrics with lower timeouts for the HTTP/2 connections.

In this situation it is clear that the node watcher's defaults and the HTTP/2 timeouts are related, and having good defaults would be ideal, but the question is: which values do we change?

chenk008 commented 7 months ago

@aojea I think that on an internal network the ping response almost always returns within 1 second, so the HTTP2_PING_TIMEOUT_SECONDS default value should be 5 seconds. The total timeout would then be 35 seconds. WDYT?
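
For what it's worth, a hedged sketch of how a client-go consumer could tighten these knobs today. It assumes the HTTP2_* environment variables are read when the transport is built (as I believe recent client-go does); for kube-controller-manager you would instead set them in the process environment rather than in code.

```go
package main

import (
	"fmt"
	"os"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Values from the suggestion above: 30s + 5s = 35s, which stays under the
	// 40s node-monitor-grace-period. Assumption: these env vars are read at
	// transport construction time, so they must be set before the client is built.
	os.Setenv("HTTP2_READ_IDLE_TIMEOUT_SECONDS", "30")
	os.Setenv("HTTP2_PING_TIMEOUT_SECONDS", "5")

	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	fmt.Println("client built with tightened HTTP/2 health-check timeouts:", client != nil)
}
```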

SergeyKanzhelev commented 7 months ago

/triage accepted

> We have discussed these timeout values before (@andrewsykim, @linxiulei, and myself); I also think @linxiulei got some metrics with lower timeouts for the HTTP/2 connections.
>
> In this situation it is clear that the node watcher's defaults and the HTTP/2 timeouts are related, and having good defaults would be ideal, but the question is: which values do we change?

@aojea any pointers to other places where these timeouts were changed for HTTP/2? Any notes from those discussions?

SergeyKanzhelev commented 7 months ago

/priority important-longterm

Since it is not a regression, I will keep it at long-term priority and not higher. But we can reevaluate if we hear more reports about it.

Vincentzs commented 7 months ago

/assign

jmcmeek commented 1 month ago

I'm investigating problems similar to this, but I'm looking for a way to confirm that this is what is happening. If there are tips for how to recreate such a scenario, that would be very helpful.

We get infrequent reports of "worker goes NotReady for a short time" incidents that appear to be triggered by events like failing over an LB sitting in front of the kube-apiservers.

We've done some tests with a simple client-go app and have recreated a scenario in which the app logs http2: client connection lost messages after 45 secs of "context exceeded" errors.
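
For reference, a minimal sketch of the kind of client-go lease watcher used in those tests; the namespace, kubeconfig path, and logging are assumptions, not our exact test app.

```go
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: kubeconfig at the default location.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	for {
		w, err := client.CoordinationV1().Leases("kube-node-lease").Watch(context.TODO(), metav1.ListOptions{})
		if err != nil {
			// Once the upstream is cut off, errors such as
			// "http2: client connection lost" eventually surface here.
			log.Printf("watch failed: %v", err)
			time.Sleep(time.Second)
			continue
		}
		for ev := range w.ResultChan() {
			log.Printf("event %s at %s", ev.Type, time.Now().Format(time.RFC3339))
		}
		log.Printf("watch channel closed, re-watching")
	}
}
```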

The failures reported to us by customers don't have http2 errors in the kubelet logs, and our attempts to recreate this with kubelet have never shown more than a single error, which succeeds on a subsequent retry.