kubernetes-csi / livenessprobe

A sidecar container that can be included in a CSI plugin pod to enable integration with Kubernetes Liveness Probe.

New csi-lib-utils/connection.Connect logic can cause permanent CSI plugin outage #236

Closed ejweber closed 6 months ago

ejweber commented 7 months ago

The livenessprobe code expects to try forever to connect with the CSI plugin via csi.sock on startup.

https://github.com/kubernetes-csi/livenessprobe/blob/33ea6c01f303d03e7e493e8e4622c08ce11722b8/cmd/livenessprobe/main.go#L142-L147
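For context, the startup path in question looks roughly like this (a paraphrase of the linked lines, not an exact copy; `connlib` is `csi-lib-utils/connection`):

```go
// Paraphrased from cmd/livenessprobe/main.go. The in-tree comment says connlib
// is expected to retry forever, which is why any returned error is treated as fatal.
csiConn, err := connlib.Connect(*csiAddress, metricsManager)
if err != nil {
	// With the old csi-lib-utils behavior this branch was effectively unreachable.
	// With the new 30-second timeout it can fire while the CSI plugin is still
	// starting up, crashing the probe container.
	klog.ErrorS(err, "Failed to establish connection to CSI driver")
	klog.FlushAndExit(klog.ExitFlushTimeout, 1)
}
```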

However, this commit recently picked up a change in csi-lib-utils that makes Connect return an error after only 30 seconds.

According to the associated PR, the goal was to avoid a deadlock in which node-driver-registrar failed permanently to connect to a CSI plugin because it was referencing an old file descriptor.

In this analysis, I described a situation in which this new behavior caused a permanent outage of the Longhorn CSI plugin. Details are there, but essentially:

IMO, livenessprobe's previous behavior was correct. It should not crash unless it is misconfigured, so that it is always available to answer kubelet's liveness probes.

Assuming the csi-lib-utils change was necessary, my thinking is that we should recognize the timeout error in livenessprobe and ignore it during initialization. However, I'm not sure I understand the exact cause of https://github.com/kubernetes-csi/csi-lib-utils/pull/131. Maybe this could similarly lead to a liveness probe stuck permanently in initialization?
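If that route were taken, the handling could look roughly like the sketch below (illustrative only; the exact error value csi-lib-utils returns on timeout isn't confirmed here, so the loop simply retries Connect until it succeeds):

```go
// Illustrative sketch: keep retrying Connect during initialization instead of
// treating the new 30-second timeout as fatal.
var csiConn *grpc.ClientConn
for {
	var err error
	csiConn, err = connlib.Connect(*csiAddress, metricsManager)
	if err == nil {
		break
	}
	// The CSI plugin may still be coming up; log and retry rather than crashing
	// the probe container, so kubelet's liveness probes can still be answered.
	klog.ErrorS(err, "Still unable to connect to CSI driver, retrying", "address", *csiAddress)
}
```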

cc @ConnorJC3 from the csi-lib-utils PR for any thoughts.

ConnorJC3 commented 7 months ago

Although that change sets a reasonable default for every other sidecar, the livenessprobe is special and, as you say, probably needs to either keep attempting to connect indefinitely or ignore this error.

> Maybe this could similarly lead to a liveness probe stuck permanently in initialization?

It could, but I don't think that's as big of a deal. In the node-driver-registrar case, this bug leads to the driver never getting registered on the node - effectively permanently bricking the driver until manual intervention. In the livenessprobe case, the pod will eventually fail the check and be restarted, so the worst case scenario is a temporary delay of driver startup in a very rare race condition.

jsafrane commented 7 months ago

To me it seems the liveness probe container should never crash when it can't talk to the CSI driver; it should just report failed probes. Would it be possible to achieve this with the current connection lib? It would even make sense to use grpc directly if our lib is not flexible enough.
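One way to read that suggestion: move the connection attempt into the HTTP probe handler itself and translate any failure into a failed probe instead of a process exit. A rough sketch using grpc directly and the CSI Identity `Probe` RPC (the handler name, socket path, and port here are just placeholders):

```go
package main

import (
	"context"
	"net/http"
	"time"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// probeHandler answers kubelet's HTTP liveness probe by calling the CSI driver's
// Probe RPC; any failure becomes a 503 response, never a sidecar crash.
func probeHandler(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), 10*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, "unix:///csi/csi.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		http.Error(w, "cannot reach CSI driver socket", http.StatusServiceUnavailable)
		return
	}
	defer conn.Close()

	if _, err := csi.NewIdentityClient(conn).Probe(ctx, &csi.ProbeRequest{}); err != nil {
		http.Error(w, "CSI driver probe failed", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/healthz", probeHandler)
	_ = http.ListenAndServe(":9808", nil)
}
```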

cc @msau42

mauriciopoppe commented 7 months ago

What about the caller sending the option grpc.WithTimeout(time.Second * 30)? livenessprobe wouldn't send this option; all the other sidecars would. It looks like the timeout is also set in ConnectWithoutMetrics.
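In other words, the timeout would become opt-in at each call site, roughly like this (sketch only; written against the `connlib.WithTimeout` option referenced later in this thread, since the exact mechanism wasn't settled here):

```go
// In livenessprobe's main.go: no timeout option, so under this proposal Connect
// would keep retrying indefinitely, as it did before the csi-lib-utils change.
csiConn, err := connlib.Connect(*csiAddress, metricsManager)

// In every other sidecar (e.g. node-driver-registrar): opt in to the timeout.
csiConn, err := connlib.Connect(*csiAddress, metricsManager,
	connlib.WithTimeout(30*time.Second))
```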

jsafrane commented 6 months ago

There is connlib.Connect(..., WithTimeout(0)), which disables the default 30-second timeout. I am testing it in https://github.com/kubernetes-csi/livenessprobe/issues/236.
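For livenessprobe that presumably boils down to a one-line change at the call site shown earlier, roughly:

```go
// Explicitly disable the connection timeout so livenessprobe retries forever,
// restoring the pre-change startup behavior.
csiConn, err := connlib.Connect(*csiAddress, metricsManager, connlib.WithTimeout(0))
```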

ejweber commented 6 months ago

I missed the possibility of setting it to 0 when I opened https://github.com/kubernetes-csi/livenessprobe/pull/237, so I set it to 5 minutes there instead. Your approach makes more logical sense.

jsafrane commented 6 months ago

@ejweber hey, sorry, I missed your PR... it was almost right!