Closed ejweber closed 6 months ago
Although that change sets a reasonable default for every other sidecar, the livenessprobe is special and as you say probably needs to either actually attempt to connect infinitely or ignore this error.
Maybe this could similarly lead to a liveness probe stuck permanently in initialization?
It could, but I don't think that's as big of a deal. In the node-driver-registrar case, this bug leads to the driver never getting registered on the node - effectively permanently bricking the driver until manual intervention. In the livenessprobe case, the pod will eventually fail the check and be restarted, so the worst case scenario is a temporary delay of driver startup in a very rare race condition.
To me it seems the liveness probe container should never crash when it can't talk to the container, it should just report failed probes. Would it be possible to achieve with the current connection lib? It would make sense even to use grpc directly, if our lib is not flexible enough.
cc @msau42
What about the caller sending the option grpc.WithTimeout(time.Second * 30)
? livenessprobe wouldn't sent this option, all the other sidecars would send it. Looks like the timeout is also set in ConnectWithoutMetrics
There is connlib.Connect(..., WithTimeout(0))
that disables the default 30 second timeout. I am testing it in https://github.com/kubernetes-csi/livenessprobe/issues/236.
I missed the possibility of setting it to 0 in https://github.com/kubernetes-csi/livenessprobe/pull/237, so I set it to 5 minutes. Your approach makes more logical sense.
@ejweber hey, sorry, I missed your PR... it was almost right!
The livenessprobe code expects to try forever to connect with the CSI plugin via
csi.sock
on startup.https://github.com/kubernetes-csi/livenessprobe/blob/33ea6c01f303d03e7e493e8e4622c08ce11722b8/cmd/livenessprobe/main.go#L142-L147
However, this commit recently picked up a change in csi-lib-utils that returns an error after only 30 seconds.
According to the associated PR, the goal was to avoid a deadlock in which
node-driver-registrar
failed permanently to connect to a CSI plugin because it was referencing an old file descriptor.In this analysis, I described a situation in which this new behavior caused a permanent outage of the Longhorn CSI plugin. Details are there, but essentially:
IMO, livenessprobe's previous behavior was correct. It should not crash unless it is misconfigured so it is always available to answer kubelet's liveness probes.
Assuming the csi-lib-utils change was necessary, my thinking is that we should recognize the timeout error in livenessprobe and ignore it during initialization. However, I'm not I understand the exact cause of https://github.com/kubernetes-csi/csi-lib-utils/pull/131. Maybe this could similarly lead to a liveness probe stuck permanently in initialization?
cc @ConnorJC3 from the csi-lib-utils PR for any thoughts.