kubernetes-csi / node-driver-registrar

Sidecar container that registers a CSI driver with the kubelet using the kubelet plugin registration mechanism.
Apache License 2.0

Node Driver Registrar restart at GetDriverName call #405

Open Anto74 opened 1 month ago

Anto74 commented 1 month ago

Hi there,

The node-driver-registrar integrated in my CSI driver restarted a number of times during install, upgrade, and node server pod restarts.

Logs show the following: E0427 17:58:20.335386 8 main.go:170] error retreiving CSI driver name: rpc error: code = DeadlineExceeded desc = context deadline exceeded

I use the default timeout of one second (--timeout parameter).

My customer's requirement is to limit restarts as much as possible, with the aim of having no restarts at all for any container. So I tried two seconds and managed to avoid restarts. Still, it is not possible to be sure that a two-second timeout is always enough.

Is there any reason why the node-driver-registrar container calls os.Exit(1) (causing a restart) on the first timeout without retrying? Would you consider introducing a configurable number of retries?

Thanks in advance and best regards, Antonio

jsafrane commented 1 month ago

I am not sure how useful a configurable number of attempts would be. It's only slightly better than setting the timeout to (number of attempts * length of a single attempt). Anyway, I won't block such a PR if you submit one.

> My customer's requirement is to limit restarts as much as possible, with the aim of having no restarts at all for any container.

> Still, it is not possible to be sure that a two-second timeout is always enough.

The timeout should be the longest time that the CSI driver takes to initialize and respond to Probe() or NodeGetInfo(). Of course, initialization can sometimes take longer, but again, an occasional restart is not that bad.

Anto74 commented 1 month ago

Hi @jsafrane, first of all, thank you so much for your very fast feedback.

I agree with you: occasional restarts are not so bad, but my CSI driver is a peculiar one. It serves telco-grade applications, and some customers complain about even a single spontaneous container restart. Moreover, it is integrated with the user application, and the CSI driver initialization time is not perfectly predictable.

I'll go for a pull request as soon as I can.

Best regards, Antonio