Enhance NFS Mount Efficiency with Stage/Unstage Volume Capability

woehrl01 commented 6 months ago

Is your feature request related to a problem?/Why is this needed

Describe the solution you'd like in detail

I would like to propose an enhancement that focuses on optimizing NFS mount operations. This feature aims to improve resource utilization and reduce startup times for pods accessing NFS servers. A similar mounting behaviour exists on the ebs csidriver or juicefs csidriver.

The core idea is to introduce an option that leverages the stage and unstage volume capabilities of the CSI driver. The proposed changes include:

Single NFS Server Mount: Mount the NFS server only once for each unique combination of server name, export name, and mount options.
Bind Mounts for Pods: Implement actual bind mounts for each pod accessing the NFS server. This approach should also support subpaths for each pod.
Mount Management: Ensure that the mount operation occurs once per unique combination mentioned above, preventing redundant mounts (or simply the volumeid of the pv).

This enhancement brings several key benefits:

Reduced Mount Operations: By mounting the NFS server less frequently, we can significantly reduce the number of mount operations that the NFS server has to handle.
Improved Cache Utilization: With fewer mounts, cache usage becomes more efficient, enhancing overall system performance.
Faster Startup Times for Pods: Pods accessing the NFS server will experience quicker startup times, leading to more efficient deployments and scaling operations.

Describe alternatives you've considered

An alternative could be using a daemonset which mounts the nfs servers to the host, which then are bind mounted via hostpath into the pod. The problem is here that it hides the fact in the pod that a nfs is used and could be less reliable.

Additional context

andyzhangx commented 6 months ago

@woehrl01 thanks for raising this issue. I agree that add NodeStageVolume support would reduce the nfs mount since it's per pv mount per node, while it would raise the other issue, e.g. NodeStageVolume does not respect fsGroupChangePolicy (SecurityConext support), NodePublishVolume does, you could find more details here: https://github.com/kubernetes-sigs/azurefile-csi-driver/issues/1224#issuecomment-1517861487

There is a performance and k8s compliance tradeoff between whether supports NodeStageVolume or not, I am not sure what's the right way for such requirement.

cc @jsafrane @gnufied any ideas whether we need to implement NodeStageVolume or not?

woehrl01 commented 6 months ago

@andyzhangx thank you for mentioning this problem, I wasn't aware of this discussions, yet.

I'm curious if this actually is an issue in that case. If the stage volume is only creating the initial mount for the export root of the nfs server. The publish volume step can still set the fsgroup on the actual (sub) mount point which is created by the bind mount.

As I'm not an expert in fsgroup, what am I missing in that case?

andyzhangx commented 6 months ago

@andyzhangx thank you for mentioning this problem, I wasn't aware of this discussions, yet.

I'm curious if this actually is an issue in that case. If the stage volume is only creating the initial mount for the export root of the nfs server. The publish volume step can still set the fsgroup on the actual (sub) mount point which is created by the bind mount.

As I'm not an expert in fsgroup, what am I missing in that case?

@woehrl01 support you have a nfs mount with gid=x, and then set gid=y in bind mount path, then the original nfs mount would also have gid=y

woehrl01 commented 6 months ago

@andyzhangx I see, thank you. That's an interesting behaviour I wasn't aware about.

I found https://bindfs.org/ which could be a possible solution for that bind mount behaviour.

It still would be great to have this option as a featureflag, if this behaviour of fsgroup is documented.

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

woehrl01 commented 3 months ago

/remove-lifecycle stale

k8s-triage-robot commented 2 weeks ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

kubernetes-csi / csi-driver-nfs

Enhance NFS Mount Efficiency with Stage/Unstage Volume Capability #573