seaweedfs / seaweedfs-csi-driver

SeaweedFS CSI Driver https://github.com/seaweedfs/seaweedfs
Apache License 2.0

chore(seaweedfs-csi-driver): delete unnecessary stage #151

Closed duanhongyi closed 8 months ago

duanhongyi commented 8 months ago

NodeStageVolume is an additional step for block devices: it is used to partition and format the disk and to mount it at a node-global directory. The weed mount uses FUSE, similar to NFS, not a block device, so NodeStageVolume is not needed.
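
For illustration, here is a minimal sketch of what the simplified flow could look like: the FUSE mount is done directly in NodePublishVolume at the pod's target path, with no staging step. The CSI Go bindings are real, but the weed mount flags, the filer address, and the path layout are assumptions for illustration, not the driver's actual code.

```go
package driver

import (
	"context"
	"fmt"
	"os"
	"os/exec"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

type nodeServer struct {
	filer string // filer address, e.g. "seaweedfs-filer:8888" (assumed)
}

// NodePublishVolume mounts the FUSE filesystem straight at the pod's target
// path; with no block device to partition or format, NodeStageVolume can be
// skipped entirely.
func (ns *nodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error) {
	target := req.GetTargetPath()
	if err := os.MkdirAll(target, 0o750); err != nil {
		return nil, err
	}
	// One weed mount process per published volume; flags are illustrative.
	cmd := exec.Command("weed", "mount",
		"-filer="+ns.filer,
		"-dir="+target,
		"-filer.path=/buckets/"+req.GetVolumeId(),
	)
	if err := cmd.Start(); err != nil {
		return nil, fmt.Errorf("starting weed mount: %w", err)
	}
	// A real driver would wait here until target actually becomes a mountpoint.
	return &csi.NodePublishVolumeResponse{}, nil
}
```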

At the same time, this change also prepares for future use of requiresRepublish, which would provide a mechanism to restore the pod's mount after a nodeserver error, without the need to restart the pod.

Ceph NFS does not use NodeStageVolume (not a block device): https://github.com/ceph/ceph-csi/blob/devel/internal/nfs/nodeserver/nodeserver.go#L142

CephFS uses NodeStageVolume (block device): https://github.com/ceph/ceph-csi/blob/devel/internal/cephfs/nodeserver.go#L158

chrislusf commented 8 months ago

Thanks and good to learn!

kvaster commented 7 months ago

And what about caches? I.e., what if a volume is mounted in several pods on the same node? Before this PR only one process was used for that; now each mount uses its own process.

kvaster commented 7 months ago

Also, requiresRepublish is not a mechanism for restoring the pod's mount. See https://kubernetes-csi.github.io/docs/csi-driver-object.html, especially: "CSI drivers should only atomically update the contents of the volume. Mount point change will not be seen by a running container."

The problem is that we really need to propagate a new mount point into the running container, and currently there is no mechanism for that.
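
For reference, requiresRepublish is only a field on the CSIDriver object that asks the kubelet to call NodePublishVolume periodically; it does not propagate a changed mount point into already-running containers. Below is a minimal sketch of how the flag is declared, using the real k8s.io/api/storage/v1 types; the driver name shown is only an example.

```go
package main

import (
	"fmt"

	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func boolPtr(b bool) *bool { return &b }

func main() {
	driver := storagev1.CSIDriver{
		ObjectMeta: metav1.ObjectMeta{Name: "seaweedfs-csi-driver"}, // example name
		Spec: storagev1.CSIDriverSpec{
			AttachRequired:    boolPtr(false),
			RequiresRepublish: boolPtr(true), // triggers periodic NodePublishVolume calls only
		},
	}
	fmt.Printf("%+v\n", driver.Spec)
}
```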

kvaster commented 7 months ago

@duanhongyi, it's definitely wrong to assume that NodeStageVolume is used only for block devices.

NodeStageVolume – Mounts a persistent volume to a global path on a worker node.
NodePublishVolume – Mounts a persistent volume from the global path on a worker node to a pod.

From here: https://arslan.io/2018/06/21/how-to-write-a-container-storage-interface-csi-plugin/

NodeStageVolume: This method is called by the CO to temporarily mount the volume to a staging path. Usually this staging path is a global directory on the node. In Kubernetes, after it's mounted to the global directory, you mount it into the pod directory (via NodePublishVolume). The reason that mounting is a two step operation is because Kubernetes allows you to use a single volume by multiple pods. This is allowed when the storage system supports it (say NFS) or if all pods run on the same node. One thing to note is that you also need to format the volume if it's not formatted already. Keep that in mind.

NodePublishVolume: This method is called to mount the volume from staging to target path. Usually what you do here is a bind mount. A bind mount allows you to mount a path to a different path (instead of mounting a device to a path). In Kubernetes, this allows us for example to use the mounted volume from the staging path (i.e global directory) to the target path (pod directory). Here, formatting is not needed because we already did it in NodeStageVolume.

As a result, NodeStageVolume is a wonderful mechanism for ReadWriteMany volumes of any type: it allows a single mount to be shared many times on the same node.
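
To make the two-phase pattern concrete, here is a hedged sketch (not the driver's actual code) of how staging and publishing can be split: one FUSE mount per volume per node in NodeStageVolume, and a cheap bind mount per pod in NodePublishVolume. The CSI bindings and unix.Mount are real APIs; the weed mount flags and filer wiring are assumptions.

```go
package driver

import (
	"context"
	"fmt"
	"os"
	"os/exec"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"golang.org/x/sys/unix"
)

type nodeServer struct {
	filer string // filer address, assumed to come from driver configuration
}

// NodeStageVolume: start a single weed mount process per volume per node,
// mounted at the kubelet-provided staging path.
func (ns *nodeServer) NodeStageVolume(ctx context.Context, req *csi.NodeStageVolumeRequest) (*csi.NodeStageVolumeResponse, error) {
	staging := req.GetStagingTargetPath()
	if err := os.MkdirAll(staging, 0o750); err != nil {
		return nil, err
	}
	cmd := exec.Command("weed", "mount", "-filer="+ns.filer, "-dir="+staging) // flags illustrative
	if err := cmd.Start(); err != nil {
		return nil, fmt.Errorf("starting weed mount: %w", err)
	}
	// A real driver would wait here until staging becomes a live mountpoint.
	return &csi.NodeStageVolumeResponse{}, nil
}

// NodePublishVolume: bind-mount the shared staging directory into each pod's
// target path, so many pods on the node reuse one FUSE process and its cache.
func (ns *nodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error) {
	staging, target := req.GetStagingTargetPath(), req.GetTargetPath()
	if err := os.MkdirAll(target, 0o750); err != nil {
		return nil, err
	}
	if err := unix.Mount(staging, target, "", unix.MS_BIND, ""); err != nil {
		return nil, fmt.Errorf("bind mount %s -> %s: %w", staging, target, err)
	}
	return &csi.NodePublishVolumeResponse{}, nil
}
```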

chrislusf commented 7 months ago

@kvaster, it seems @duanhongyi is in a different time zone. Shall we revert this PR?

chrislusf commented 7 months ago

https://github.com/seaweedfs/seaweedfs-csi-driver/pull/153 was merged after this PR.

To revert this PR, https://github.com/seaweedfs/seaweedfs-csi-driver/pull/153 also needs to be reverted.

kvaster commented 7 months ago

@chrislusf, my proposal is to wait for @duanhongyi's answer. My personal position is to respect the opinions and work of others, and it would be wonderful if we could all agree on the way to go.

kvaster commented 7 months ago

@duanhongyi, any comments? If there are no comments, my proposal is to revert these two PRs today.

duanhongyi commented 7 months ago

Sorry, I've been quite busy lately and don't have much time for discussion. The original intention of this PR was to simplify the CSI mounting process. In my opinion, sharing a mount point among multiple pods on one node is unnecessary and only increases the complexity of the code. Perhaps, as you said, cache sharing is possible, but I don't think it is worth the added complexity.

@kvaster @chrislusf

kvaster commented 7 months ago

@duanhongyi, thanks for your answer! This is all about resources, not only the cache. For us it saves a lot of resources (including network) when the mount process is shared across many small pods on a node. I don't think this brings too much complexity; it brings flexibility and options without any real downside (except slightly more complex code). Also, this change will not solve any of the problems mentioned, i.e. remounting the volume after an upgrade or a process crash.

duanhongyi commented 7 months ago

Okay, you've convinced me. The last two PRs can be reverted.

chrislusf commented 7 months ago

OK, reverted the last two PRs.