kubernetes-retired / external-storage

[EOL] External storage plugins, provisioners, and helper libraries
Apache License 2.0

Investigate CSI driver for local volume #824

Closed msau42 closed 4 years ago

msau42 commented 6 years ago
  1. Controller service would need to be run as a DaemonSet, instead of as a single Deployment. It may need to run a custom external provisioner sidecar that filters on node. Or maybe it can be generically extended to filter on a plugin's reported topology.
  2. There is a proposal to allow CSI volumes to be specified inline instead of only through PV. I don't think it will be able to support topology because we implemented topology at the PV layer. We may want to recommend (require?) that this CSI driver is disabled as an inline volume source, via PSP. Both for security and functionality.
  3. The external static provisioner does not follow the CSI lifecycle at all (except for Delete). We could potentially deploy it as a Kubernetes-only addon to the CSI driver.
  4. Currently the external-attacher sidecar is required even if the plugin doesn't support attach. We should try to reduce the amount of resources it takes up in the meantime.
  5. CSI has a way to report capacity per topology, but it is not plumbed through on the Kubernetes side. That needs to be designed. cc @verult

cc @lichuqiang

msau42 commented 6 years ago

/area local-volume

lichuqiang commented 6 years ago

It may need to run a custom external provisioner sidecar that filters on node. Or maybe it can be generically extended to filter on a plugin's reported topology.

Yeah, we need fast failure for volumes that don't belong to a node. I noticed that in the provisioner controller lib, we already have a Qualifier interface to tell if a claim should get provisioned. So, is it possible to add a related interface in CSI and fulfill it in the CSI external-provisioner? I think it's a quite common demand, not limited to local volumes.
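Roughly the kind of check I have in mind (just a sketch; the annotation key and the ShouldProvision shape are assumptions on my side, not the actual lib API):

```go
package main

import (
	v1 "k8s.io/api/core/v1"
)

// selectedNodeAnnotation is the annotation the scheduling flow uses to record
// the chosen node; treat the exact key as an assumption here.
const selectedNodeAnnotation = "volume.kubernetes.io/selected-node"

// nodeQualifier mirrors the shape of a Qualifier-style hook: it reports
// whether this provisioner instance should handle the claim at all.
type nodeQualifier struct {
	nodeName string // the node this provisioner instance runs on
}

func (q nodeQualifier) ShouldProvision(claim *v1.PersistentVolumeClaim) bool {
	selected, ok := claim.Annotations[selectedNodeAnnotation]
	if !ok {
		// No node selected yet (e.g. delayed binding not done); don't
		// provision prematurely.
		return false
	}
	// Fast fail: only provision claims scheduled to this node.
	return selected == q.nodeName
}
```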

CSI has a way to report capacity per topology, but it is not plumbed through on the Kubernetes side.

Maybe we need a new volume plugin interface for this, just like what we did for setVolumeLimits.

lichuqiang commented 6 years ago

A downside is that we cannot support multiple backends (SSD, HDD, ...) in the same provisioner. We'll have to deploy a set of components (provisioner, attacher, and driver) for each of them.

msau42 commented 6 years ago

For 1), we may be able to use the GetNodeInfo call + a flag to csi-external-provisioner to indicate that we're running the provisioner per topology instance, so CreateVolume should be filtered.

For 5), here's the CSI call. From that we need to figure out how to represent capacity in K8s for non-node objects. We may need a new object for this because the topology keys returned by CSI can be anything.

msau42 commented 6 years ago

A downside is that we cannot support multiple backends (SSD, HDD, ...) in the same provisioner.

Why is this? The provisioner is called based on the provisioner name, not the StorageClass, so a single plugin should be able to handle both SSDs and HDDs.

lichuqiang commented 6 years ago

Why is this? The provisioner is called based on the provisioner name, not the StorageClass, so a single plugin should be able to handle both SSDs and HDDs.

Mainly due to capacity reporting: the capacity of SSDs and HDDs should be reported separately, while currently the CSI GetCapacity interface can only return a single value.

Maybe we can consider wrapping the storage kinds as AccessibleTopology in the requests and calling it multiple times. But it's still unclear how to discover the available values.

msau42 commented 6 years ago

Ugh... That limitation might have been an oversight. Let's see if we can add it to CSI.

lichuqiang commented 6 years ago

For 1), we may be able to use the GetNodeInfo call + a flag to csi-external-provisioner to indicate that we're running the provisioner per topology instance, so CreateVolume should be filtered.

I like the idea; the only concern is that people may get confused about the usage of topology in claims (StorageClasses). Normally the topology tells where a PV can be accessed from, not where it can be provisioned :(

The two are the same only for local volumes, so I guess it's not so common.

msau42 commented 6 years ago

Here is the CSI issue to add storageclass info to GetCapacity: https://github.com/container-storage-interface/spec/issues/55#issuecomment-399153658

For 1), it may be sufficient for the first phase to run a special external-provisioner only for local volumes. If more plugins need such a feature, we can consider a generic method later.

msau42 commented 6 years ago

StorageClass parameters are available in CSI GetCapacity arguments already: https://github.com/container-storage-interface/spec/blob/master/spec.md#getcapacity
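So a driver could, in principle, key its capacity report off a class parameter. A minimal sketch (the "type" parameter key, the vgForType mapping, and the vgFreeBytes helper are illustrative assumptions, not csilvm code):

```go
package main

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// server is a stand-in for the driver's controller service; the fields and
// the "type" parameter key are assumptions for illustration.
type server struct {
	defaultVG string
	vgForType map[string]string // e.g. "ssd" -> "vg-fast", "hdd" -> "vg-archive"
}

// vgFreeBytes is a hypothetical helper that would query lvm2 (e.g. via `vgs`)
// for the free space of a volume group.
func vgFreeBytes(vg string) (int64, error) { return 0, nil }

// GetCapacity reports capacity for the volume group selected by the
// StorageClass parameters carried in the request.
func (s *server) GetCapacity(ctx context.Context, req *csi.GetCapacityRequest) (*csi.GetCapacityResponse, error) {
	vg := s.defaultVG
	if t, ok := req.GetParameters()["type"]; ok {
		vg = s.vgForType[t]
	}
	free, err := vgFreeBytes(vg)
	if err != nil {
		return nil, err
	}
	return &csi.GetCapacityResponse{AvailableCapacity: free}, nil
}
```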

saad-ali commented 6 years ago

@jieyu has started a CSI-compatible LVM local storage driver here: https://github.com/mesosphere/csilvm

He's open to adding a Kubernetes directory for k8s specific deployment configs to that. In the future, we can consider moving that to the CSI org.

lichuqiang commented 6 years ago

I looked into the LVM driver @saad-ali mentioned above; it's overall a great job, but it also has several points that don't satisfy us from the Kubernetes perspective currently:

  1. The driver takes a volume group name as input and doesn't support multiple VGs in the same driver. But for local volumes, we expect to support multiple storage backends (HDD, SSD, ... that exist as different VGs) in the same driver.

  2. In GetCapacity, it returns the free capacity of the group, not the total capacity, which is different from our expectation. Wondering if we reached a consensus on this when defining the CSI interface?

  3. Most important, the driver does not support NodeStageVolume; it takes a block volume as input by default and mounts it to a file path in NodePublishVolume. But for local volumes, we need to support both block devices and file paths as input, manage the global mount in NodeStageVolume, and bind mount in NodePublishVolume.

lichuqiang commented 6 years ago

To support a file path as input, I can think of two ways:

  1. The first is to bind mount the path to the global path in NodeStageVolume (see the sketch after this list). The whole process would then be smoother: only NodeStageVolume needs to handle the two kinds of source specially, and the other interfaces stay exactly the same. The downside is that we'll have an extra bind mount.
  2. Do nothing in NodeStageVolume; in NodePublishVolume, ignore the deviceMountPath passed in and take the volume source as the source path. This way we avoid the extra bind mount, but the logic in the interfaces can get messy: we'll need extra special-casing for this scenario in almost all of the node interfaces.
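A rough sketch of approach 1 (helper names are illustrative, not csilvm code):

```go
package main

import (
	"os"

	"golang.org/x/sys/unix"
)

// stageVolume sketches approach 1: whether the source is a block device or an
// existing directory, NodeStageVolume always ends with the volume available
// at stagingPath, so the later calls don't care which kind it was.
func stageVolume(source, stagingPath string, isBlock bool) error {
	if err := os.MkdirAll(stagingPath, 0750); err != nil {
		return err
	}
	if isBlock {
		// Existing block handling: format (if needed) and mount the device
		// at stagingPath. Elided here.
		return mountBlockDevice(source, stagingPath)
	}
	// File-path source: the extra bind mount mentioned above.
	return unix.Mount(source, stagingPath, "", unix.MS_BIND, "")
}

// publishVolume is identical for both source kinds: bind mount the staging
// path into the pod's target path.
func publishVolume(stagingPath, targetPath string) error {
	if err := os.MkdirAll(targetPath, 0750); err != nil {
		return err
	}
	return unix.Mount(stagingPath, targetPath, "", unix.MS_BIND, "")
}

// mountBlockDevice is a placeholder for the existing block-device handling.
func mountBlockDevice(device, target string) error { return nil }
```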

Personally I'm more inclined to the first approach. WDYT?

msau42 commented 6 years ago

@jieyu, would you be open to us making changes to the lvm driver to address the 3 requirements listed above?

jieyu commented 6 years ago

cc @gpaul, @jdef

We're open to those suggestions.

For 2), that's what's specified in the CSI spec. For 3), can you guys elaborate a bit? We're definitely open to using NodeStageVolume if that makes the plugin logic simpler. For 1), I'm open to it. One question: have you guys thought about using one instance of the CSI plugin per volume group?

lichuqiang commented 6 years ago

For 2), that's what's specified in the CSI spec.

That's what I care about most. If the field is designed to report the free capacity left, then it's not in line with our expectations. @msau42, maybe we can push the capacity response definition forward to include both total and free capacity?

For 3), in k8s we have the case where multiple pods need the same volume. Thus, we need to mount the source path to a global path in NodeStageVolume, and bind mount the global path to container paths in NodePublishVolume.

For 1), in our production environment we have already met the scenario of needing multiple storage backends (SSD, HDD, ...) on the same node. Having multiple drivers could work, but the operations and maintenance folks are not happy that they'd have to deploy and maintain multiple drivers.

gpaul commented 6 years ago

Definitely open to suggestions, yes.

For 1), in our production environment we have already met the scenario of needing multiple storage backends (SSD, HDD, ...) on the same node. Having multiple drivers could work, but the operations and maintenance folks are not happy that they'd have to deploy and maintain multiple drivers.

By "driver" do you mean "single instance of a plugin"? So multiple instances of the csilvm plugin, each powering a different volume group, would mean "multiple drivers" to you? In our case, we install the csilvm plugin only once but then launch it multiple times, once for every volume group.

We support HDDs or SSDs by listing the individual devices that comprise the volume group as command-line options when launching the plugin instance. For two volume groups, archive (consisting of HDDs) and fast (consisting of SSDs), we would launch two instances of the csilvm binary, one for each volume group.

We've found having a single VG per plugin instance to be a wonderful simplification to the code. It also makes error reporting in the CO simple as the CO can trivially tie any errors it receives to a specific volume group.

For 3), in k8s we have the case where multiple pods need the same volume. Thus, we need to mount the source path to a global path in NodeStageVolume, and bind mount the global path to container paths in NodePublishVolume.

The pods use case seems very reasonable and I'd be very happy to have your help with supporting it.

lichuqiang commented 6 years ago

Hmm, do you have an overall implementation besides the driver, i.e. provisioner, attacher, etc.?

For the k8s implementation, I think if we have each driver handle a single group, we'll have to deploy a provisioner and attacher for each of them accordingly.

And for failure recovery, the drivers will probably be deployed as pods (containers), rather than simply installed and launched. So I think we'll have multiple instances.

Personally, I'd prefer the concept of "driver" to represent a higher level of distinction: Cinder, Ceph, or LVM, but not the individual groups behind LVM.

gpaul commented 6 years ago

Hmm, do you have an overall implementation besides the driver, i.e. provisioner, attacher, etc.?

In Mesos each instance of a CSI plugin corresponds to a Storage Local Resource Provider (SLRP) which takes care of managing that plugin instance's lifecycle and communicates with the plugin according to the CSI spec. I believe the SLRP corresponds to both provisioner/attacher, but I'm not very familiar with k8s.

For the k8s implementation, I think if we have each driver handle a single group, we'll have to deploy a provisioner and attacher for each of them accordingly.

That sounds likely, given that today we launch a SLRP per volume group (per CSI plugin instance, technically.)

And for failure recovery, the drivers will probably be deployed as pods (containers), rather than simply installed and launched. So I think we'll have multiple instances.

Makes sense.

Personally, I'd prefer the concept of "driver" to represent a higher level of distinction: Cinder, Ceph, or LVM, but not the individual groups behind LVM.

Agreed. It sounds like we're on the same page regarding terminology.

@jieyu given your greater familiarity with k8s and Mesos, what do you think?

jieyu commented 6 years ago

@gpaul @lichuqiang i got a bit lost here. What's the concrete proposed changes regarding csilvm repo?

lichuqiang commented 6 years ago

The 3 points above; the discussion mainly focuses on the 1st point, i.e. whether to support multiple volume groups in one driver.

gpaul commented 6 years ago

@lichuqiang

re: (2.), if I understand correctly, you agree that the plugin implements (2.) correctly? (From https://github.com/kubernetes-incubator/external-storage/issues/824#issuecomment-402008443)

re: (3.), that sounds very cool. I'd be happy for any help on this.

re: (1.), I'm not yet convinced that this is necessary and it adds significant complexity to the plugin. I'm currently opposed to making the default csilvm plugin instance drive multiple volume groups.

That said, if we can keep the logic sufficiently well-separated while maintaining the very simple single-vg-per-process mode I think it will be OK.

In that event, I'd like to generate two binaries: a simple one that manages a single volume group the way we do currently, easy to debug and reason about, and a new one that adds a multiplexing layer on top of the core functionality exposed by the other.

This does not mean the second will need to literally exec the first, as multiple processes per container isn't ideal. Instead, I envision the new command will be a superset of the simpler one, with a multiplexing layer bolted on top.
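To make the idea concrete, a very rough sketch of what I mean by the multiplexing layer (the "volumeGroup" parameter key and all names are illustrative, not a committed design):

```go
package main

import (
	"context"
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// vgServer stands in for today's single-VG controller logic.
type vgServer struct{ vg string }

func (s *vgServer) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error) {
	// The existing single-volume-group logic would live here unchanged.
	return &csi.CreateVolumeResponse{}, nil
}

// muxServer is the hypothetical multiplexing layer: it owns one vgServer per
// volume group and routes each request by a StorageClass parameter.
type muxServer struct {
	backends map[string]*vgServer // volume group name -> backend
}

func (m *muxServer) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error) {
	vg := req.GetParameters()["volumeGroup"]
	backend, ok := m.backends[vg]
	if !ok {
		return nil, fmt.Errorf("unknown volume group %q", vg)
	}
	return backend.CreateVolume(ctx, req)
}
```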

lichuqiang commented 6 years ago

re: (1.), I'm not yet convinced that this is necessary and it adds significant complexity to the plugin. I'm currently opposed to making the default csilvm plugin instance drive multiple volume groups.

Taking it as a general-purpose CSI driver, I can understand your concern. But at least for the k8s case, I can see several reasons why we want the change (letting the driver handle multiple groups). Besides the deployment convenience it brings, here is another issue about the provisioner:

In a cluster with multiple types of storage (CSI and other storage not implemented via CSI), the provisioner needs to be able to tell whether it's responsible for an incoming volume creation request. And if we have each driver handle a single group, that means the provisioner needs to be aware of the driver info in the request. Unfortunately, the provisioner currently makes use of a shared controller lib; it filters requests according to plugin names (like CSI, GCE, AWS, ...) and isn't aware of CSI-specific things.

Also, I really don't like the idea of having each provisioner take care of a single driver. In k8s, the provisioner needs to watch for data changes; each provisioner means an extra connection to the apiserver, which may affect performance at large cluster scale.

gpaul commented 6 years ago

Those are good points, thanks for explaining. In that case I'm leaning towards adding this according to the preliminary design I outlined in my comment:

That said, if we can keep the logic sufficiently well-separated while maintaining the very simple single-vg-per-process mode I think it will be OK.

In that event, I'd like to generate two binaries: a simple one that manages a single volume group the way we do currently, easy to debug and reason about, and a new one that adds a multiplexing layer on top of the core functionality exposed by the other.

This does not mean the second will need to literally exec the first, as multiple processes per container isn't ideal. Instead, I envision the new command will be a superset of the simpler one, with a multiplexing layer bolted on top.

Is this work something you would consider contributing? I would love to have good k8s support in the CSILVM plugin.

lichuqiang commented 6 years ago

This does not mean the second will need to literally exec the first, as multiple processes per container isn't ideal. Instead, I envision the new command will be a superset of the simpler one, with a multiplexing layer bolted on top.

Sounds good to me. So I think we don't actually need two separate binaries; just let it decide whether to load the multiplexing layer according to flags at startup. Is that what you mean?

msau42 commented 6 years ago

One more detail to iron out. IIUC, the lvmcsi driver accepts a list of devices and creates the vg for them. Can it handle situations like:

  • disks are dynamically added/removed from the node
  • different nodes have different numbers of disks at different paths

msau42 commented 6 years ago

Also regarding 2), I think we need to have two capacities reported in CSI, currently available and total (including used). In general, Kubernetes does not use currently available capacity because it adds significant scheduling latency to have to wait for the available capacity to be updated.

jieyu commented 6 years ago

xref some relevant discussion regarding total vs. available capacity from GetCapacity call. https://github.com/container-storage-interface/spec/pull/85#issuecomment-323154817

msau42 commented 6 years ago

The problem I see with only reporting available capacity, is that you will need to call GetCapacity before every CreateVolume request, and you won't have accurate available capacity reported if you have multiple CreateVolume requests outstanding in parallel. So I think we need to add back a total_capacity field so that COs have the option to be smarter with capacity tracking.
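To illustrate what a CO could do with a total figure, here's a hypothetical bit of CO-side bookkeeping (not existing Kubernetes code): with a stable total capacity, the CO can account for its own in-flight CreateVolume calls instead of re-querying GetCapacity every time.

```go
package main

import "sync"

// capacityTracker sketches CO-side bookkeeping that is only possible if the
// plugin reports a stable total capacity: the CO subtracts what it has
// already handed out (including in-flight CreateVolume calls) instead of
// calling GetCapacity before every request.
type capacityTracker struct {
	mu       sync.Mutex
	total    int64 // reported by the plugin (the proposed total_capacity)
	reserved int64 // bytes committed by provisioned or in-flight volumes
}

// tryReserve returns true if the request fits, atomically accounting for
// parallel CreateVolume requests.
func (t *capacityTracker) tryReserve(bytes int64) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.reserved+bytes > t.total {
		return false
	}
	t.reserved += bytes
	return true
}

// release undoes a reservation if CreateVolume ultimately fails.
func (t *capacityTracker) release(bytes int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.reserved -= bytes
}
```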

gpaul commented 6 years ago

IIUC, the lvmcsi driver accepts a list of devices and creates the vg for them. Can it handle situations like:

  • disks are dynamically added/removed from the node

The csilvm plugin is a simple layer on top of lvm2's command-line utilities. If the disks are present when the volume group is created and lvm2 can see their associated devices, then there shouldn't be a problem.

Similarly, disks that break or devices that disappear from an lvm2 volume group should show the same failure behaviour that would normally be expected from an lvm2 volume group, and the broken volume group will manifest as error responses to RPCs performed by the CO.

I've done very little testing around these kinds of failures, and the kinds of lvm2 volume group parameters through which one might configure lvm2 RAID or other protection are still future work. For now it is assumed that data protection is configured at the application level (database replication), the hardware level (RAID controllers), or even software RAID (mdraid), so that the individual devices passed to the plugin already have underlying redundancy.

  • different nodes have different numbers of disks at different paths

As far as csilvm is concerned, every volume group is constructed separately from an arbitrary number of devices, so as long as the devices are present under /dev I think there shouldn't be a problem.
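For reference, constructing a volume group from whatever devices a node happens to have boils down to the standard lvm2 commands; a simplified sketch (the wrapper function is illustrative, error handling and idempotency checks elided):

```go
package main

import (
	"fmt"
	"os/exec"
)

// createVolumeGroup shows, in terms of the standard lvm2 command-line tools,
// what building a VG from an arbitrary per-node device list amounts to.
func createVolumeGroup(vgName string, devices []string) error {
	for _, dev := range devices {
		// Initialize each device as an LVM physical volume.
		if out, err := exec.Command("pvcreate", dev).CombinedOutput(); err != nil {
			return fmt.Errorf("pvcreate %s: %v: %s", dev, err, out)
		}
	}
	// Create the volume group from the initialized devices.
	args := append([]string{vgName}, devices...)
	if out, err := exec.Command("vgcreate", args...).CombinedOutput(); err != nil {
		return fmt.Errorf("vgcreate %s: %v: %s", vgName, err, out)
	}
	return nil
}
```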

lichuqiang commented 6 years ago

The problem I see with only reporting available capacity, is that you will need to call GetCapacity before every CreateVolume request, and you won't have accurate available capacity reported if you have multiple CreateVolume requests outstanding in parallel. So I think we need to add back a total_capacity field so that COs have the option to be smarter with capacity tracking.

+1

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

jdef commented 5 years ago

@lichuqiang are you still interested in this?

msau42 commented 5 years ago

/remove-lifecycle stale

We are still interested in supporting this in Kubernetes. cc @cofyc who is doing some prototyping

cofyc commented 5 years ago

/assign

matthias50 commented 5 years ago

@cofyc I am also interested in doing a prototype deployment on k8s. Do you have anything yet which could be used to deploy csilvm on a non-production cluster to play around with?

matthias50 commented 5 years ago

@cofyc @jdef @lichuqiang I have been doing some experimenting with integrating csilvm with Kubernetes. I have found what may be a fairly significant issue. The behavior I am seeing is that the actual Linux LV is created on a random node in the K8s cluster which is running csilvm, not on the actual Node where the Pod gets scheduled.

The good news is that if I manually steer the Pod to the node with the storage, it works, so some basics seem to be ok. I think what is going on is that CreateVolume is a Controller gRPC call, not a Node call. For something like an EBS volume, this is fine, because any node can create an EBS volume and then a different node can mount it later. For Linux LVs, this is obviously a no-go.

I think if the Linux LV creation is moved from CreateVolume/DeleteVolume to NodeStageVolume/NodeUnstageVolume, this issue will be resolved. Does this make sense to you?
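Roughly what I'm picturing NodeStageVolume doing (a simplified sketch using the standard lvm2/util-linux commands; sizing, naming, and error handling are illustrative):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// stageLV sketches the proposal: create and format the logical volume on the
// node that will actually publish it, inside NodeStageVolume, instead of in
// the controller's CreateVolume.
func stageLV(vgName, lvName string, sizeBytes int64, stagingPath string) error {
	// Create the logical volume on this node's volume group.
	size := fmt.Sprintf("%db", sizeBytes)
	if out, err := exec.Command("lvcreate", "-n", lvName, "-L", size, vgName).CombinedOutput(); err != nil {
		return fmt.Errorf("lvcreate: %v: %s", err, out)
	}
	dev := fmt.Sprintf("/dev/%s/%s", vgName, lvName)
	// Format it (ext4 purely as an example).
	if out, err := exec.Command("mkfs.ext4", dev).CombinedOutput(); err != nil {
		return fmt.Errorf("mkfs: %v: %s", err, out)
	}
	// Mount it at the global staging path for later bind mounts.
	if err := os.MkdirAll(stagingPath, 0750); err != nil {
		return err
	}
	if out, err := exec.Command("mount", dev, stagingPath).CombinedOutput(); err != nil {
		return fmt.Errorf("mount: %v: %s", err, out)
	}
	return nil
}
```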

One concern I have, which I honestly don't see a way around, is that the GetCapacity call is at the Controller level and there doesn't seem to be any corresponding call at the Node level. Given this setup, I'm concerned that a Pod could get scheduled to a Node and then Kubernetes will be unable to provision the local volume it wants, because it made the scheduling decision without regard to the amount of local storage left on that node.

msau42 commented 5 years ago

For supporting local volumes, I think the controller services will have to be running in the DaemonSet too. And the csi-provisioner probably can't be used as-is. It would have to be modified to only trigger provisioning if the selected node matches the node it's running on.

But I think the main blocking issue, as you pointed out, is supporting the GetCapacity call. We don't have a way to represent capacity for general topology in Kubernetes, so that has to be designed. In addition, the current semantics of the CSI GetCapacity call is not architecturally compatible with how resources/capacity are surfaced and used by the Kubernetes scheduler. Some more discussion here. I unfortunately haven't had time to think deeply about this and how other non-local storage systems could be supported, and would greatly appreciate if someone could do some deeper investigation here.

matthias50 commented 5 years ago

As a bit of background, the use case I am interested in getting working is an ephemeral local volume backed by LVM on a Kubernetes Node. See https://kubernetes-csi.github.io/docs/ephemeral-local-volumes.html. If we configure the storage class using the lvm provisioner to have volumeBindingMode set to WaitForFirstConsumer, then it seems like all that is needed is for NodeStageVolume to create the LV and format the filesystem, and then NodePublishVolume to bind mount from the global location used in NodeStageVolume to the pod directory.
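For context, the kind of StorageClass I mean, expressed with the k8s API types (the provisioner name is a placeholder, not a real driver):

```go
package main

import (
	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// localLVMClass sketches the StorageClass described above: binding and
// provisioning are delayed until a pod is scheduled, so the LV can be created
// on the chosen node. "lvm.csi.example.com" is a placeholder provisioner name.
func localLVMClass() *storagev1.StorageClass {
	mode := storagev1.VolumeBindingWaitForFirstConsumer
	return &storagev1.StorageClass{
		ObjectMeta:        metav1.ObjectMeta{Name: "local-lvm"},
		Provisioner:       "lvm.csi.example.com",
		VolumeBindingMode: &mode,
	}
}
```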

Of course this still leaves the obvious hole here that the Kubernetes scheduler has no idea if there is actually enough storage on the node to ensure these Node gRPC calls have any chance of success, but I'm setting this issue to the side for a moment.

gpaul commented 5 years ago

I would very much like to push this forward.

@matthias50 I don't think we should let volume creation piggy-back on NodeStageVolume, at least not in the long run. It sounds like a sensible short-term solution to your use case, but longer-term I think it will lead to confusion.

But I think the main blocking issue, as you pointed out, is supporting the GetCapacity call. We don't have a way to represent capacity for general topology in Kubernetes, so that has to be designed. In addition, the current semantics of the CSI GetCapacity call is not architecturally compatible with how resources/capacity are surfaced and used by the Kubernetes scheduler. Some more discussion here. I unfortunately haven't had time to think deeply about this and how other non-local storage systems could be supported, and would greatly appreciate if someone could do some deeper investigation here.

@msau42 I'd be happy to contribute research / design / code / general effort. I am slowly grokking how CSI and storage in general works in k8s (my relevant background is CSI in Mesos) so any pointers, links to open discussions or design docs relating to this work would be much appreciated.

cofyc commented 5 years ago

hi, @matthias50 @gpaul

I've been working on Kubernetes local storage and volume scheduling over the last year. If you have any questions about k8s, just ask me. My next goal is to support dynamic provisioning for local storage and other topology-aware storage types. We can work together.

I'm reading the csilvm project and related discussions and docs and writing a demo to demonstrate the design. I'm also writing a proposal based on @msau42 's design. I'll share it when it's ready for review.

gpaul commented 5 years ago

That's great to hear, @cofyc! Looking forward to hearing more.

matthias50 commented 5 years ago

@gpaul:

@matthias50 I don't think we should let volume creation piggy-back on NodeStageVolume, at least not in the long run. It sounds like a sensible short-term solution to your use case, but longer-term I think it will lead to confusion.

Do you have another suggestion as to a way to go about this? Right now, I don't see any other way to make sure that the volume is provisioned from the underlying storage on the node where it needs to be published.

gpaul commented 5 years ago

Do you have another suggestion as to a way to go about this? Right now, I don't see any other way to make sure that the volume is provisioned from the underlying storage on the node where it needs to be published.

No I don't, sorry. I'm hoping a design crystallizes out of the discussion here: https://github.com/container-storage-interface/spec/issues/301#issuecomment-495497067

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

jdef commented 5 years ago

/remove-lifecycle stale

jdef commented 5 years ago

This doc seems related, though I don't see that it links to a related KEP: https://docs.google.com/document/d/1WtX2lRJjZ03RBdzQIZY3IOvmoYiF5JxDX35-SsCIAfg/edit?disco=AAAADMBYnPc&ts=5d15db36&usp=comment_email_document&usp_dm=false

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 4 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 4 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close