kubernetes-retired / external-storage

[EOL] External storage plugins, provisioners, and helper libraries
Apache License 2.0

OpenEBS node-disk-manager #736

Closed: jsafrane closed this 5 years ago

jsafrane commented 6 years ago

What's our opinion about OpenEBS node disk manager (NDM)?

https://github.com/openebs/node-disk-manager https://github.com/openebs/node-disk-manager/pull/1 https://docs.google.com/presentation/d/1XcCWQL_WfhGzNjIlnL1b0kpiCvqKaUtEh9XXU2gypn4/

We could probably save some effort on both sides if we cooperate. For example, NDM's StoragePool idea looks like our LVM-based dynamic provisioner. And I personally like the automated discovery of local disks, which I'd need in order to deploy Gluster or Ceph on top of local PVs.

To me it seems that NDM is trying to solve a similar use case as we are; it's just more focused on the installation/discovery of the devices to consume as PVs, while Kubernetes focuses on the runtime aspects of how to use the local devices (i.e. scheduling and running pods). IMO, it would make sense to merge NDM with our local provisioner, or at least make the integration as easy as possible for both sides.

/area local-volume @ianchakeres @msau42 @davidz627 @cofyc @dhirajh @humblec ? (did I forget anyone?)

humblec commented 6 years ago

@jsafrane I have a very good opinion of the node disk manager effort from OpenEBS. That's one of the reasons I got involved in some discussions around this with the OpenEBS folks, hence the mentions of the Gluster operator and such in the design proposal. In reality, node disk discovery and handling of these components is a common problem, and different vendors solve it in different ways. Having a good common solution is a must, considering how important storage handling is in an orchestrator like Kube. It also helps avoid many vendor-specific efforts with limitations here and there. Our local storage provisioner is a (good) start; however, the OpenEBS proposal has some good thoughts which could be adopted or sharpened with community contributions and possibly merged with the local volume provisioner. In short, I am all for it and eagerly waiting for the next plan of action.

@kmova @umamukkara @epowell

Additional references:
https://blog.openebs.io/achieving-native-hyper-convergence-in-kubernetes-cb93e0bcf5d3

https://github.com/kubernetes/kubernetes/issues/58569

msau42 commented 6 years ago

I took a quick glance and it looks promising as an end-to-end disk management solution for distributed storage applications. The metrics and health monitoring aspects look very useful, and it should solve the issue of managing disks for DaemonSet-based providers.

I'm trying to think about how this could be integrated with the local-pv-provisioner from two angles:

For both use cases, I think there are still some challenges around the categorization of disks into StoragePools that would need to be ironed out. IIUC, NDM creates a Disk object for every block device in the system, so it would be up to the StoragePool implementation to further filter which Disks to use. And an implementation MUST filter out disks, otherwise it could end up stepping on the root filesystem or on devices used by other K8s volume plugins.
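To make the filtering concern concrete, here is a hypothetical StoragePool-style selector; the group/version, kind, and field names are made up for this discussion and are not NDM's actual API:

```yaml
# Hypothetical StoragePool-style object; group/version, kind, and fields
# are illustrative only, not part of NDM's real schema.
apiVersion: example.openebs.io/v1alpha1
kind: StoragePool
metadata:
  name: fast-pool
spec:
  nodeSelector:
    kubernetes.io/hostname: gke-openebs-user-default-pool-044afcb8-bmc0
  diskSelector:
    vendor: SEAGATE            # only claim disks from this vendor
    model: ST91000640NS        # ...and this model
  excludeDevices:
    - /dev/sda                 # never touch the device backing the root filesystem
```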

Filtering using Disk.Vendor + Disk.Model may be sufficient if you want all similar disks to be in the same StoragePool. The challenges I see are about how to support more advanced disk configurations:

The local PV provisioner didn't solve this and instead requires users to prep and categorize the disks beforehand. While I can see some of the simpler use cases being simplified by NDM, I'm not sure what the best way is to solve the more advanced ones.
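For comparison, the "prep beforehand" flow with the local volume provisioner looks roughly like this: an admin mounts or links the prepared disks under a per-class discovery directory, and the provisioner only creates PVs for what it finds there. The class name and directories below are placeholders:

```yaml
# Sketch of a local volume provisioner ConfigMap; the class name and
# directories are placeholders chosen for this example.
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-provisioner-config
  namespace: kube-system
data:
  storageClassMap: |
    fast-disks:
      hostDir: /mnt/fast-disks    # admin mounts/links prepared disks here
      mountDir: /mnt/fast-disks   # the same path as seen inside the provisioner pod
```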

kmova commented 6 years ago

@msau42 @jsafrane Thanks for the review and inputs!

@humblec and I have been discussing how to keep the disk inventory and storage pool implementation generic so they can be used in multiple scenarios. We have made some progress on the following (will shortly update the design PR):

We definitely need more help/feedback in terms of advanced usecases and API design.

msau42 commented 6 years ago

@kmova I'm wondering if it would be simpler to use PVs as your disk inventory instead of a new Disk CRD object. The advantage is that you can reuse the existing PVC/PV implementation to handle dynamic provisioning and attaching of volumes to nodes.
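For example, each discovered disk could be published as a raw-block local PV pinned to its node, with disk attributes carried in labels/annotations. The node name, device path, and annotation keys below are illustrative:

```yaml
# Sketch: one raw-block local PV per discovered disk; the node name,
# device path, and annotation keys are illustrative.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: disk-node-1-sdb
  labels:
    kubernetes.io/hostname: node-1
  annotations:
    example.io/disk-model: ST91000640NS   # illustrative annotation key
    example.io/disk-serial: 9XG4ABCD
spec:
  capacity:
    storage: 1Ti
  volumeMode: Block
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-disks
  local:
    path: /dev/sdb
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-1
```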

dhirajh commented 6 years ago

@kmova @msau42 I am trying to understand the slide titled "Complementing Local PV". Currently the local provisioner crawls through the discovery directory to find volumes to create PVs for. It appears that with NDM one could add another form of discovery, where NDM uses its own discovery mechanism to create local PVs. This seems like a useful enhancement to me, assuming it adds another mechanism and doesn't replace the existing one.

Regarding the question about using Disk CRs, I would like to better understand what information they actually store. I assume that, to support operations like unplugging and moving disks, the Disk CR stores more information than one would put in a local PV. Its lifecycle might also be a bit different from a PV's as a result. If that is the case, then keeping the Disk CR might make sense. Again, I need to understand what the information in the CR is and how it is used.

kmova commented 6 years ago

@msau42 - Using a PV in place of a new Disk CR, I was running into the following challenges:

Another consideration was the usability/operations perspective, for example management tools around Kubernetes like Weave Scope that can represent these disks as visual elements, with the ability to blink them, get iostats, etc.

kmova commented 6 years ago

@dhirajh The Disk CR can store details like the following (a rough sketch of such an object follows the list):

capacity
serial number
model
vendor
physical location (ie enclosure slot number)
rotational speed (if hard disk)
sector size
write cache
FW revision level
extended log pages

In addition, as part of dynamic attributes or monitored metrics:

state (online, removed, etc)
status (normal, faulted)
temperature (when applicable) 
smart errors
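
Putting those together, a Disk CR could look roughly like this; the group/version and field names are a sketch, not a final schema:

```yaml
# Rough sketch of a Disk CR carrying the attributes listed above;
# the group/version and field names are illustrative, not a final schema.
apiVersion: example.openebs.io/v1alpha1
kind: Disk
metadata:
  name: disk-3d8b1efad969208d6bf5971f2c34dd0c
spec:
  capacity:
    storage: 1Ti
  details:
    serial: 9XG4ABCD
    model: ST91000640NS
    vendor: SEAGATE
    enclosureSlot: 4
    rpm: 7200
    sectorSize: 512
    writeCache: enabled
    firmwareRevision: SN04
status:
  state: online            # online | removed
  health: normal           # normal | faulted
  temperatureCelsius: 34
  smartErrors: 0
```
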
msau42 commented 6 years ago

@kmova I think it's still possible to use PVs for inventory management. You don't necessarily have to mount the PVC directly in the Pod spec. If supporting all kinds of volume types, such as cloud block storage, is on your future roadmap, then the PVC abstraction could also be used to provide dynamic provisioning and disk attachment capabilities.

kmova commented 6 years ago

@msau42 - IIUC, the PVs can be created by NDM, and the additional disk attributes could be added under annotations (or maybe under an extended spec?). To cover the case in https://github.com/kubernetes/kubernetes/issues/58569, the pod can still mount "/dev" and the configuration can specify the PV objects it can use, which will carry the path information.

I like the idea of using the PVC abstractions for dynamic provisioning. How do we get the PVs attached to the node without adding them to a Deployment/App spec?

msau42 commented 6 years ago

Getting the PVs attached to the node is the hard part, because it is tied to Pod scheduling. Having a Pod per PV is probably not going to scale, and you have to handle cases like the pod getting evicted. I'm not sure if leveraging the VolumeAttachment object would work; it may conflict with or confuse the attach/detach controller.

humblec commented 6 years ago

@msau42 I do think expanding the VolumeAttachment object for local storage/disk handling could complicate things. It's better to have it on another/new API object or some custom CRD like NDM currently has. If a custom CRD for a disk object is not optimal, we may think about a new API object for disk/local storage handling, IMO.

@kmova I feel we should also have the node mapping in the Disk CRD. That will help us a lot when considering scheduling, backtracking, or other decision making based on this object.

kmova commented 6 years ago

@humblec - yes we can get the topology labels from the node where the disks are discovered and attach them to the Disk objects.

Example:

```yaml
kind: Disk
metadata:
  name: disk-3d8b1efad969208d6bf5971f2c34dd0c
  labels:
    "kubernetes.io/hostname": "gke-openebs-user-default-pool-044afcb8-bmc0"
```

In addition, based on feedback, I have included the ability to fetch additional information that describes how disks are attached - via internal bus, HBA, SAS expanders, etc. This information can be used while provisioning latency-sensitive pools.
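For example, the Disk object above could carry both the topology labels and a hint about the attachment path; the connection fields below are purely illustrative:

```yaml
# Illustrative extension of the Disk object above; the "connection"
# fields are a sketch of how attachment information could be exposed.
kind: Disk
metadata:
  name: disk-3d8b1efad969208d6bf5971f2c34dd0c
  labels:
    "kubernetes.io/hostname": "gke-openebs-user-default-pool-044afcb8-bmc0"
spec:
  connection:
    bus: sas                 # internal bus, HBA, NVMe, ...
    viaExpander: true        # attached behind a SAS expander
    controller: lsi-9300-8i  # illustrative HBA identifier
```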

msau42 commented 6 years ago

Agree, I don't see a great way to handle attached disk types without always forcing some Pod to be on the node. I think Disk CRD could work fine if you only plan on supporting local disks. But since other volume types were mentioned in the roadmap, I was trying to envision how things like provisioning and attaching could be supported without having to reimplement volume plugins and much of the Kubernetes volume subsystem.

rootfs commented 6 years ago

cc @travisn @bassam for rook and @jcsp for ceph

I'd like to see some convergence on disk object schemas.

msau42 commented 6 years ago

As an alternative data point, I spoke a bit with @dhirajh about how they deploy Ceph in their datacenter. He mentioned that they use StatefulSets, and each replica (OSD) manages just one local PV. All replicas use the same class and capacity of disk; instead, a higher-level operator manages multiple StatefulSets and balances them across fault domains (i.e. racks). This operator is in charge of making sure that capacity is equal across fault domains, and it can scale up each StatefulSet when more Ceph capacity is requested. With this architecture, they don't need their Ceph pods to manage multiple disks, and a disk failure is contained to a single replica, so they can use PVCs directly. For cases where nodes have different numbers of disks and capacities, the operator can create more StatefulSets to use them.
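A minimal sketch of that pattern, assuming raw-block local PVs already exist under a placeholder `local-disks` StorageClass (the image and names below are placeholders too):

```yaml
# Minimal sketch of the "one local PV per OSD replica" pattern;
# the image, names, and sizes are placeholders.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ceph-osd-rack-a
spec:
  serviceName: ceph-osd-rack-a
  replicas: 3
  selector:
    matchLabels:
      app: ceph-osd
      rack: a
  template:
    metadata:
      labels:
        app: ceph-osd
        rack: a
    spec:
      containers:
        - name: osd
          image: example/ceph-osd:latest    # placeholder image
          volumeDevices:
            - name: osd-disk
              devicePath: /dev/osd-disk     # raw block device handed to the OSD
  volumeClaimTemplates:
    - metadata:
        name: osd-disk
      spec:
        accessModes: ["ReadWriteOnce"]
        volumeMode: Block
        storageClassName: local-disks       # assumes pre-created local PVs
        resources:
          requests:
            storage: 1Ti
```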

kmova commented 6 years ago

Thanks @msau42, that's a good data point. I will add this to the design document. Along with this, I will also gather additional details on use cases where the storage pods need multiple PVs, and on the expected behaviour when using SPDK to access disks.

humblec commented 6 years ago

@msau42 @kmova IMO there are a good number of use cases where a storage pod needs more than one local PV. For example, sometimes the storage pod has to keep its own metadata in one PV and use other PVs for data volumes or for serving volume create requests. From another angle, one local PV may not be sufficient to serve all the PVC requests coming from the Kube user. At least in Gluster we support around 1000 volumes from a 3-node Gluster cluster; just attaching one disk and carving out space from it may not be sufficient.

jarrpa commented 6 years ago

In Gluster's case, also, it would be fairly heavyweight to have one GlusterFS pod per device on a node. Never mind that it would also limit per-node scale-out expansion, which is one of the core features of Gluster.

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 5 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 5 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 5 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes-incubator/external-storage/issues/736#issuecomment-504663276):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.