gluster / glusterd2

[DEPRECATED] Glusterd2 is the distributed management framework to be used for GlusterFS.

Device management APIs #728

Open · aravindavk opened this issue 6 years ago

aravindavk commented 6 years ago

The current implementation of device management is very hacky: since the device details are stored in Peer.Metadata, we are restricted to storing them as a string. As a result, device add/edit/delete involves multiple Marshal/Unmarshal passes over the device details.
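
To make the limitation concrete, here is a minimal sketch of the cost of the current layout; the types and the `_devices` key are illustrative assumptions, not GD2's actual code:

```go
package main

import "encoding/json"

// Illustrative types, not GD2's actual structures.
type Device struct {
	Name  string `json:"name"`
	State string `json:"state"`
}

type Peer struct {
	ID       string
	Metadata map[string]string // values can only be strings
}

// addDevice shows the problem: the whole device list must be
// unmarshalled, mutated, and re-marshalled for a single add.
func addDevice(p *Peer, d Device) error {
	var devices []Device
	if raw, ok := p.Metadata["_devices"]; ok {
		if err := json.Unmarshal([]byte(raw), &devices); err != nil {
			return err
		}
	}
	devices = append(devices, d)
	encoded, err := json.Marshal(devices)
	if err != nil {
		return err
	}
	p.Metadata["_devices"] = string(encoded)
	return nil
}
```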

The following changes are required in our device management:

If we move to a different namespace, the only issue is cleanup of the devices belonging to a peer on peer detach. That can be addressed using the Peer detach event.
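
A hedged sketch of what the separate namespace could look like with the etcd v3 Go client; the `devices/<peer-id>/<device-name>` key scheme is an assumption, not a settled GD2 schema. The point is that peer-detach cleanup becomes a single prefixed delete:

```go
package main

import (
	"context"

	"github.com/coreos/etcd/clientv3"
)

// Hypothetical key scheme: devices/<peer-id>/<device-name>.
const devicePrefix = "devices/"

// onPeerDetach removes every device record belonging to the detached
// peer in one call, addressing the cleanup concern above.
func onPeerDetach(ctx context.Context, kv clientv3.KV, peerID string) error {
	_, err := kv.Delete(ctx, devicePrefix+peerID+"/", clientv3.WithPrefix())
	return err
}
```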

store-utils:

APIs required:

The following APIs are not required now, but would be good to have for debugging or monitoring:

phlogistonjohn commented 6 years ago

I've started looking at the parts of GD2 related to higher-level volume management and related things, although I haven't had much time to look at the current code in GD2. In light of plans to merge heketi features into GD2, I think it's important to compare and contrast what heketi currently does with what GD2 has or plans to do. So I'm happy to see this topic brought up independently.

Heketi master currently has the following APIs around devices:

- device add: add a new device
- device state set (states: online, offline, failed): change the availability of the device for other operations (see below)
- device remove: remove the device
- device resync: compare heketi db with device state on nodes and update heketi
- device info: fetch device information
- update device tags (metadata strings): tags are user controlled metadata that can be used during operations like volume create for brick placement

Device state online means the device is valid for use in new volumes. Device state offline means the device is not available for new volumes but does not impact existing volumes. Device state failed attempts to remove all bricks from the device (migrating them to other devices). This is some of the most complicated code in heketi.
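
A small sketch of those state semantics (illustrative Go, not heketi's actual types):

```go
package main

// DeviceState models the three states described above.
type DeviceState int

const (
	DeviceOnline  DeviceState = iota // valid for bricks of new volumes
	DeviceOffline                    // existing bricks unaffected, no new bricks
	DeviceFailed                     // evacuate: migrate all bricks elsewhere
)

// canPlaceNewBrick captures the allocation rule: only online devices
// accept bricks for new volumes.
func canPlaceNewBrick(s DeviceState) bool {
	return s == DeviceOnline
}
```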

Device resync was added to help users correct the heketi db when the underlying storage sizes changed or got out of sync.

Heketi recently added an "operations" framework partly to deal with synchronization issues with gluster, but items like LVs, mounts, etc. are going to continue to be independent of a GD2+heketi system. I've read up a bit on the GD2 transaction framework, but I don't yet see whether it can handle this. Consider the following: a user requests storage on a given device (perhaps as part of a higher-level api), and GD2 starts creating LVM devices and formatting them, but then is rebooted during the process. Heketi is attempting to deal with this by logging the pending items in the db (in a transaction) before making changes to the system. That way, when we come back, we can clean up or continue the partly done items (this auto-cleanup feature is still in development).
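
A hedged sketch of that pending-items pattern; `DB`, `newOpID`, and `makeLVAndFormat` are hypothetical stand-ins, not heketi's actual schema or helpers:

```go
package main

// PendingOp records intent in the db *before* any LV/mkfs work starts,
// so a restart can find and clean up (or resume) half-done work.
type PendingOp struct {
	ID      string   // unique operation id
	Action  string   // e.g. "brick-create"
	Targets []string // ids of objects being created or modified
}

// DB is a hypothetical stand-in for heketi's database layer.
type DB interface {
	SavePending(op PendingOp) error // durable write, inside a db transaction
	CommitPending(id string) error  // remove the pending marker on success
}

// Hypothetical stubs for the actual system work.
var newOpID = func() string { return "op-1" }
var makeLVAndFormat = func(brickID string) error { return nil } // lvcreate + mkfs.xfs

func createBrick(db DB, brickID string) error {
	op := PendingOp{ID: newOpID(), Action: "brick-create", Targets: []string{brickID}}
	if err := db.SavePending(op); err != nil { // record intent before side effects
		return err
	}
	if err := makeLVAndFormat(brickID); err != nil {
		// The pending record stays in the db, so a later cleanup pass
		// knows a partial LV may exist and can remove it.
		return err
	}
	return db.CommitPending(op.ID) // system state and db now agree
}
```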

I'd love to hear how this model might work with the GD2 transaction approach. I'd also like to hear any thoughts about updating multiple items in the db in one (db-style) transaction. How would this work with etcd the way it is used by GD2?

One other thing I think we should keep in mind is that in some environments a device can be moved from one node to another. In some cloud or SAN environments it could be useful to eventually be able to track the device semi-independently of the node. That's another good reason not to track devices as part of the nodes, and maybe a reason to identify the devices in node-independent ways (a UUID, for example).

So much of heketi's logic is around device management (versus the smaller part that is volume provisioning) that I don't think we can discuss one topic without the other! :-)

phlogistonjohn commented 6 years ago

Sorry for the wall of text! :-D

aravindavk commented 6 years ago

Thanks for the comments.

I've started looking at the parts of GD2 related to higher-level volume management and related things, although I haven't had much time to look at the current code in GD2. In light of plans to merge heketi features into GD2, I think it's important to compare and contrast what heketi currently does with what GD2 has or plans to do. So I'm happy to see this topic brought up independently.

Part of the device management code is already available in GD2; I opened this issue to address the limitations I found during the integration with the Intelligent Volume provisioning work.

Heketi master currently has the following APIs around devices:

device add: add a new device

Already available in GD2; a few validations and a different structure are proposed in this issue. (Currently stored as part of the Peer object.)

device state set (states: online, offline, failed): change the availability of the device for other operations (see below)

This looks interesting: changing the device availability for maintenance work.

device remove: remove the device

Yet to be implemented

device resync: compare heketi db with device state on nodes and update heketi

Yet to be implemented

device info: fetch device information

Yet to be implemented

update device tags (metadata strings): tags are user controlled metadata that can be used during operations like volume create for brick placement

Nice. The Peer and Volume Metadata patches are being worked on. We can add Metadata to devices as well once the code is refactored to store devices separately. (Currently devices are stored as part of the Peer object.)

Device state online means the device is valid for use in new volumes. Device state offline means the device is not available for new volumes but does not impact existing volumes. Device state failed attempts to remove all bricks from the device (migrating them to other devices). This is some of the most complicated code in heketi.

+1, I like this feature.

Device resync was added to help users correct the heketi db when the underlying storage sizes changed or got out of sync.

Heketi recently added an "operations" framework partly to deal with synchronization issues with gluster, but items like LVs, mounts, etc. are going to continue to be independent of a GD2+heketi system. I've read up a bit on the GD2 transaction framework, but I don't yet see whether it can handle this. Consider the following: a user requests storage on a given device (perhaps as part of a higher-level api), and GD2 starts creating LVM devices and formatting them, but then is rebooted during the process. Heketi is attempting to deal with this by logging the pending items in the db (in a transaction) before making changes to the system. That way, when we come back, we can clean up or continue the partly done items (this auto-cleanup feature is still in development).

Compared to Heketi, GD2 has more control over its internal state and data structures. Device status, peer info, and volume info are stored in etcd.

The Heketi REST API is asynchronous by default, so it needs the replay/cleanup mechanism on reboot. Since the GD2 REST API is synchronous, the application will get an error (connection lost) on reboot or if glusterd2 goes down. Transaction details are stored in a temporary namespace in etcd; once the transaction completes successfully, the data is written to the permanent namespace in etcd. But cleanup of half-done tasks is yet to be implemented in GD2.
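
A minimal sketch of that flow with the etcd v3 Go client; the `txn/pending/` and `volumes/` prefixes are assumptions, not GD2's actual key layout:

```go
package main

import (
	"context"

	"github.com/coreos/etcd/clientv3"
)

// commitTxn writes transaction state under a temporary prefix while the
// transaction runs, and copies it to the permanent prefix only on success.
func commitTxn(ctx context.Context, kv clientv3.KV, txnID, volName string, volJSON []byte) error {
	tmpKey := "txn/pending/" + txnID + "/" + volName
	if _, err := kv.Put(ctx, tmpKey, string(volJSON)); err != nil {
		return err
	}
	// ... transaction steps run here; if glusterd2 dies mid-way, the
	// pending key is left behind for a (yet to be written) cleanup pass ...
	if _, err := kv.Put(ctx, "volumes/"+volName, string(volJSON)); err != nil {
		return err
	}
	_, err := kv.Delete(ctx, tmpKey)
	return err
}
```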

I'd love to hear how this model might work with the GD2 transaction approach. I'd also like to hear any thoughts about updating multiple items in the db in one (db-style) transaction. How would this work with etcd the way it is used by GD2?

etcd is a distributed key-value store; all information related to a key (an object in GD2: peer, volume, brick, etc.) can be saved in a single call.
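
As an illustration of the single-call point (the `Volume` type and key prefix below are assumptions, not GD2's actual schema):

```go
package main

import (
	"context"
	"encoding/json"

	"github.com/coreos/etcd/clientv3"
)

// Volume is an illustrative object; a real GD2 volume has more fields.
type Volume struct {
	Name   string   `json:"name"`
	Bricks []string `json:"bricks"`
}

// saveVolume persists the whole object with one Put: everything related
// to the key is serialized into a single value.
func saveVolume(ctx context.Context, kv clientv3.KV, v Volume) error {
	data, err := json.Marshal(v)
	if err != nil {
		return err
	}
	_, err = kv.Put(ctx, "volumes/"+v.Name, string(data))
	return err
}
```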

One other thing I think we should keep in mind is that in some environments a device can be moved from one node to another. In some cloud or SAN environments it could be useful to eventually be able to track the device semi-independently of the node. That's another good reason not to track devices as part of the nodes, and maybe a reason to identify the devices in node-independent ways (a UUID, for example).

I think we need a separate API for this use case.

So much of heketi's logic is around device management (versus the smaller part that is volume provisioning) that I don't think we can discuss one topic without the other! :-)

phlogistonjohn commented 6 years ago

Compared to Heketi, GD2 has more control over its internal state and data structures. Device status, peer info, and volume info are stored in etcd.

Sure. This is one of the reasons we're discussing merging the feature sets. All gluster-specific metadata should be stored once, in the GD2/etcd db.

The Heketi REST API is asynchronous by default, so it needs the replay/cleanup mechanism on reboot. Since the GD2 REST API is synchronous, the application will get an error (connection lost) on reboot or if glusterd2 goes down.

I don't think the way the client responses are delivered here is the main issue. What I'm focused on is not the response to the client but the intermediate state in the system (one or all nodes). The issue is that not all the things we need to do are naturally atomic. We don't have the ability to create an LV and format it with xfs atomically, for example.

Transaction details are stored in a temporary namespace in etcd; once the transaction completes successfully, the data is written to the permanent namespace in etcd. But cleanup of half-done tasks is yet to be implemented in GD2.

Ah! This is very helpful. In Heketi we chose a different approach, but we had discussed using a separate set of buckets (close enough to a namespace for our purposes).

So, in theory we could pre-record all the needed tracking objects in this temporary namespace; on success they get atomically swapped into the other namespace, and on failure or a restart with a dirty namespace, the items in the system could be cleaned up using the tracking items in the temp namespace before the temp namespace itself is cleared.

Older versions of Heketi really suffered from the assumption that Go's defer statements would be able to catch errors and automatically clean up these intermediate items. That approach was not good at handling the fact that a process can terminate at any particular line of code, in which case those defer statements never run. I am a bit crazed right now about the idea that we really need to be robust against events like that, and I want to understand what steps the current GD2 architecture takes to deal with unexpected termination and/or what could be done in the future. And FWIW, these terminations appear to happen a lot more frequently in the containerized/kubernetes world.

I'd love to hear how this model might work with the GD2 transaction approach. I'd also like to hear any thoughts about updating multiple items in the db in one (db-style) transaction. How would this work with etcd the way it is used by GD2?

etcd is a distributed key-value store; all information related to a key (an object in GD2: peer, volume, brick, etc.) can be saved in a single call.

Sure. I've read over the etcd v3 api specs, even if I haven't got much hands-on experience with it yet. What I was curious about is what approach you'd expect to take in the case of a hypothetical like the following:

We have a volume V with bricks A, B, C on devices X, Y, Z respectively. The user wants to replace device X, so we need to create a brick D on device W and update volume V. We'd want to make sure that, once we had decided to place brick D on W, the space available to other, concurrent requests is reduced.

The database layer heketi uses supports read and write transactions, so I know that if I update V, A, D, and X, it will all happen together. I am curious what technique GD2 would use to make a batch of changes atomically like that.
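
For reference, etcd v3's transaction API can express this kind of multi-key, all-or-nothing update. A sketch under assumed key names and a mod-revision guard (not GD2's actual code) might look like:

```go
package main

import (
	"context"
	"fmt"

	"github.com/coreos/etcd/clientv3"
)

// replaceBrick guards on device W's last-seen mod revision, then writes
// volume V, bricks A and D, and devices W and X in one atomic etcd
// transaction: a concurrent allocator sees either all changes or none.
func replaceBrick(ctx context.Context, cli *clientv3.Client,
	wRev int64, volV, brickD, devW, devX string) error {
	resp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.ModRevision("devices/W"), "=", wRev)).
		Then(
			clientv3.OpPut("volumes/V", volV),  // V now references brick D
			clientv3.OpDelete("bricks/A"),      // old brick removed from X
			clientv3.OpPut("bricks/D", brickD), // new brick placed on W
			clientv3.OpPut("devices/W", devW),  // free space reduced
			clientv3.OpPut("devices/X", devX),  // marked for replacement
		).
		Commit()
	if err != nil {
		return err
	}
	if !resp.Succeeded {
		return fmt.Errorf("devices/W changed concurrently; re-plan and retry")
	}
	return nil
}
```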

aravindavk commented 6 years ago

I don't think the way the client responses are delivered here is the main issue. What I'm focused on is not the response to the client but the intermediate state in the system (one or all nodes). The issue is that not all the things we need to do are naturally atomic. We don't have the ability to create an LV and format it with xfs atomically, for example.

Noted. This is a generic problem applicable to all the APIs provided by Glusterd2. I will open a new issue for it.

The database layer heketi uses supports read and write transactions, so I know that if I update V, A, D, or X it will all happen together. I am curious what technique GD2 would take to make a batch of changes atomically like that.

A transaction in GD2 is a list of steps, each with two functions, Do and Undo (each step can be restricted to run on specific nodes). Only the Do functions are called while the transaction runs; if one fails, the Undo functions are called in reverse order. I think @kshlm can provide more details about this framework.
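
A minimal sketch of the shape of that pattern (the idea as described above, not the actual GD2 transaction framework API):

```go
package main

// Step pairs a forward action with its compensating action.
type Step struct {
	Do    func() error
	Undo  func()
	Nodes []string // optional: restrict the step to specific peers
}

// runTxn executes the Do functions in order. If one fails, the Undo
// functions of the already-completed steps run in reverse order.
func runTxn(steps []Step) error {
	for i, s := range steps {
		if err := s.Do(); err != nil {
			for j := i - 1; j >= 0; j-- {
				steps[j].Undo()
			}
			return err
		}
	}
	return nil
}
```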

aravindavk commented 6 years ago

One more addition to the required APIs:

aravindavk commented 6 years ago

Not an immediate requirement, but it would be good to have Metadata support for devices as well (similar to Peer and Volume).

Use case:

rishubhjain commented 6 years ago

@aravindavk Do we need to allow the user to edit device information (some of it) after adding a device? Enabling/disabling can be a part of EditDeviceAPI in case we want to make it generic; otherwise we could continue with the EditDeviceState API (https://github.com/gluster/glusterd2/pull/810).

aravindavk commented 6 years ago

@aravindavk Do we need to allow the user to edit device information (some of it) after adding a device? Enabling/disabling can be a part of EditDeviceAPI in case we want to make it generic; otherwise we could continue with the EditDeviceState API (#810).

Maybe the device's Metadata, but for now we can limit EditDeviceAPI to enable/disable.
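
A hypothetical request shape for such an EditDeviceAPI (the struct and field names are illustrative only): enable/disable now, with room to add Metadata later without breaking clients:

```go
package main

// EditDeviceReq is a hypothetical request body for EditDeviceAPI.
type EditDeviceReq struct {
	// "enabled" or "disabled"; the only edit supported for now.
	State string `json:"state,omitempty"`
	// Possible future addition, similar to Peer and Volume Metadata.
	Metadata map[string]string `json:"metadata,omitempty"`
}
```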

atinmu commented 6 years ago

@rishubhjain Can't this be targeted for GCS-Sprint1?

rishubhjain commented 6 years ago

@atinmu Since all the important pieces related to IVP are already in, we can take this up in the next sprint and focus on other blockers in this sprint (GCS-Sprint1).

atinmu commented 5 years ago

The remove device API isn't a blocker for the rescoped MVP of GCS/1.0.