openebs / mayastor

Dynamically provision stateful, persistent, replicated, cluster-wide fabric volumes and filesystems for Kubernetes, provisioned from an optimized NVMe SPDK backend data storage stack.
Apache License 2.0

topology aware replication #1545

Open · elias-dbx opened this issue 10 months ago

elias-dbx commented 10 months ago

Is your feature request related to a problem? Please describe.

It would be really useful if mayastor replication could be topology aware beyond just hostnames. Currently it is possible for all replicas of a mayastor volume to live in the same rack/room, which means a single failure can take out all volume replicas (for example, a rack dying can take down every replica of a volume even though the replicas are spread across different hosts).

Describe the solution you'd like

Topology spread

It would be really nice if mayastor allowed users to specify a topology key to spread volume replicas over. For example, if I labeled all my kubelets with rack names:

rack=rack1
rack=rack2
rack=rack3

It would be really nice to be able to tell mayastor to spread replicas over the "rack" topology key. Ideally this would be a parameter of the storage class, so that different storage classes could provide different durability guarantees. A "topologySpreadKey" could be added to the mayastor storage class model.

Create a 3x replicated storage class durable to rack failures:

cat <<EOF | kubectl create -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mayastor-3-rack-diverse
parameters:
  ioTimeout: "30"
  protocol: nvmf
  repl: "3"
  topologySpreadKey: "rack"
provisioner: io.openebs.csi-mayastor
EOF

Create a 3x replicated storage class durable to host failures (the current behavior):

cat <<EOF | kubectl create -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mayastor-3-host-diverse
parameters:
  ioTimeout: "30"
  protocol: nvmf
  repl: "3"
  topologySpreadKey: "kubernetes.io/hostname"
provisioner: io.openebs.csi-mayastor
EOF

Topology affinity

Similarly, it could be really useful to specify topology affinity to allow colocating replicas within the same topology for performance reasons. For example, we may want to colocate replicas on the same rack, trading some durability for better replication performance.

We could similarly have a topology affinity key in the storage class:

cat <<EOF | kubectl create -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mayastor-3-rack-affinity
parameters:
  ioTimeout: "30"
  protocol: nvmf
  repl: "3"
  topologyAffinityKey: "rack"
provisioner: io.openebs.csi-mayastor
EOF

Describe alternatives you've considered

I don't think there are any alternatives right now.

Additional context

sinhaashish commented 10 months ago

Hi @elias-dbx, we are presently working on this requirement. The solution we are working on adds a labelled field to the disk pool. Labelling the disk pool allows colocating replicas within the same topology for performance reasons. We can have a key and value under topology, e.g. type: hot or classification: premium.

apiVersion: openebs.io/v1alpha1
kind: DiskPool
metadata:
  name: diskpool-ksnk5
spec:
  topology:
    labelled:
      user_defined_key: user_defined_value
  disks:
  - /dev/sdc
  node: node-a

The user can then have a storage class specifying the topology labels:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: mayastor-1
parameters:
  repl: '1'
  protocol: 'nvmf'
  ioTimeout: '60'
  pool_topology_inclusion: '{"user_defined_key": "user_defined_value"}'
provisioner: io.openebs.csi-mayastor
volumeBindingMode: Immediate

The replicas of the volume will be created on the pools which match the topology parameters.

With this approach the user will be able to select the pools on which the replicas of the volume are placed.
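
As an illustration, an application would then consume this class through an ordinary PVC (a minimal sketch; the PVC name and size below are placeholders):

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: example-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: mayastor-1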

This feature will be available soon.

elias-dbx commented 10 months ago

Thank you for your response @sinhaashish.

The solution you are working on appears to solve the problem of co-locating replicas within the same topology, but it does not solve the durability problem of potentially having all replicas running in the same topology. Am I correct?

From a user perspective the proposed solution would be quite cumbersome for maintaining large clusters. For example, if I am managing many racks (say 100), I would need 100 different storage classes, each pinning to a different rack, given the solution you proposed.

For my use case (spreading replicas or colocating replicas) I don't so much care which specific node the replicas are put on, only that the topology constraints are satisfied.

From what I gather, the solution you are working on would be more useful for things like pinning replicas to pools with certain storage characteristics -- maybe you have "fast" NVMe devices and "slower" NVMe devices in the cluster and you want the storage class to pick between them. My issue of topology spread/colocation seems to be a separate problem which is not solved by the proposed pool_topology_inclusion.

Using a Pod scheduling analogy, pool_topology_inclusion is like a nodeSelector, whereas what I am looking for is something closer to topologySpreadConstraints.
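
For reference, this is the kind of spreading semantics I mean, expressed with the standard Kubernetes Pod field (purely an analogy, not a mayastor API; the pod name, label, and image are made up):

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  labels:
    app: example
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: rack                 # spread across nodes with distinct "rack" label values
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: example
  containers:
  - name: app
    image: nginx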

tiagolobocastro commented 10 months ago

For this I think the exclusion labels (not used for provisioning yet) would work, for example: pin the replicas to pools with a "zone" label but not allow more than 1 replica in the same zone. We'll take this onboard and come back to you with an update.

Another interesting thing might be to also pull specific labels (eg: openebs.io/topology/rack: abc) from the k8s node labels and add them to the pool spec automatically, though this might be more controversial since we try to avoid touching the pool CR spec, as that is user defined only.
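
As a rough sketch of what that automation (or a manual workaround today) could amount to, assuming the labelled pool topology lands and using a plain rack label as a placeholder key (the node, pool name, and values are made up):

kubectl label node node-a rack=rack1

apiVersion: openebs.io/v1alpha1
kind: DiskPool
metadata:
  name: diskpool-node-a
spec:
  node: node-a
  disks:
  - /dev/sdc
  topology:
    labelled:
      rack: rack1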

elias-dbx commented 10 months ago

Yeah, it would be interesting to pull k8s node labels into the pool automatically, but that is mostly a nice-to-have and it isn't very hard to write automation for that. It could also make migrations harder since the host labels would always have to match the pool labels.

Thanks again for the reply; let me know if you want to sync up or bounce ideas. More generally, the problem I am trying to solve is replica affinity and anti-affinity for racks in a scalable way (one that works for 100s of racks). We don't care which specific rack we have affinity or anti-affinity for; we just want the replicas to either:

be spread across different racks, or
all be colocated within a single rack.

sinhaashish commented 10 months ago

@elias-dbx, as @tiagolobocastro suggested, we have provision for exclusion labels, though they are not exposed to users yet. Your use case can probably be solved using these.

Please reply if this fits your use case.

elias-dbx commented 10 months ago

Hi @sinhaashish. Looking at the code, ExclusiveLabel does appear to do what we want for spreading volumes across a topology (or 'Zones' as they are called in mayastor).

Is this comment for InclusiveLabel correct? https://github.com/openebs/mayastor-control-plane/blob/be91751ed9ed841501ddd9a53286252fc6b97a75/control-plane/stor-port/src/types/v0/transport/volume.rs#L309 I would have assumed that the InclusiveLabel would act as the complement of ExclusiveLabel?

In the example given if my InclusiveLabel is Zone and I have a resource with Zone: A I would expect it to be paired up with another resource in Zone: A, but the comment seems to suggest otherwise?

/// if label is "Zone":
/// A resource with "Zone: A" would be paired up with a resource with "Zone: B",
/// but not with a resource with "OtherLabel: B"

sinhaashish commented 10 months ago

@elias-dbx the comment for InclusiveLabel is:

/// Includes resources with the same $label or $label:$value eg:
/// if label is "Zone: A":
/// A resource with "Zone: A" would be paired up with a resource with "Zone: A",
/// but not with a resource with "Zone: B"
/// if label is "Zone":
/// A resource with "Zone: A" would be paired up with a resource with "Zone: B",
/// but not with a resource with "OtherLabel: B"
/// inclusive label key value in the form "NAME: VALUE"

To your question:

"In the example given if my InclusiveLabel is Zone and I have a resource with Zone: A I would expect it to be paired up with another resource in Zone: A, but the comment seems to suggest otherwise?"

You need to have the InclusiveLabel as Zone: A if you expect it to be paired up with another resource in Zone: A.

If you just have the label without any value, like just Zone, it can match any value corresponding to that label. So it can match Zone: A or Zone: B or Zone: C, but it will not match any other label.

For example, an inclusive label of Zone won't match another inclusive label such as territory.
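
To make that concrete, imagine two pools labelled as below (a hypothetical sketch reusing the labelled pool topology from the earlier example): an inclusive label of "Zone: A" would only match diskpool-a, while a key-only inclusive label of "Zone" could match either pool, whatever its value.

apiVersion: openebs.io/v1alpha1
kind: DiskPool
metadata:
  name: diskpool-a
spec:
  node: node-a
  disks:
  - /dev/sdc
  topology:
    labelled:
      Zone: A
---
apiVersion: openebs.io/v1alpha1
kind: DiskPool
metadata:
  name: diskpool-b
spec:
  node: node-b
  disks:
  - /dev/sdc
  topology:
    labelled:
      Zone: B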

sinhaashish commented 10 months ago

Hi @sinhaashish . Looking at the code ExclusiveLabel does appear to do what we want to do for spreading volumes across topology (or 'Zones' as they are called in mayastor).

If this works for you , then we can incorporate this change in upcoming releases.

elias-dbx commented 10 months ago

Yeah, exposing ExclusiveLabel as a parameter on the storage class should work for spreading volume replicas across zones, from my understanding.

The InclusiveLabel doesn't really do what I would like. Ideally I would like to define at the storage class level that all replicas for the same volume have the same label value, but I don't care what that label value is.

For example, say I have 100 racks (rack0, rack1, rack2, ... rack99) of compute and I want to define a storage class which colocates volume replicas within the same rack, because the rack has a switch which allows very fast interconnect between nodes on that rack. With my understanding of InclusiveLabel, I would have to create a storage class for each and every rack to colocate replicas on those racks.

Going back to my previous example this is how I would imagine the storage class would look like for colocating volume replicas on the same zone (rack):

cat <<EOF | kubectl create -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mayastor-3-rack-affinity
parameters:
  ioTimeout: "30"
  protocol: nvmf
  repl: "3"
  topologyAffinityKey: "rack"
provisioner: io.openebs.csi-mayastor
EOF

But if topologyAffinityKey maps to InclusiveLabel, the replicas could be placed in any pool with the rack label, whereas we want the replicas to be placed only in pools which share the same value of the rack label.

sinhaashish commented 9 months ago

Hi @elias-dbx, we have discussed the approach and finalised the following.

The storage class will have nodeTopologyAffinity, nodeTopologySpread and nodeInclusion; these will decide the node topology.

For pools, we will have poolTopologyAffinity and poolTopologySpread; these labels will be present in the disk pool.

nodeTopologyAffinity: "rack: A"  <- key: val it's the same as our current inclusion list!
replica1: "rack: A"
replica2: "rack: A"

nodeTopologyAffinity: "rack"
replica1: "rack: A"    <- A determined at creation time
replica2: "rack: A"

nodeInclusion: "rack"
replica1: "rack: A" | "rack: B" <- If replica1 goes to rack:A the replica2 goes to rack: A 
replica2: "rack: A" | "rack: B" <- If replica1 goes to rack:B the replica2 goes to rack: B

nodeTopologySpread: "rack"  <- exclusion list
replica1: "rack: A"
replica2: "rack: B"

poolTopologyAffinity: "performance: Gold"
replica1: "performance: Gold"
replica2: "performance: Gold"

poolTopologySpread: "storage: Hot"
replica1: "storage: Cold"
replica2: "storage: Warm"

elias-dbx commented 9 months ago

that looks perfect!

iceman91176 commented 3 months ago

Hi, any news about this? When can we expect this feature? Thank you.

orville-wright commented 3 months ago

Hi @iceman91176, I run Product Mgmt and Strategy for openEBS (see the Maintainers Leadership Team here: https://github.com/openebs/community?tab=readme-ov-file#project-leadership-team).

We've just frozen the v4.1 dev train and are building the v4.1 RC now. The first implementation of the Topology Aware Replication feature is planned for the v4.1 release, thanks to great design feedback from @elias-dbx. We hope to drop v4.1 in the next couple of weeks, so you should see it very soon. No guarantees on a final date, but the RC is being built right now.