openebs / mayastor

Dynamically provision Stateful Persistent Replicated Cluster-wide Fabric Volumes & Filesystems for Kubernetes that are provisioned from an optimized NVMe SPDK backend data storage stack.
Apache License 2.0

VolumeSpec value can bloat as cluster size (number of nodes) increases due to allowed_nodes and preferred_nodes #1154

Closed hasethuraman closed 2 years ago

hasethuraman commented 2 years ago

Describe the bug: I created a cluster with 4 nodes (1 master and 3 agents) and I see that all nodes are added to allowed_nodes and preferred_nodes. So as the cluster size increases, and no topology information is supplied, every node ends up being captured in these sections. With thousands of volumes in such a large cluster, this can potentially increase etcd's disk usage and the overall latency.

```
/namespace/mayastor/control-plane/VolumeSpec/38098332-3acc-4850-874b-a5315acf3dce
{
  "uuid": "38098332-3acc-4850-874b-a5315acf3dce",
  ....
  "topology": {
    "node": {
      "Explicit": {
        "allowed_nodes": [
          "k8s-agentpool1-40851847-0",
          "k8s-agentpool1-40851847-1",
          "k8s-master-40851847-0",
          "k8s-agentpool1-40851847-2"
        ],
        "preferred_nodes": [
          "k8s-agentpool1-40851847-2",
          "k8s-master-40851847-0",
          "k8s-agentpool1-40851847-0",
          "k8s-agentpool1-40851847-1"
        ]
      }
    }
  }
  ...
}
```

To Reproduce: Create a cluster as above and inspect a volume's VolumeSpec entry in etcd; the sample above shows the result.

Expected behavior: Should we really capture all the nodes, or only capture the nodes where the replicas are present?
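
For a rough sense of the scale involved, here is a back-of-the-envelope sketch; the node-name length, JSON overhead and cluster/volume sizes are illustrative assumptions, not measurements of Mayastor:

```rust
// Back-of-the-envelope estimate of the extra bytes stored in etcd per
// VolumeSpec when allowed_nodes and preferred_nodes each list every node.
// All numbers below are illustrative assumptions, not measured values.
fn main() {
    let node_name_len = 25;          // e.g. "k8s-agentpool1-40851847-0"
    let json_overhead_per_entry = 3; // quotes + comma, roughly
    let per_entry = node_name_len + json_overhead_per_entry;

    for &nodes in &[4usize, 50, 200] {
        for &volumes in &[1_000usize, 10_000] {
            // Two arrays (allowed_nodes + preferred_nodes) per volume.
            let bytes = 2 * nodes * per_entry * volumes;
            println!(
                "{nodes:>4} nodes, {volumes:>6} volumes -> ~{:.1} MiB of node lists",
                bytes as f64 / (1024.0 * 1024.0)
            );
        }
    }
}
```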


tiagolobocastro commented 2 years ago

@hasethuraman I'm not sure why we're doing this; it seems we're conflating accessibility for the application with data placement, unless I'm misunderstanding. I can't think of a reason to keep it, so it's probably safe to omit these nodes until we have such a need.

hasethuraman commented 2 years ago

Thanks @tiagolobocastro. Please let me know when you have any update on the fix and timelines.

I may well be wrong with this suggestion, but instead of omitting the nodes completely, I think only the necessary nodes (where the replicas are placed) could be kept, and the rest of the nodes in those arrays omitted. This information may be helpful for an admin querying etcd/Mayastor to find the location of the replicas. Since I am not familiar with Mayastor, if my suggestion is wrong or doesn't add any value here, please ignore it.
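
A minimal sketch of that idea, using hypothetical types rather than Mayastor's actual control-plane structures: only the nodes that currently host a replica are kept in the stored lists, so the spec stays small but can still answer where the replicas live.

```rust
// Hypothetical types for illustration only; Mayastor's real VolumeSpec and
// replica specs differ. The point is to persist only replica-hosting nodes.
#[derive(Debug)]
struct ExplicitTopology {
    allowed_nodes: Vec<String>,
    preferred_nodes: Vec<String>,
}

/// Trim the stored node lists down to the nodes that actually host a replica.
fn trim_to_replica_nodes(topology: &mut ExplicitTopology, replica_nodes: &[String]) {
    topology.allowed_nodes.retain(|n| replica_nodes.contains(n));
    topology.preferred_nodes.retain(|n| replica_nodes.contains(n));
}

fn main() {
    let mut topo = ExplicitTopology {
        allowed_nodes: vec![
            "k8s-agentpool1-40851847-0".into(),
            "k8s-agentpool1-40851847-1".into(),
            "k8s-master-40851847-0".into(),
            "k8s-agentpool1-40851847-2".into(),
        ],
        preferred_nodes: vec!["k8s-agentpool1-40851847-2".into()],
    };
    // Nodes where the volume's replicas were actually placed.
    let replica_nodes = vec![
        "k8s-agentpool1-40851847-0".to_string(),
        "k8s-agentpool1-40851847-1".to_string(),
    ];
    trim_to_replica_nodes(&mut topo, &replica_nodes);
    println!("{topo:?}");
}
```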

hasethuraman commented 2 years ago

I think this would be a better way:

If topology information (the user's desired placement) is sent to the CSI driver through the StorageClass or PVC, Mayastor's VolumeSpec can have a new topology field to capture this topology info (which can be consumed in the future, for example on a cluster restart or scale-out) and avoid allowed_nodes and preferred_nodes.

If topology information is not provided, then all nodes are eligible to host a replica; allowed_nodes and preferred_nodes can still be [] and the topology key will be nil.
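
A rough sketch of how that could look, with hypothetical field names assumed for illustration (not the actual Mayastor control-plane definitions): the user-supplied topology is stored once when present, and both node arrays stay empty when it is not.

```rust
// Hypothetical shape of a slimmer VolumeSpec topology, illustrating the
// proposal above; not the actual Mayastor control-plane definition.
#[derive(Debug, Default)]
struct NodeTopology {
    /// User-supplied placement constraints from the StorageClass / PVC,
    /// kept for future use (cluster restart, scale-out).
    requested: Option<Vec<(String, String)>>, // e.g. label key/value pairs
    /// Empty unless the user explicitly restricted placement.
    allowed_nodes: Vec<String>,
    preferred_nodes: Vec<String>,
}

fn main() {
    // No topology provided: all nodes remain eligible, nothing is persisted
    // per node, so the etcd value no longer grows with cluster size.
    let no_constraints = NodeTopology::default();

    // Topology provided via StorageClass/PVC: store the user's intent once.
    let constrained = NodeTopology {
        requested: Some(vec![(
            "kubernetes.io/hostname".into(),
            "k8s-agentpool1-40851847-0".into(),
        )]),
        ..Default::default()
    };

    println!("{no_constraints:?}\n{constrained:?}");
}
```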