openebs / openebs

Most popular & widely deployed Open Source Container Native Storage platform for Stateful Persistent Applications on Kubernetes.
https://www.openebs.io
Apache License 2.0

Bind pv and pod to same node #3792

Open fara-tode opened 1 week ago

fara-tode commented 1 week ago

Description

When creating a pod over a Mayastor storage class with 1 replica, I want to ensure that the pod is scheduled and runs on the same node as its PV.

In theory I should be able to achieve this by setting volumeBindingMode: WaitForFirstConsumer in the StorageClass, which I'm using already. But that does not seem to really work for Mayastor: I have multiple pods that are running on different nodes than their Mayastor PVs.
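
For reference, a minimal sketch of the kind of StorageClass this refers to (the provisioner string and parameter names follow the Mayastor docs and may differ between versions, so check them against your installed release):

```yaml
# Sketch of a single-replica Mayastor StorageClass with late binding.
# Names follow the Mayastor docs; verify against the installed version.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mayastor-single-replica
provisioner: io.openebs.csi-mayastor
parameters:
  repl: "1"            # single replica, as described above
  protocol: "nvmf"     # NVMe-oF transport
  # local: "true"      # the deprecated flag quoted below
volumeBindingMode: WaitForFirstConsumer   # delay binding until the pod is scheduled
```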

There was a 'local' parameter in the past which seems to have done exactly what is expected here, but that parameter is deprecated now:

"local"
A flag of the type Boolean, with the default value of true. The flag controls the scheduling behaviour of nexus and replicas to storage nodes in the cluster.

All the following values are interpreted as "true" and anything else as "false": 'y', 'Y', 'yes', 'Yes', 'YES', 'true', 'True', 'TRUE', 'on', 'On', 'ON'.

This value must be set to true for correct operation of Mayastor provisioning and publishing. It is recommended that the volumeBindingMode in storage class be set to WaitForFirstConsumer. This limitation will be removed in a future release
A consequence of the above limitation is that applications pods which use Mayastor provisioned PVs may only be scheduled on nodes which are running Mayastor pods (i.e. data engine container). That is to say, only on MSNs. This limitaion will be removed in a future release.

I don't see it as a limitation that the application pod will be scheduled only on the node that is hosting the actual volume. I see it rather as the expected and optimal way of running applications when you don't have dedicated storage Kubernetes nodes, but just want to run an app on the same node as its data.

Context

Consider a situation with 2 Kubernetes nodes and 2 pods running a replicated application. Both pods are StatefulSets, but not in the same StatefulSet group. sts1-0 was running on node1 with its Mayastor PV on node2; sts2-0 was running on node2 with its Mayastor PV on node1. When either node dies, both pods/StatefulSets stop working.

Possible Solution

Bring back the 'local' parameter?


tiagolobocastro commented 1 week ago

Hi @fara-tode, I think pinning the volume is indeed a good idea for single-replica volumes.

One thing to consider, though, is that these single-replica volumes may be moved to another node, and in this case the pinning is not desirable, as it would prevent the application from starting on another node.

Perhaps a custom scheduler is what's needed here, to ensure the application is placed together with the storage?

Probably a very bad idea, but just throwing it out there: add a label to the nodes for every volume with a replica living there, and then we'd simply move the label to another node, thus achieving a flexible pinning (see the sketch below). The obvious downside is lots of labels on the nodes...
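
Purely as an illustration of that idea, something like this on the application side (the label key below is hypothetical, not something Mayastor sets today):

```yaml
# Hypothetical illustration only: assume the control plane kept a label such as
# openebs.io/volume-<uuid>=true on whichever node currently hosts the replica.
# An application pod could then follow the replica with required node affinity.
apiVersion: v1
kind: Pod
metadata:
  name: app-pinned-to-its-volume
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: openebs.io/volume-0123abcd   # made-up label key for the example
                operator: In
                values: ["true"]
  containers:
    - name: app
      image: nginx   # placeholder workload
```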

But IIRC the main issue with pinning was with multi-replica volumes: during rebuilds, replicas may move to other nodes, thus negating the pinning and yet still constraining the application. Perhaps we bring back pinning as it was, as an opt-in, but only allowed for single-replica volumes, and also disable failover for these volumes.

todeb commented 1 week ago

> One thing to consider, though, is that these single-replica volumes may be moved to another node, and in this case the pinning is not desirable, as it would prevent the application from starting on another node.

I don't really understand that. Do you mean that even though the initial volume is created on node1, it can move on its own to node2? If so, the expectation is that it won't move when a single-replica volume is bound to a node.

> Perhaps we bring back pinning as it was, as an opt-in, but only allowed for single-replica volumes, and also disable failover for these volumes.

will appreciate that :)

Furthermore, I see that I'm achieving about 6x more IOPS if my pod and volume are running on the same node, so that is an additional point in favour of having this binding.

tiagolobocastro commented 1 week ago

> I don't really understand that. Do you mean that even though the initial volume is created on node1, it can move on its own to node2? If so, the expectation is that it won't move when a single-replica volume is bound to a node.

You can move it by temporarily increasing the replica count and then decreasing it, though it's not that easy, since today there's no way of telling which replica we prefer to remove; this will become possible with pool cordon and pool drain.

> will appreciate that :)

:+1:

> Furthermore, I see that I'm achieving about 6x more IOPS if my pod and volume are running on the same node, so that is an additional point in favour of having this binding.

Also, by the way, ublk is on the roadmap; that will also help massively with both single-replica performance and CPU reduction, because you won't need nvme-tcp from the initiator.

todeb commented 1 week ago

Thanks for the input. Waiting for the 'local' parameter to be brought back, then.