Open fara-tode opened 1 week ago
Hi @fara-tode, I think indeed pinning of the volume is a good idea for single replica volumes.
One thing to consider, though, is that these single-replica volumes may be moved to another node, and in that case the pinning is not desirable as it would prevent the application from starting on the other node.
Perhaps a custom scheduler is what's needed here, to ensure the application is placed together with its storage?
Probably a very bad idea, but just throwing it out there: adding a label to the nodes for every volume with a replica living there, and then we'd simply move the label to another node, thus achieving a flexible pinning. Obvious downside is lots of labels on the nodes...
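A minimal sketch of that label idea, assuming a hypothetical label key (`openebs.io/vol-<pvc-uuid>` is made up here for illustration, not an actual mayastor label):

```yaml
# Hypothetical: label the node currently hosting the replica, e.g.
#   kubectl label node node1 openebs.io/vol-<pvc-uuid>=here
# and when the replica moves, move the label with it:
#   kubectl label node node1 openebs.io/vol-<pvc-uuid>-
#   kubectl label node node2 openebs.io/vol-<pvc-uuid>=here
# The application pod then follows the label via a nodeSelector:
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  nodeSelector:
    openebs.io/vol-<pvc-uuid>: "here"
  containers:
    - name: app
      image: my-app:latest
```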
But IIRC the main issue with pinning was with multi-replica volumes: during rebuilds, replicas may move to other nodes, negating the pinning while still constraining the application. Perhaps we bring back pinning as it was, as an opt-in, but only allowed for single-replica volumes. And also disable failover for these volumes.
I don't really understand that. Do you mean that even if the volume is initially created on node1, it can move on its own to node2? If so, the expectation is that it won't for a single-replica node-bound volume.
Will appreciate that :)
Furthermore, I see that I'm achieving about 6x more IOPS when my pod and volume are running on the same node. So that is also an additional point in favour of having this binding.
> I don't really understand that. Do you mean that even if the volume is initially created on node1, it can move on its own to node2? If so, the expectation is that it won't for a single-replica node-bound volume.
You can move it by temporarily increasing the replica count and then decreasing it, though it's not that easy since today there's no way of telling which replica we prefer to remove; this will become possible with pool cordon and pool drain.
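A rough sketch of that workaround, assuming the `kubectl mayastor` plugin's `scale volume` subcommand (command names and arguments are an assumption here, and `<volume-uuid>` is a placeholder):

```shell
# Sketch only: commands are an assumption, not a verified workflow.
# 1. Add a second replica so the data is rebuilt on another node:
kubectl mayastor scale volume <volume-uuid> 2
# 2. Wait for the rebuild to complete before proceeding:
kubectl mayastor get volume <volume-uuid>
# 3. Scale back down to 1. Note: today you cannot choose which
#    replica is removed, which is the limitation mentioned above.
kubectl mayastor scale volume <volume-uuid> 1
```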
> Will appreciate that :)
:+1:
> Furthermore, I see that I'm achieving about 6x more IOPS when my pod and volume are running on the same node. So that is also an additional point in favour of having this binding.
Also, btw, ublk is on the roadmap, which will also help massively with both single-replica performance and CPU reduction, because you won't need nvme-tcp from the initiator.
Thanks for the input. Waiting for the 'local' parameter to be brought back, then.
Description
When creating a pod over a mayastor storage class with 1 replica, I want to ensure that the pod is scheduled and runs on the same node as the PV.
In theory I should be able to achieve this by defining the storage class with volumeBindingMode: WaitForFirstConsumer, which I'm using already. But that does not seem to work for mayastor: I have multiple pods running on different nodes than their mayastor PVs.
There was a local parameter in the past which seems to have done what is expected here, but it is deprecated now.
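For reference, a hedged sketch of what such a storage class could look like with both settings combined (parameter names follow older mayastor docs; the local parameter is shown only for illustration since it is deprecated):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mayastor-1-local
provisioner: io.openebs.csi-mayastor
parameters:
  repl: "1"
  protocol: "nvmf"
  local: "true"        # the deprecated parameter discussed in this issue
volumeBindingMode: WaitForFirstConsumer
```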
I don't see it as a limitation that the application pod will be scheduled only on the node hosting the actual volume. I rather see it as the expected, optimal way of running applications when you don't have dedicated storage k8s nodes but just want to run an app on the same node as its data.
Context
In a situation with 2 k8s nodes and 2 pods running a replicated application, where both pods are STS but not in the same STS group: sts1-0 was running on node1 with its mayastor PV on node2, and sts2-0 was running on node2 with its mayastor PV on node1. When either node dies, both pods/STSes stop working.
Possible Solution
Bring back the local parameter?
Screenshots