Currently we only support autoscaling of clusters managed by Stackable in fairly trivial scenarios.
We do support deploying HPAs for scaling StatefulSets, but as soon as actions need to run before or after scaling, we start running into issues. The reason is that there are no pre- or post-scale hooks that operators could tie into to:
trigger scaling workflows
delay actual scaling until workflows have finished
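To make the idea of such hooks more concrete, here is a minimal sketch of what an interface could look like. Nothing in it exists in operator-rs today: `ScaleRequest`, `HookOutcome`, `ScalingHooks`, `replicas_to_apply` and `CountdownHooks` are all hypothetical names chosen for illustration. The point is only that the replica count stays unchanged until the pre-scale workflow reports that it has finished.

```rust
/// Hypothetical description of a requested replica change.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ScaleRequest {
    pub current: u32,
    pub desired: u32,
}

/// Hypothetical result of running a hook during one reconcile pass.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HookOutcome {
    /// The workflow has finished, scaling may proceed.
    Done,
    /// The workflow is still running, ask again on the next reconcile.
    InProgress,
}

/// Hypothetical hook interface that a product operator would implement.
pub trait ScalingHooks {
    fn pre_scale(&mut self, request: ScaleRequest) -> HookOutcome;
    fn post_scale(&mut self, request: ScaleRequest) -> HookOutcome;
}

/// Returns the replica count that should be written to the StatefulSet
/// right now: the desired count once the pre-scale hook is done,
/// otherwise the current count (scaling is delayed).
pub fn replicas_to_apply<H: ScalingHooks>(hooks: &mut H, request: ScaleRequest) -> u32 {
    if request.current == request.desired {
        return request.current;
    }
    match hooks.pre_scale(request) {
        HookOutcome::Done => request.desired,
        HookOutcome::InProgress => request.current,
    }
}

/// Toy hook implementation: the pre-scale workflow needs a fixed number
/// of reconcile passes before it reports completion.
pub struct CountdownHooks {
    pub remaining: u32,
}

impl ScalingHooks for CountdownHooks {
    fn pre_scale(&mut self, _request: ScaleRequest) -> HookOutcome {
        if self.remaining == 0 {
            HookOutcome::Done
        } else {
            self.remaining -= 1;
            HookOutcome::InProgress
        }
    }

    fn post_scale(&mut self, _request: ScaleRequest) -> HookOutcome {
        HookOutcome::Done
    }
}
```

With `CountdownHooks { remaining: 1 }` and a request to go from 3 to 2 replicas, the first call to `replicas_to_apply` keeps the count at 3 and the second one returns 2. A real implementation would presumably be async and would have to persist workflow state somewhere, which this sketch leaves out.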
Actual example
When scaling down a NiFi cluster, a node has to be decommissioned and its flowfiles offloaded before the node is torn down; otherwise any data on that node is lost forever.
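The ordering constraint in this example can be sketched as a small state machine. This is a simplified model and not the NiFi API: `NodeState`, `next_state` and `safe_to_remove` are hypothetical names, and the queued flowfile count is assumed to be something the operator can observe on each reconcile.

```rust
/// Hypothetical, simplified lifecycle of a node that is being scaled away.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeState {
    /// The node is a regular cluster member.
    Active,
    /// Decommissioning has started, flowfiles are still being moved away.
    Offloading { queued_flowfiles: u64 },
    /// No flowfiles are left on the node.
    Offloaded,
}

/// Advances the node by one reconcile pass, given the number of flowfiles
/// currently observed on the node (assumed to be available to the operator).
pub fn next_state(state: NodeState, queued_flowfiles: u64) -> NodeState {
    match state {
        NodeState::Active => NodeState::Offloading { queued_flowfiles },
        NodeState::Offloading { .. } if queued_flowfiles == 0 => NodeState::Offloaded,
        NodeState::Offloading { .. } => NodeState::Offloading { queued_flowfiles },
        NodeState::Offloaded => NodeState::Offloaded,
    }
}

/// The pod may only be torn down once the node holds no more data.
pub fn safe_to_remove(state: NodeState) -> bool {
    state == NodeState::Offloaded
}
```

An operator would keep the StatefulSet replica count unchanged for as long as `safe_to_remove` returns `false` for the node that is about to go away.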
During the recent on-site in Munich I presented the current state and an idea of how this could be implemented, slides used: Autoscaling @ Stackable.pdf
The scope of this ticket is to get the fundamental work in operator-rs done, so that functionality exists that operators can call to determine whether scaling is necessary, and that gives the operator appropriate control to run workloads before or after the actual scaling happens.
Ideally we'll also implement helper functions for the actual scaling, i.e. propagating replica count to the StatefulSets, when appropriate.
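As a rough sketch of what "determine if scaling is necessary" could return, the helper below compares the current and desired replica counts and lists the pod ordinals that would be affected. `ScalePlan` and `plan_scale` are hypothetical names. The sketch assumes the default StatefulSet ordinal numbering starting at 0 and relies on the fact that StatefulSets remove the pods with the highest ordinals first when scaling down.

```rust
use std::cmp::Ordering;

/// Hypothetical result of comparing current and desired replica counts.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ScalePlan {
    /// Replica counts match, nothing to do.
    NoChange,
    /// Pods with these ordinals would be created, in creation order.
    Up { new_ordinals: Vec<u32> },
    /// Pods with these ordinals would be removed, highest ordinal first.
    Down { removed_ordinals: Vec<u32> },
}

/// Determines whether scaling is necessary and which pods it would affect.
/// Assumes ordinals start at 0 (the StatefulSet default).
pub fn plan_scale(current: u32, desired: u32) -> ScalePlan {
    match desired.cmp(&current) {
        Ordering::Equal => ScalePlan::NoChange,
        Ordering::Greater => ScalePlan::Up {
            new_ordinals: (current..desired).collect(),
        },
        Ordering::Less => ScalePlan::Down {
            removed_ordinals: (desired..current).rev().collect(),
        },
    }
}
```

For example, `plan_scale(5, 3)` yields `Down { removed_ordinals: vec![4, 3] }`, which tells the operator exactly which nodes need to be prepared before the new replica count is propagated to the StatefulSet.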
Requirements
[M] Allow us to control the scaling / run code before scaling happens
delay scaling until we are done!
[S] Use existing scaling mechanisms
[S] Disallow ambiguous configuration (operator and autoscaler fighting with each other)
[C] Isolate as much as possible into operator-rs and have operators just define necessary scaling steps
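The "disallow ambiguous configuration" requirement could be enforced with a validation step along the following lines. This is only a sketch under assumptions: the field names, the `ScalingConfig` shape and the error variants are hypothetical and do not reflect any existing Stackable CRD. Treating a configuration with neither a replica count nor an autoscaler as an error is likewise just one possible choice.

```rust
/// Hypothetical autoscaler bounds for a role group.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct AutoscalerBounds {
    pub min_replicas: u32,
    pub max_replicas: u32,
}

/// Hypothetical scaling section of a role group definition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ScalingConfig {
    pub replicas: Option<u32>,
    pub autoscaler: Option<AutoscalerBounds>,
}

/// Who is allowed to set the replica count.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReplicaSource {
    Fixed(u32),
    Autoscaler(AutoscalerBounds),
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConfigError {
    /// Both a fixed replica count and an autoscaler were configured.
    Ambiguous,
    /// Neither was configured (rejected here by assumption).
    Missing,
    /// The minimum replica count exceeds the maximum.
    InvalidBounds,
}

/// Accepts exactly one source of truth for the replica count, so that the
/// operator and the autoscaler never fight over the same field.
pub fn replica_source(config: ScalingConfig) -> Result<ReplicaSource, ConfigError> {
    match (config.replicas, config.autoscaler) {
        (Some(_), Some(_)) => Err(ConfigError::Ambiguous),
        (None, None) => Err(ConfigError::Missing),
        (Some(n), None) => Ok(ReplicaSource::Fixed(n)),
        (None, Some(b)) if b.min_replicas > b.max_replicas => Err(ConfigError::InvalidBounds),
        (None, Some(b)) => Ok(ReplicaSource::Autoscaler(b)),
    }
}
```

Rejecting the ambiguous case up front means the rest of the scaling logic only ever has to deal with a single owner of the replica count.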