chdxD1 opened this issue 4 months ago
I assume something like this will be needed (a `healthcheck` package, I think). Other ideas are: a `network-operator-master` pod; leader election could be used to elect a leader that performs the role of the central pod. The problem would be how to revert changes on a previously configured pod if the leader configures itself with an invalid config.

@p-strusiewiczsurmacki-mobica I have a few comments on the steps:
1. VRFs and L2VNIs are currently not rolled out on control-plane nodes, so it might make sense. However, it might be meaningful to have a separate set of controllers (with integrated leader election) that act as a "control plane" for the configuration.
2. I am not so sure about the gRPC endpoint. There are multiple ways to do it, of course; one is gRPC, the other would be decoupling it by using custom resources for the individual node configuration as well. The pod running on the node would watch for changes of its local `NodeConfiguration` and report the rollout status in the status fields. If a node disappears (is deleted from the cluster), the controller should also clean up leftover `NodeConfigurations`. The config could look like this (a Go sketch of the same resource follows after this list):
```yaml
apiVersion: network.schiff.telekom.de/v1alpha1
kind: NodeConfiguration
metadata:
  name: <node-name>
spec:
  vrfs: {}
  l2vnis: {}
  [...]
```
3. The network operator can save the working configuration on the local disk.
4. Yes, mostly implemented there; it might be a good idea to check e.g. the API server as well.
5. Yes.
6. See remarks regarding the status field in 2.
7. Yes.
8. Yes.
9. Yes.
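To make the `NodeConfiguration` idea from point 2 concrete, here is a minimal sketch as Kubebuilder-style Go types. Every field name and status value below is an illustrative assumption, not the actual network-operator API:

```go
// Minimal sketch of the NodeConfiguration resource from point 2 as
// Kubebuilder-style Go types. Every field name and status value here is
// an illustrative assumption, not the actual network-operator API.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// NodeConfigurationSpec carries the full configuration rendered for one
// node; the object itself is named after the node it targets.
type NodeConfigurationSpec struct {
	VRFs   []string `json:"vrfs,omitempty"`   // placeholder for real VRF structs
	L2VNIs []string `json:"l2vnis,omitempty"` // placeholder for real L2VNI structs
}

// NodeConfigurationStatus is written by the operator pod on the node and
// is what the central controller waits for during a rollout.
type NodeConfigurationStatus struct {
	ConfigStatus string `json:"configStatus,omitempty"` // e.g. "valid" or "invalid"
	Message      string `json:"message,omitempty"`
}

// NodeConfiguration is the per-node resource from the YAML example above.
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
type NodeConfiguration struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   NodeConfigurationSpec   `json:"spec,omitempty"`
	Status NodeConfigurationStatus `json:"status,omitempty"`
}
```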
Regarding the other ideas:
@chdxD1 Just 2 more questions before I try to incorporate your comments into this:

1. Should the `NodeConfiguration` resource be created by the user, or should it be created by the controller out of the `vrfrouteconfigurations`, `layer2networkconfigurations` and `routingtables` resources? I assume the latter, but I just want to be sure.
2. Assuming `NodeConfiguration` will be created by the controller: let's say the user deploys a `vrfrouteconfigurations` resource with a selector that selects e.g. 3 nodes. The controller creates a `NodeConfiguration` for all 3 nodes; 2 of those nodes report success, but one fails. Should we revert the changes on the successful nodes?

EDIT: I've made this diagram quickly:
1. `network-operator-config-controller` pods watch the `vrfrouteconfigurations`, `layer2networkconfigurations` and `routingtables` resources.
2. The controller creates a `NodeConfiguration` for the first node, `worker-1`.
3. The network-operator on the worker watches its `NodeConfiguration` object.
4a. It gets the new config if available.
4b. After configuring the node it sets the `status` of the `NodeConfiguration` object to `valid` or `invalid`.
5. The controller watches the `NodeConfiguration` object. Once the `status` of the `NodeConfiguration` object is set, the node was processed, so the next `NodeConfiguration` object (for `worker-2`) is created.
6. The controller deletes `NodeConfiguration` objects if necessary (e.g. when a node was removed from the cluster).
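A rough Go sketch of this loop (steps 2-5), with stand-in types and function fields instead of the real Kubernetes plumbing; none of the names below come from the actual codebase:

```go
// Rough sketch of the diagram above: deploy one NodeConfiguration at a
// time and only move on once the node has reported a status. All types
// and helpers here are stand-ins, not the actual network-operator code.
package rollout

import (
	"context"
	"fmt"
)

// NodeConfig is a stand-in for the rendered NodeConfiguration resource.
type NodeConfig struct {
	Node string
	Spec string
}

// ConfigController is a stand-in for the central controller. The function
// fields abstract away the Kubernetes client plumbing.
type ConfigController struct {
	Render         func(node string) NodeConfig
	CreateOrUpdate func(ctx context.Context, cfg NodeConfig) error
	WaitForStatus  func(ctx context.Context, node string) (string, error)
}

// Rollout walks the nodes one at a time (steps 2-5) and stops as soon as
// a node reports an invalid configuration.
func (c *ConfigController) Rollout(ctx context.Context, nodes []string) error {
	for _, node := range nodes {
		cfg := c.Render(node) // built from vrfrouteconfigurations etc. (step 1)
		if err := c.CreateOrUpdate(ctx, cfg); err != nil {
			return err
		}
		// Step 5: block until the node's operator has set the status field.
		status, err := c.WaitForStatus(ctx, node)
		if err != nil {
			return err
		}
		if status == "invalid" {
			// Stop the rollout; whether to also revert the nodes that have
			// already succeeded is the open question from the comment above.
			return fmt.Errorf("node %s reported an invalid configuration", node)
		}
	}
	return nil
}
```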
@chdxD1 I've started working on this on Monday. I've created a draft PR so you can take a quick look if you have some spare time and tell me if I'm going in a good direction with this.
It is implemented mostly as described in my previous comment. Right now I need to make the `network-operator-config-controller` wait before it deploys the next config (so, basically, to do that gradually), and add some checks to be sure that the `NodeConfig` was really changed and should be deployed at all.
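For the changed-or-not check, one simple approach would be hashing the rendered spec and comparing digests; a sketch under that assumption (the real controller could just as well use a semantic deep-equal on the API objects):

```go
// One simple way to detect whether a rendered NodeConfig actually changed:
// hash the rendered spec and compare it with the hash of what the node is
// already running. This is a sketch, not the actual implementation.
package rollout

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// SpecHash returns a stable digest of a rendered spec (json.Marshal sorts
// map keys, so equal data yields equal digests).
func SpecHash(spec any) (string, error) {
	raw, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", sha256.Sum256(raw)), nil
}

// Changed reports whether the freshly rendered spec differs from the one
// that is currently deployed on the node.
func Changed(deployed, rendered any) (bool, error) {
	a, err := SpecHash(deployed)
	if err != nil {
		return false, err
	}
	b, err := SpecHash(rendered)
	if err != nil {
		return false, err
	}
	return a != b, nil
}
```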
I have an additional question as well: what should we do with the mounted config file? Should it stay the way it is now, or should it also be part of the `NodeConfig`?
Regarding the mounted config file: it might become a CRD instead of a ConfigMap; however, it will be the same for all nodes.
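If it does become a CRD, a cluster-scoped object would match the "same for all nodes" property; a minimal sketch with assumed type and field names:

```go
// Sketch of the mounted config file lifted into a cluster-scoped CRD so
// that a single object serves all nodes. Type and field names are
// assumptions for illustration only.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// +kubebuilder:object:root=true
// +kubebuilder:resource:scope=Cluster
type OperatorConfig struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec OperatorConfigSpec `json:"spec,omitempty"`
}

// OperatorConfigSpec would hold whatever currently lives in the mounted
// file; a single placeholder field stands in for the real content.
type OperatorConfigSpec struct {
	Raw string `json:"raw,omitempty"` // placeholder
}
```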
OK, I think I've got most of it by now.
Before updating the config, the node configuration controller gets all currently deployed configs, and if any node fails, it reverts the changes on all the nodes. Currently it just stores that in memory, but I am wondering whether those should be saved as Kubernetes objects, in case the leader changes amidst the config update, so that the new leader can revert the changes if required.
I have just one more issue: the controller creates a `NodeConfig` using `vrfrouteconfigurations`, `layer2networkconfigurations` and `routingtables`. Now, if a `NodeConfig` is invalid it will get reverted to the last known working config, but the controller will still try to use the existing `vrfrouteconfigurations`, `layer2networkconfigurations` and `routingtables`, which will result in the invalid config being created and deployed over and over again.
The only workaround I can think of is to store the last invalid configs as objects named e.g. `<nodename>-invalid` (as I currently store the current configs under `<nodename>`), and then check each new config against it; if the new config results in one or more invalid configs, cease its deployment. Eventually those configs could be stored in a new CRD named `InvalidNodeConfig` for better clarity.
It's not really clean, but it should work, I think. What's your opinion?
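A small Go sketch of that guard, with a hypothetical lookup standing in for fetching the `<nodename>-invalid` object:

```go
// Sketch of the proposed guard against redeploying a known-bad config:
// before deploying, compare the freshly rendered spec with the last spec
// that failed on that node (stored e.g. as a "<nodename>-invalid" object
// or an InvalidNodeConfig CRD). Names and the lookup are hypothetical.
package rollout

// InvalidLookup fetches the last known invalid spec for a node; found is
// false if the node never rejected a config.
type InvalidLookup func(node string) (spec string, found bool, err error)

// ShouldDeploy returns false when the rendered spec is identical to the
// last spec marked invalid on that node, breaking the "render, deploy,
// fail, revert, render again" loop described above.
func ShouldDeploy(node, renderedSpec string, lookup InvalidLookup) (bool, error) {
	invalidSpec, found, err := lookup(node)
	if err != nil {
		return false, err
	}
	if !found {
		return true, nil
	}
	return renderedSpec != invalidSpec, nil
}
```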
In the current design the network-operators run independently from each other. Custom resources are read from the cluster and configured on the local node.
If a user applies a faulty configuration, this could render all nodes inoperable at roughly the same time. There is no rollback mechanism in place that reverts faulty configuration, and no gradual rollout of configuration.
I propose a 2-step approach:

1. Each network-operator on the node should be individually capable of rolling back the configuration to a previous, working state.
2. A central controller renders each node configuration individually (and in a gradual fashion), with each node reporting the rollout result in a `.status` part; this allows the controller to respond to failed node configurations by stopping any further rollout until the input resources are changed again.

Because the connection to the API server might be impacted by a faulty rollout, step two depends on step one.
For step two: the controller (or the K8s API server, via ownership relations) should also clean up node configurations (which can be written as a dedicated CRD) when a node leaves the cluster. As we are heavy users of Cluster API, this is required to not clog the API server with unnecessary resources.
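For step one, a minimal sketch of the node-local persistence a rollback could build on; the path and format are assumptions:

```go
// Sketch of step one: the operator on each node keeps the last config
// that passed its health checks on local disk, so it can roll back even
// when the API server is unreachable. Path and format are assumptions.
package rollout

import (
	"os"
)

const knownGoodPath = "/var/lib/network-operator/last-known-good.json"

// SaveKnownGood persists a config that passed all health checks. It
// writes to a temp file first and renames, so a crash mid-write cannot
// corrupt the backup.
func SaveKnownGood(raw []byte) error {
	tmp := knownGoodPath + ".tmp"
	if err := os.WriteFile(tmp, raw, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, knownGoodPath)
}

// LoadKnownGood returns the config to reapply after a failed rollout.
func LoadKnownGood() ([]byte, error) {
	return os.ReadFile(knownGoodPath)
}
```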