telekom / das-schiff-network-operator

Configure netlink interfaces, simple eBPF filters and FRR using Kubernetes resources.
Apache License 2.0

Gradual rollout #110

Open p-strusiewiczsurmacki-mobica opened 3 months ago

p-strusiewiczsurmacki-mobica commented 3 months ago

This PR implements gradual rollout as described in #98

There are 2 new CRDs added:

NodeConfig - represents the configuration of a single node.

NodeConfigProcess - represents the global state of the configuration process (can be provisioning or provisioned). This is used to check whether the previous leader failed in the middle of the configuration process; if so, backups are restored.
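For reference, a minimal sketch of what the two new CRD types might look like in Go, assuming the usual kubebuilder conventions; all field names below are illustrative, not taken from this PR:

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// NodeConfig holds the combined network configuration for a single node.
type NodeConfig struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   NodeConfigSpec   `json:"spec,omitempty"`
	Status NodeConfigStatus `json:"status,omitempty"`
}

// NodeConfigSpec is the per-node view of the cluster-wide resources
// (vrfrouteconfigurations, layer2networkconfigurations, routingtables).
type NodeConfigSpec struct {
	VRFRouteConfigurations      []string `json:"vrfRouteConfigurations,omitempty"`
	Layer2NetworkConfigurations []string `json:"layer2NetworkConfigurations,omitempty"`
	RoutingTables               []string `json:"routingTables,omitempty"`
}

// NodeConfigStatus reports how the worker handled the config:
// "provisioning", "provisioned" or "invalid".
type NodeConfigStatus struct {
	ConfigStatus string `json:"configStatus,omitempty"`
}

// NodeConfigProcess tracks the global rollout state
// ("provisioning" or "provisioned").
type NodeConfigProcess struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec NodeConfigProcessSpec `json:"spec,omitempty"`
}

type NodeConfigProcessSpec struct {
	State string `json:"state,omitempty"`
}
```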

A new pod was added - network-operator-configurator. This pod (a daemonset) is responsible for fetching vrfrouteconfigurations, layer2networkconfigurations and routingtables and combining those into a NodeConfig for each node.
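A rough sketch (not the PR's actual code) of how the configurator could build those per-node objects, using the types sketched above; the import path and the omitted per-node selector matching are assumptions:

```go
package configurator

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Assumed import path for the CRD types sketched above.
	v1alpha1 "github.com/telekom/das-schiff-network-operator/api/v1alpha1"
)

// buildNodeConfigs lists the nodes and produces one desired NodeConfig each.
func buildNodeConfigs(ctx context.Context, c client.Client) (map[string]*v1alpha1.NodeConfig, error) {
	var nodes corev1.NodeList
	if err := c.List(ctx, &nodes); err != nil {
		return nil, err
	}
	// The vrfrouteconfigurations, layer2networkconfigurations and routingtables
	// lists would be fetched here and filtered per node (omitted for brevity).
	configs := make(map[string]*v1alpha1.NodeConfig, len(nodes.Items))
	for i := range nodes.Items {
		cfg := &v1alpha1.NodeConfig{}
		cfg.Name = nodes.Items[i].Name
		configs[cfg.Name] = cfg
	}
	return configs, nil
}
```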

The network-operator-worker pod, instead of fetching the separate config resources, will now fetch only the NodeConfig. After configuration is done and connectivity is checked, it will back up the config on disk. If connectivity is lost after deploying the new config, the configuration will be restored using the local backup.
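A minimal sketch of that apply / verify-connectivity / rollback flow on the worker side; the interface and file handling are assumptions, not the PR's actual code:

```go
package worker

import (
	"context"
	"fmt"
	"os"
)

// NodeConfigurator abstracts the actual netlink/FRR configuration steps.
type NodeConfigurator interface {
	Apply(ctx context.Context, cfg []byte) error
	CheckConnectivity(ctx context.Context) error
}

// applyWithRollback applies a new config, verifies connectivity and falls
// back to the on-disk backup if anything goes wrong.
func applyWithRollback(ctx context.Context, nc NodeConfigurator, newCfg []byte, backupPath string) error {
	if err := nc.Apply(ctx, newCfg); err == nil {
		if err := nc.CheckConnectivity(ctx); err == nil {
			// Success: persist the new config as the local backup.
			return os.WriteFile(backupPath, newCfg, 0o600)
		}
	}
	// Apply failed or connectivity was lost: restore the previous config.
	backup, err := os.ReadFile(backupPath)
	if err != nil {
		return fmt.Errorf("no usable backup: %w", err)
	}
	return nc.Apply(ctx, backup)
}
```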

For each node there can be 3 NodeConfig objects created:

    • the currently deployed config,
    • a backup config (created with the -backup suffix), used to revert the node,
    • the last known invalid config, used to avoid redeploying a config that already failed.

How does it work:

  1. network-operator-configurator starts and leader election takes place.
  2. The leader checks NodeConfigProcess: if any config is in the invalid or provisioning state, the previous leader may have died in the middle of the configuration process, so the leader reverts the configuration for all nodes using the backup configuration.
  3. When the user deploys vrfrouteconfigurations, layer2networkconfigurations and/or routingtables objects, the configurator will:
    • combine those into a separate NodeConfig for each node
    • set the NodeConfigProcess state to provisioning
  4. Configurator checks the new configs against known invalid configs. If any new config is equal to at least one known invalid config, the deployment is aborted.
  5. Configurator backs up the current config as -backup and deploys the new config with status provisioning.
  6. network-operator-worker fetches the new config and configures the node. It checks connectivity and:
    • if it is OK, it stores a backup on disk and updates the status of the config to provisioned
    • if connectivity was lost, it restores the configuration from the local backup and (if possible) updates the config status to invalid.
  7. Configurator waits for the outcome of the config provisioning by checking the config status (see the sketch after this list):
    • if the status was set by the worker to provisioned - it proceeds with deploying the next node(s).
    • if the status was set to invalid - it aborts the deployment and reverts the changes on all the nodes that were changed in this iteration.
    • if it times out (e.g. the node was unable to update the config state for some reason) - it invalidates the config and reverts the changes on all nodes.
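To illustrate step 7, here is a sketch of the wait-for-outcome loop the configurator could run per node; the status strings follow the description above, but the polling interval, timeout handling and import path are assumptions:

```go
package configurator

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Assumed import path for the NodeConfig type.
	v1alpha1 "github.com/telekom/das-schiff-network-operator/api/v1alpha1"
)

// waitForNode polls the NodeConfig status until the worker reports an outcome.
func waitForNode(ctx context.Context, c client.Client, name string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		var cfg v1alpha1.NodeConfig
		if err := c.Get(ctx, types.NamespacedName{Name: name}, &cfg); err != nil {
			return err
		}
		switch cfg.Status.ConfigStatus {
		case "provisioned":
			return nil // worker confirmed the config, deploy the next node(s)
		case "invalid":
			return fmt.Errorf("node %s reported an invalid config", name)
		}
		time.Sleep(5 * time.Second)
	}
	// Timeout: the caller invalidates the config and reverts all changed nodes.
	return fmt.Errorf("timed out waiting for node %s", name)
}
```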

The configurator can be set to update more than one node concurrently. The number of nodes updated concurrently can be set using the update-limit configuration flag (defaults to 1).
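A small sketch of how such a limit could drive the rollout in batches; the flag name follows the description above, the batching logic itself is illustrative:

```go
package configurator

import (
	"flag"
	"sync"
)

var updateLimit = flag.Int("update-limit", 1, "number of nodes updated concurrently")

// deployInBatches updates at most *updateLimit nodes at a time and aborts the
// rollout on the first failed batch (the caller then reverts changed nodes).
func deployInBatches(nodes []string, deploy func(node string) error) error {
	for start := 0; start < len(nodes); start += *updateLimit {
		end := start + *updateLimit
		if end > len(nodes) {
			end = len(nodes)
		}
		var wg sync.WaitGroup
		errs := make(chan error, end-start)
		for _, n := range nodes[start:end] {
			wg.Add(1)
			go func(node string) {
				defer wg.Done()
				errs <- deploy(node)
			}(n)
		}
		wg.Wait()
		close(errs)
		for err := range errs {
			if err != nil {
				return err
			}
		}
	}
	return nil
}
```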

p-strusiewiczsurmacki-mobica commented 2 months ago

@chdxD1

Also, what happens when a node joins(!) or leaves the cluster? Especially when the joining or leaving happens in the middle of a rollout?

If a node joins the cluster, it should be configured in the next reconciliation loop iteration (I think, I will check to be sure). But on node leave, the config will be tagged as invalid (as it should time out) and the configuration will be aborted. I'll try to fix that.

chdxD1 commented 2 months ago

https://github.com/telekom/das-schiff-network-operator/pull/110/files#diff-a96964d7107a0881e826cd2f6ac71e901d612df880362eb04cb0b54eb609b8e5L70-L90

It would be nice to watch node events in the central manager to create that NodeConfig before the next reconcile loop.
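For illustration, one way to wire such a Node watch, assuming a recent controller-runtime; the reconciler type, the name-based mapping and the import path are hypothetical, not taken from this PR:

```go
package configurator

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	// Assumed import path for the NodeConfig type.
	v1alpha1 "github.com/telekom/das-schiff-network-operator/api/v1alpha1"
)

type ConfigReconciler struct {
	client.Client
}

func (r *ConfigReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ...create or update the NodeConfig for req.Name here...
	return ctrl.Result{}, nil
}

// SetupWithManager also watches Node events, so a joining node gets its
// NodeConfig before the next periodic reconcile.
func (r *ConfigReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&v1alpha1.NodeConfig{}).
		Watches(
			&corev1.Node{},
			handler.EnqueueRequestsFromMapFunc(func(ctx context.Context, o client.Object) []reconcile.Request {
				// Map a Node event to the NodeConfig with the same name.
				return []reconcile.Request{
					{NamespacedName: types.NamespacedName{Name: o.GetName()}},
				}
			}),
		).
		Complete(r)
}
```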

p-strusiewiczsurmacki-mobica commented 2 months ago

@chdxD1 I had to make tons of changes, but I think the code should be much better now. Each config now has an owner reference set to its node, so whenever a node is removed, all of its NodeConfig objects should be removed automatically as well. As for added nodes, I've introduced a node_reconciler which watches nodes; whenever a node is added, it notifies the config_manager so it can trigger updates as soon as possible. It also tags deleted nodes as 'inactive' so those can be skipped if, for example, a node was deleted during the config deployment process.
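As a sketch, setting that owner reference with controller-runtime's helper could look roughly like this (the function name and wiring are illustrative):

```go
package configurator

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// setNodeOwner marks the node as owner of cfg (e.g. a NodeConfig object), so
// Kubernetes garbage collection deletes cfg when the node is deleted.
func setNodeOwner(cfg client.Object, node *corev1.Node, scheme *runtime.Scheme) error {
	return controllerutil.SetOwnerReference(node, cfg, scheme)
}
```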