Open p-strusiewiczsurmacki-mobica opened 3 months ago
@chdxD1
Also, what happens when a node joins(!) or leaves the cluster? Especially when the joining or leaving happens in the middle of a rollout?
If a node joins the cluster, it should be configured in the next reconciliation loop iteration (I think; I will check that to be sure). But when a node leaves, its config will be tagged as invalid (as it should time out) and the configuration will be aborted. I'll try to fix that.
It would be nice to watch node events in the central manager to create that nodeconfig before the next reconcile loop.
@chdxD1
I had to make tons of changes, but I think the code should be much better now.
Each config now has an owner reference set to its node, so whenever a node is removed, all of its `NodeConfig` objects should be removed automatically as well.
As for added nodes, I've introduced a `node_reconciler` which watches for nodes; whenever a node is added, it notifies `config_manager` so it can trigger updates as soon as possible. It also tags deleted nodes as 'inactive' so those can be skipped if, for example, a node was deleted during the config deployment process.
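To illustrate the node handling described above, here is a minimal in-memory sketch. It is not the code from this PR: `NodeTracker`, `OnAdded`, `OnDeleted`, `ActiveNodes` and `Added` are hypothetical names, and the real `node_reconciler` would receive events from the Kubernetes API rather than direct method calls. Removal of `NodeConfig` objects through owner references is handled by Kubernetes garbage collection and is not modeled here.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// NodeTracker is a hypothetical stand-in for the node-watching logic:
// it remembers known nodes and whether each one is still active.
type NodeTracker struct {
	mu     sync.Mutex
	active map[string]bool // node name -> still part of the cluster
	added  chan string     // notifications about newly added nodes
}

func NewNodeTracker(buffer int) *NodeTracker {
	return &NodeTracker{active: map[string]bool{}, added: make(chan string, buffer)}
}

// OnAdded records the node as active and notifies the consumer once.
// The send never blocks: if the buffer is full the notification is dropped,
// on the assumption that a later periodic reconcile picks the node up.
func (t *NodeTracker) OnAdded(name string) {
	t.mu.Lock()
	wasActive := t.active[name]
	t.active[name] = true
	t.mu.Unlock()
	if wasActive {
		return
	}
	select {
	case t.added <- name:
	default:
	}
}

// OnDeleted tags a known node as inactive so an in-flight rollout can skip it.
func (t *NodeTracker) OnDeleted(name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, known := t.active[name]; known {
		t.active[name] = false
	}
}

// ActiveNodes returns the sorted names of nodes that are still active.
func (t *NodeTracker) ActiveNodes() []string {
	t.mu.Lock()
	defer t.mu.Unlock()
	var names []string
	for name, isActive := range t.active {
		if isActive {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

// Added exposes the notification channel to the consumer.
func (t *NodeTracker) Added() <-chan string { return t.added }

func main() {
	tracker := NewNodeTracker(8)
	tracker.OnAdded("node-a")
	tracker.OnAdded("node-b")
	tracker.OnDeleted("node-a") // node-a left while a rollout was in progress
	fmt.Println("active nodes:", tracker.ActiveNodes())
}
```

A rollout loop would consult `ActiveNodes` before each step, so a node deleted mid-deployment is skipped instead of timing out and invalidating the whole rollout.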
This PR implements gradual rollout as described in #98
There are 2 new CRDs added:
- `NodeConfig` - which represents node configuration
- `NodeConfigProcess` - which represents the global state of the configuration (can be `provisioning` or `provisioned`). This is used to check whether the previous leader failed in the middle of the configuration process. If so, backups are restored.

New pod added - `network-operator-configurator` - this pod (daemonset) is responsible for fetching `vrfrouteconfigurations`, `layer2networkconfigurations` and `routingtables` and combining those into `NodeConfig` for each node.

`network-operator-worker` pod, instead of fetching separate config resources, will now only fetch `NodeConfig`. After configuration is done and connectivity is checked, it will back up the config on disk. If connectivity is lost after deploying a new config, the configuration will be restored using the local backup.

For each node there can be 3 `NodeConfig` objects created:
- `<nodename>` - current configuration
- `<nodename>-backup` - backup configuration
- `<nodename>-invalid` - last known invalid configuration

How does it work:
- `network-operator-configurator` starts and leader election takes place.
- The leader checks `NodeConfigProcess` and whether any config is in the `invalid` or `provisioning` state, to detect whether the previous leader died in the middle of the configuration process. If so, it will revert the configuration for all the nodes using the backup configuration.
- On a change to a `vrfrouteconfigurations`, `layer2networkconfigurations` and/or `routingtables` object, `configurator` will:
  - create `NodeConfig` for each node
  - set `NodeConfigProcess` state to `provisioning`
- A newly deployed `NodeConfig` is in the `provisioning` state.
- `network-operator-worker` fetches the new config and configures the node. It checks connectivity and sets the config state to `provisioned` or `invalid`.
- Depending on the resulting state, `configurator` acts as follows:
  - `provisioned` - it proceeds with deploying next node(s).
  - `invalid` - it aborts the deployment and reverts changes on all the nodes that were changed in this iteration.

Configurator can be set to update more than 1 node concurrently. The number of nodes updated concurrently can be set using the `update-limit` configuration flag (defaults to 1).
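The rollout flow can be sketched roughly as below. This is a simplified model under stated assumptions, not the implementation in this PR: the `Deployer` interface, `Rollout` function and `fakeDeployer` type are hypothetical, and `Deploy` is assumed to block until the worker has reported `provisioned` or `invalid`. The `updateLimit` parameter plays the role of the `update-limit` flag.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// State mirrors the config states named in the description.
type State string

const (
	Provisioning State = "provisioning"
	Provisioned  State = "provisioned"
	Invalid      State = "invalid"
)

// ErrAborted is returned when at least one node did not reach Provisioned.
var ErrAborted = errors.New("rollout aborted: a node did not reach provisioned state")

// Deployer is a hypothetical abstraction over the cluster operations.
type Deployer interface {
	Deploy(node string) State // deploy the config and wait for the worker's verdict
	Revert(node string)       // restore the node's backup configuration
}

// Rollout deploys to nodes in batches of updateLimit, concurrently within a batch.
// If any node in a batch fails, every node touched so far is reverted.
// It returns the nodes that were touched and ErrAborted on failure.
func Rollout(nodes []string, updateLimit int, d Deployer) ([]string, error) {
	if updateLimit < 1 {
		updateLimit = 1
	}
	var touched []string
	for start := 0; start < len(nodes); start += updateLimit {
		end := start + updateLimit
		if end > len(nodes) {
			end = len(nodes)
		}
		batch := nodes[start:end]
		results := make([]State, len(batch))
		var wg sync.WaitGroup
		for i, node := range batch {
			wg.Add(1)
			go func(i int, node string) {
				defer wg.Done()
				results[i] = d.Deploy(node) // each goroutine writes its own slot
			}(i, node)
		}
		wg.Wait()
		touched = append(touched, batch...)
		for _, state := range results {
			if state != Provisioned {
				for _, node := range touched {
					d.Revert(node)
				}
				return touched, ErrAborted
			}
		}
	}
	return touched, nil
}

// fakeDeployer simulates workers: nodes listed in bad report Invalid.
type fakeDeployer struct {
	mu       sync.Mutex
	bad      map[string]bool
	reverted []string
}

func (f *fakeDeployer) Deploy(node string) State {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.bad[node] {
		return Invalid
	}
	return Provisioned
}

func (f *fakeDeployer) Revert(node string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.reverted = append(f.reverted, node)
}

func main() {
	d := &fakeDeployer{bad: map[string]bool{"node-c": true}}
	touched, err := Rollout([]string{"node-a", "node-b", "node-c", "node-d"}, 1, d)
	fmt.Println("touched:", touched, "reverted:", d.reverted, "err:", err)
}
```

With a limit of 1, a failure on `node-c` reverts `node-a`, `node-b` and `node-c` and leaves `node-d` untouched. A larger limit finishes faster but widens the set of nodes that may receive a bad config before the failure is detected.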