Duplicate of https://github.com/konpyutaika/nifikop/issues/49.
Hi @juldrixx, thanks for the reply.
This issue is kinda similar to #49 as both of them are triggered by a crash at a particular point. However, we believe they are different issues and should be handled in different ways.
First, the triggers are different. #49 happens when a crash occurs between (1) updating the config object and (2) setting `ConfigOutOfSync` in `Reconcile()` in `resource.go`. This issue, in contrast, is triggered when a crash occurs between (1) setting `ConfigInSync` and (2) setting `GracefulUpscaleSucceeded` in `reconcileNifiPod()` in `nifi.go`.
Second, the consequences are different. For #49, once the issue is triggered, nifikop cannot successfully restart the pod to load the new configuration. For this issue, once it is triggered, nifikop cannot scale down the nifi cluster.
Regarding the fix, at a high level both issues could be addressed by carefully changing the order of certain updates, but the concrete fixes would differ because the triggers differ.
Bug Report
We find that nifikop will never be able to scale down the nificluster successfully if it crashes in the middle of `reconcileNifiPod()` and later restarts. More concretely, inside `reconcileNifiPod()`, nifikop (among other steps) creates the nifi pod (step 2), sets `status.nodesState[nodeId].configurationState` of the nificluster CR to `ConfigInSync` (step 3), and finally sets `status.nodesState[nodeId].gracefulActionState.actionState` of the nificluster CR to `GracefulUpscaleSucceeded` (step 4).

If nifikop crashes between steps 3 and 4 and later restarts, it ends up in an intermediate state where the nifi pod has been created (with `ConfigInSync`) but the corresponding `actionState` is not set. Note that because the pod already exists, nifikop will not run steps 2, 3, and 4 again.
Later, if the user wants to scale down the nificluster, this pod is supposed to be offloaded and deleted gracefully. Inside `reconcileNifiPodDelete`, nifikop checks whether the corresponding `actionState` of the pod is `GracefulUpscaleSucceeded` or `GracefulUpscaleRequired`. If so, it adds the pod to `nodesPendingGracefulDownscale` and later offloads and deletes the nifi node (pod). However, since the corresponding `actionState` was never set because of the earlier crash, the graceful downscale never happens.
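Continuing the sketch above, the downscale-time check can be approximated by a helper like the following. It is illustrative only (the real logic lives in `reconcileNifiPodDelete` in `nifi.go`), but it shows why a node whose `actionState` was never set is silently skipped during scale-down.

```go
// shouldGracefullyDownscale mirrors the check described above: only nodes
// whose actionState is GracefulUpscaleSucceeded or GracefulUpscaleRequired
// are added to nodesPendingGracefulDownscale and eventually offloaded and
// deleted. A node stuck with an empty actionState (the crash scenario in
// this report) never passes this check.
func shouldGracefullyDownscale(n *nodeState) bool {
	return n.actionState == "GracefulUpscaleSucceeded" ||
		n.actionState == "GracefulUpscaleRequired"
}
```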
What did you do? Scale down a nificluster from 2 nodes to 1 node.
What did you expect to see? The second nifi pod should be deleted successfully.
What did you see instead? Under which circumstances? The second nifi pod never gets deleted.
Environment
Possible Solution One potential solution is to switch the order of step 3 (set `configurationState` to `ConfigInSync`) and step 4 (set `actionState` to `GracefulUpscaleSucceeded`). If nifikop crashes before `ConfigInSync` is set, `reconcileNifiPod()` will later delete and recreate the pod.
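In terms of the simplified sketch earlier in this issue (again, not the real nifikop code), the proposed reordering would look roughly like this:

```go
// reconcileNifiPodReordered swaps steps 3 and 4 relative to the sketch above.
func reconcileNifiPodReordered(n *nodeState, crashAfter int) {
	if n.podCreated && n.configurationState != "ConfigInSync" {
		// Approximation of the recovery path described above: if ConfigInSync
		// was never set, the pod is deleted and recreated on the next
		// reconciliation, re-running all the steps.
		*n = nodeState{}
	}
	if n.podCreated {
		return
	}
	n.podCreated = true // step 2: create the pod
	if crashAfter == 2 {
		return
	}
	n.actionState = "GracefulUpscaleSucceeded" // new step 3 (was step 4)
	if crashAfter == 3 {
		return // a crash here no longer leaves a permanently stuck node
	}
	n.configurationState = "ConfigInSync" // new step 4 (was step 3)
}
```

With this order, a crash right after the third update leaves the node without `ConfigInSync`, so the next reconciliation recreates the pod and repeats all the updates instead of leaving `actionState` unset forever.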
Additional context We are willing to help fix the bug. The bug was found automatically by our tool Sieve: https://github.com/sieve-project/sieve