Prevent race condition on multiple Update operations

thekad commented 4 years ago

As an ARO cluster operator, I decide to scale my cluster's compute nodes up or down, horizontally or vertically, and submit the CLI call az openshift update --refresh-cluster. My terminal crashes and I have no idea whether the call went through, so I submit another update call, and it also goes through. At the end of the process I am left with an inconsistent number of nodes from different VMSS. Ideally, the RP should be smart enough to return an error code if a cluster update is already running, instead of having two different processes fighting to finish draining/deleting nodes and scaling the scale sets up/down.
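A minimal sketch of the kind of guard described above, not the RP's actual code. It assumes a single process holding in-memory state. The names UpdateGuard, UpdateInProgressError and run_update are hypothetical, and the choice of HTTP 409 as the error code is an assumption as well.

```python
import threading


class UpdateInProgressError(Exception):
    """Raised when an update is requested while another is still running."""

    status_code = 409  # HTTP Conflict; an assumed choice of error code


class UpdateGuard:
    """Tracks which clusters have an update in flight (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running = set()

    def begin(self, cluster_id):
        # Check and mark under one lock so two callers cannot both pass the check.
        with self._lock:
            if cluster_id in self._running:
                raise UpdateInProgressError(
                    f"update already running for cluster {cluster_id}"
                )
            self._running.add(cluster_id)

    def end(self, cluster_id):
        # Clear the marker so a later update can start.
        with self._lock:
            self._running.discard(cluster_id)


def run_update(guard, cluster_id, do_update):
    """Run do_update() only if no other update holds the cluster."""
    guard.begin(cluster_id)
    try:
        return do_update()
    finally:
        # Release even if the update fails, so the cluster is not stuck.
        guard.end(cluster_id)
```

With this in place, the second call in the scenario above would be rejected with an error while the first is still running, rather than starting a competing drain/scale cycle. A real RP would likely need to keep this marker in durable shared storage instead of process memory, so that it survives restarts and is visible to every RP instance.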

thekad commented 4 years ago

This was brought up by a recent customer incident exhibiting this behavior, and it is entirely likely to happen again. cc @ehashman @jim-minter

/priority important-soon

jim-minter commented 4 years ago

I think this is definitely an issue, but I think it's an RP bug and not a plugin bug.

ehashman commented 4 years ago

Agree with Jim. I'm going to clear the priority tag; we can figure out how to prioritize this with the rest of the 3.11 work.

ehashman commented 4 years ago

Confirmed this is tracked in VSTS.

openshift/openshift-azure #2286