At the moment, if a `create_nodegroup` and an `update_nodegroup` happen at the same time, we end up holding the lock on the cluster resource for quite a while, long enough to actually cause a fault:
2024-05-16 18:11:46.093 29 ERROR oslo_messaging.rpc.server sherlock.lock.LockTimeoutException: Timeout elapsed after 10 seconds while trying to acquiring lock.
I guess that CAPI is taking too long to mutate and that is what causes the timeout, because everything else is just a quick API call, so I think we need to look more closely at this code:
https://github.com/vexxhost/magnum-cluster-api/blob/main/magnum_cluster_api/driver.py#L356-L364 https://github.com/vexxhost/magnum-cluster-api/blob/main/magnum_cluster_api/driver.py#L457-L467
Those two need to be more efficient somehow; holding the lock for 10 seconds isn't ideal. Perhaps we just "trust" that CAPI is doing the right thing and move to in progress, but then in the cluster status update we need to figure out whether there has been a recent change and, if so, wait there instead.
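To make the idea concrete, here is a rough sketch of what "trust CAPI and wait in the status update" could look like. It uses only the standard library and is not based on the actual driver code: `ClusterState`, `handle_nodegroup_request`, `update_cluster_status`, `slow_capi_mutation`, and `SETTLE_SECONDS` are all hypothetical names, and an in-process `threading.Lock` stands in for the real distributed lock.

```python
import threading
import time

# Hypothetical sketch: none of these names exist in magnum-cluster-api.
SETTLE_SECONDS = 30.0  # assumed window during which a change counts as "recent"


class ClusterState:
    def __init__(self):
        self._lock = threading.Lock()  # stand-in for the real cluster lock
        self.status = "UPDATE_COMPLETE"
        self.last_change = None  # monotonic timestamp of the latest request

    def mark_in_progress(self, timeout=10.0):
        # Hold the lock only for the quick state flip, not for the CAPI call.
        if not self._lock.acquire(timeout=timeout):
            raise TimeoutError("could not acquire cluster lock")
        try:
            self.status = "UPDATE_IN_PROGRESS"
            self.last_change = time.monotonic()
        finally:
            self._lock.release()

    def changed_recently(self, now=None):
        now = time.monotonic() if now is None else now
        with self._lock:
            if self.last_change is None:
                return False
            return now - self.last_change < SETTLE_SECONDS


def slow_capi_mutation(delay):
    time.sleep(delay)  # placeholder for the slow CAPI mutation


def handle_nodegroup_request(state, delay=0.05):
    state.mark_in_progress()
    slow_capi_mutation(delay)  # runs without the cluster lock held


def update_cluster_status(state, observed_status, now=None):
    # While a change is still fresh, keep the in-progress status and let the
    # next periodic status update try again.
    if state.changed_recently(now):
        return state.status
    with state._lock:
        state.status = observed_status
    return state.status
```

With this shape, two concurrent nodegroup requests only contend for the lock during the status flip, and the waiting moves into the periodic status update. Whether a time window is the right signal for "recent change" is an open question; comparing a generation or resource version on the CAPI objects might be more robust.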