vexxhost / magnum-cluster-api

Cluster API driver for OpenStack Magnum
Apache License 2.0

Reduce lock time for operations #374

Closed mnaser closed 6 days ago

mnaser commented 1 month ago

At the moment, if a create_nodegroup and an update_nodegroup happen at the same time, we end up locking the cluster resource for quite a while, long enough to actually cause a fault:

2024-05-16 18:11:46.093 29 ERROR oslo_messaging.rpc.server sherlock.lock.LockTimeoutException: Timeout elapsed after 10 seconds while trying to acquiring lock.

My guess is that CAPI is taking too long to mutate and that this causes the timeout, because everything else is just a simple, quick API call. So I think we need to look at this code more closely:
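To make the suspected failure mode concrete, here is a minimal sketch that uses only the standard library. It does not use sherlock or the driver code: `LockTimeout`, `run_locked`, and `simulate` are hypothetical names, `threading.Lock` stands in for the distributed lock, and the sleep stands in for a slow CAPI mutation. It only illustrates that a second operation fails once the first one holds the lock for longer than the acquire timeout.

```python
import threading
import time


class LockTimeout(Exception):
    """Hypothetical stand-in for sherlock.lock.LockTimeoutException."""


def run_locked(lock, work, timeout):
    """Run work() while holding lock; raise LockTimeout if acquiring takes too long."""
    if not lock.acquire(timeout=timeout):
        raise LockTimeout(f"Timeout elapsed after {timeout} seconds")
    try:
        return work()
    finally:
        lock.release()


def simulate(hold_seconds, timeout):
    """Start a slow operation, then try a second one; return the second one's outcome."""
    lock = threading.Lock()
    started = threading.Event()

    def slow_work():
        started.set()             # the lock is held from this point on
        time.sleep(hold_seconds)  # assumed slow mutation done under the lock
        return "created"

    first = threading.Thread(target=run_locked, args=(lock, slow_work, timeout))
    first.start()
    started.wait()  # make sure the first operation really owns the lock
    try:
        outcome = run_locked(lock, lambda: "updated", timeout)
    except LockTimeout:
        outcome = "timeout"
    first.join()
    return outcome
```

With a hold time longer than the timeout, for example `simulate(0.3, 0.05)`, the second operation times out; with a short hold time it simply waits and then proceeds.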

https://github.com/vexxhost/magnum-cluster-api/blob/main/magnum_cluster_api/driver.py#L356-L364

https://github.com/vexxhost/magnum-cluster-api/blob/main/magnum_cluster_api/driver.py#L457-L467

Those two need to be more efficient somehow; holding the lock for 10 seconds doesn't seem ideal. Perhaps we just "trust" that CAPI is doing the right thing and move straight to in progress, but then in the cluster status update we need to somehow figure out whether there has been a recent change, and wait there instead.
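One possible shape for that idea, as a rough sketch rather than a proposal for the actual driver: hold the lock only long enough to record that a change was submitted, run the slow call outside it, and have the status update keep reporting in progress until a settle window has passed. Everything here is an assumption made for illustration: the `ClusterTracker` class, the `settle_seconds` window, the `"IN_PROGRESS"` string, and the in-process `threading.Lock` in place of the real distributed lock.

```python
import threading
import time


class ClusterTracker:
    """Hypothetical helper: tracks when a cluster was last changed."""

    def __init__(self, settle_seconds=30.0, clock=time.monotonic):
        self._lock = threading.Lock()
        self._last_change = None
        self._settle_seconds = settle_seconds
        self._clock = clock

    def submit(self, apply_change):
        """Record the change under the lock, then apply it without holding the lock."""
        with self._lock:
            self._last_change = self._clock()
        # The slow call (assumed to be the CAPI mutation) runs unlocked,
        # so concurrent operations are no longer serialized here.
        apply_change()
        return "IN_PROGRESS"

    def status(self, observed_status):
        """Report in progress while a recent change may still be settling."""
        with self._lock:
            last = self._last_change
        if last is not None and self._clock() - last < self._settle_seconds:
            return "IN_PROGRESS"
        return observed_status
```

The trade-off is that concurrent mutations are no longer serialized by the lock, which is exactly the "trust CAPI" part; a fixed settle window is also the crudest way to detect a recent change, and comparing observed versus desired state would be more precise.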