Welcome @accepting!
It looks like this is your first PR to kubernetes-sigs/cluster-api-bootstrap-provider-kubeadm 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.
You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.
You can also check if kubernetes-sigs/cluster-api-bootstrap-provider-kubeadm has its own contribution guidelines.
You may want to refer to our testing guide if you run into trouble with your tests not passing.
If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!
Thank you, and welcome to Kubernetes. :smiley:
Hi @accepting. Thanks for your PR.
I'm waiting for a kubernetes-sigs or kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
The second control plane should not be able to release the lock that the first control plane is holding. If it does release the lock, that is a bug.
Reading this code https://github.com/kubernetes-sigs/cluster-api-bootstrap-provider-kubeadm/blob/master/internal/locking/control_plane_init_mutex.go#L100-L126 it seems like we release the lock without checking if the Machine that has the lock is the same as the current one?
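A minimal sketch of the kind of holder check being asked about, assuming the init lock is a ConfigMap that records the holding Machine's name (the `{cluster-name}-lock` naming and the `lock-holder` key are assumptions, not the actual control_plane_init_mutex.go code):

```go
package locking

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const holderKey = "lock-holder" // hypothetical key name

// Unlock releases the init lock only if the requesting machine is the one
// that currently holds it, so a second control plane machine can never
// delete a lock it did not acquire.
func Unlock(ctx context.Context, c client.Client, clusterNamespace, clusterName, machineName string) error {
	lockName := fmt.Sprintf("%s-lock", clusterName)

	cm := &corev1.ConfigMap{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: clusterNamespace, Name: lockName}, cm); err != nil {
		if apierrors.IsNotFound(err) {
			return nil // nothing to release
		}
		return err
	}

	// Guard: only the holder may release the lock.
	if cm.Data[holderKey] != machineName {
		return fmt.Errorf("machine %q does not hold the init lock (held by %q)", machineName, cm.Data[holderKey])
	}

	return c.Delete(ctx, cm)
}
```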
@vincepri !!! yeah that would be a bug 😞
@vincepri @chuckha I don't think it's a good idea that only the owner machine can release the lock. As far as I know, the lock should be held from the time it is first acquired until cluster.Status.ControlPlaneInitialized turns true. Therefore, the lock owner may have no chance to release it.
In the easy case when there is simply one control plane machine:
Now a new bootstrap config is created for a worker machine.
Let's say there are two control planes and two managers. This situation should never happen since the deployment uses leader election to ensure only a single manager is running. But let's say this does happen.
The case you originally outlined indeed holds true. C1 acquires the lock. C2 attempts to acquire the lock, cannot, errors out, and thus removes the lock. C3 is created and acquires the lock before C1 finishes initializing. Now you have two clusters. Very bad.
What if we never released the lock in the successful case?
I think one clean way to do this is to acquire the lock and never unlock it. The only time we want to unlock the lock is if we hit an error condition after the lock is created but before the bootstrap data is generated.
This simplifies the code a bit. This change is an improvement but we should still delete the Unlock code on line 265.
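A minimal sketch of that flow; the helper names acquireInitLock, releaseInitLock, and generateInitBootstrapData are made up stand-ins, not the real CABPK functions:

```go
package locking

import "context"

// Hypothetical stand-ins for the real CABPK lock and bootstrap code.
func acquireInitLock(ctx context.Context, cluster, machine string) bool   { return true }
func releaseInitLock(ctx context.Context, cluster, machine string)        {}
func generateInitBootstrapData(ctx context.Context, machine string) error { return nil }

// reconcileControlPlaneInit acquires the init lock and keeps it on success.
// The only path that releases the lock is an error hit after the lock is
// created but before the bootstrap data is generated.
func reconcileControlPlaneInit(ctx context.Context, clusterName, machineName string) error {
	if !acquireInitLock(ctx, clusterName, machineName) {
		// Another machine is initializing the control plane; back off and
		// requeue. Crucially, do NOT release the lock here.
		return nil
	}

	if err := generateInitBootstrapData(ctx, machineName); err != nil {
		// Error after acquiring the lock but before bootstrap data exists:
		// release so another control plane machine can retry.
		releaseInitLock(ctx, clusterName, machineName)
		return err
	}

	// Success: keep holding the lock; a competing machine never releases it.
	return nil
}
```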
Interested to hear your thoughts @accepting
@chuckha In my opinion, the right place to release the lock is where cluster.Status.ControlPlaneInitialized is set to true in the cluster controller. If we cannot make the cluster controller do the work, the Unlock code on line 265 may be the most appropriate place in CABPK.
The only bad effect I can see is that the Unlock code on line 265 will be invoked every time a new bootstrap config is created afterwards. That is really unnecessary.
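Roughly, the idea would look like the sketch below; the types here are local stand-ins so the snippet is self-contained, not the real cluster-api Cluster type or controller:

```go
package example

import "context"

// Local stand-ins; the real Cluster type and init lock live in cluster-api
// and CABPK respectively.
type clusterStatus struct{ ControlPlaneInitialized bool }

type cluster struct {
	Namespace, Name string
	Status          clusterStatus
}

type initLocker interface {
	Unlock(ctx context.Context, clusterNamespace, clusterName string) bool
}

// markControlPlaneInitialized shows the idea: if the cluster controller owned
// the release, the status flip and the unlock would happen in one place, and
// CABPK would not need to unlock on every later bootstrap config.
func markControlPlaneInitialized(ctx context.Context, c *cluster, lock initLocker) {
	if c.Status.ControlPlaneInitialized {
		return
	}
	c.Status.ControlPlaneInitialized = true
	lock.Unlock(ctx, c.Namespace, c.Name)
}
```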
If we never released the lock in the successful case, it would avoid the bad effect above, but it may lead to other bad effects. For example:
- The configmap {cluster-namespace}/{cluster-name}-lock always exists, and it contains C1's machine name.
- If C1 is deleted and a new control plane machine (call it NC1) is created, {cluster-namespace}/{cluster-name}-lock still exists and the machine name in it is C1's, which does not equal NC1's machine name, so NC1 fails to acquire the init lock. Hence, NC1 can never generate bootstrap data.

@accepting the config map has an owner reference so it will be deleted when we delete the owning machine
Another way this would work is: if the bootstrap config already has bootstrap data, the bootstrap config is holding the lock, and cluster.Status.ControlPlaneInitialized is true, we can then delete the lock at that point.
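Expressed as a single predicate, the cleanup condition would be something like this sketch (parameter names are illustrative, not the real CABPK fields):

```go
package locking

// shouldDeleteInitLock: delete the lock only when bootstrap data already
// exists, this config's machine is the lock holder, and the cluster reports
// that the control plane has initialized.
func shouldDeleteInitLock(hasBootstrapData bool, lockHolder, machineName string, controlPlaneInitialized bool) bool {
	return hasBootstrapData &&
		lockHolder == machineName &&
		controlPlaneInitialized
}
```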
@chuckha You are right, I missed that. But the owner reference is set to the cluster now. We may need to change it to the related machine.
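For illustration, pointing the lock ConfigMap's owner reference at the Machine that acquired it would let garbage collection remove the lock when that Machine is deleted. A sketch under those assumptions (the API version string and the lock-holder key are assumed, not taken from this PR):

```go
package locking

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// newLockConfigMap builds the lock ConfigMap with its owner reference pointing
// at the Machine that acquired it, so deleting that Machine garbage-collects
// the lock instead of leaving a stale holder behind.
func newLockConfigMap(clusterNamespace, clusterName, machineName string, machineUID types.UID) *corev1.ConfigMap {
	return &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Namespace: clusterNamespace,
			Name:      clusterName + "-lock",
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "cluster.x-k8s.io/v1alpha2", // assumed API version
				Kind:       "Machine",
				Name:       machineName,
				UID:        machineUID,
			}},
		},
		Data: map[string]string{"lock-holder": machineName},
	}
}
```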
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: accepting, chuckha
The full list of commands accepted by this bot can be found here.
The pull request process is described here
Signed-off-by: JunLi lijun.git@gmail.com
What this PR does / why we need it: Fix a bug in the init lock.
Consider the following case: two control plane machines (call them C1 and C2) are being created concurrently. Assume C1 acquires the init lock first. Since C2 fails to get the lock, it exits the reconcile loop and releases the init lock. Before C1 finishes initializing itself, we create another control plane machine (call it C3). C3 will acquire the init lock successfully and also generate a control-plane init config, which is what we do not want to see.
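A minimal sketch of the acquire semantics the fix is after: a machine that loses the race simply backs off and never deletes a lock another machine created. The ConfigMap naming and the lock-holder key are assumptions, not the exact code in this PR:

```go
package locking

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Lock returns true only if this machine created the lock ConfigMap or
// already holds it. Failing to acquire must not delete the ConfigMap,
// which is the behaviour that let C2 release C1's lock and C3 start a
// second control plane init.
func Lock(ctx context.Context, c client.Client, clusterNamespace, clusterName, machineName string) (bool, error) {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Namespace: clusterNamespace,
			Name:      fmt.Sprintf("%s-lock", clusterName),
		},
		Data: map[string]string{"lock-holder": machineName},
	}

	err := c.Create(ctx, cm)
	if err == nil {
		return true, nil // we won the race and now hold the lock
	}
	if !apierrors.IsAlreadyExists(err) {
		return false, err
	}

	// The lock already exists: re-read it to see who holds it, but do not
	// delete or modify it from here.
	existing := &corev1.ConfigMap{}
	key := client.ObjectKey{Namespace: clusterNamespace, Name: cm.Name}
	if err := c.Get(ctx, key, existing); err != nil {
		return false, err
	}
	return existing.Data["lock-holder"] == machineName, nil
}
```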