Closed — rudoi closed this issue 5 years ago
/priority important-soon /milestone v0.4.x
@detiber: The provided milestone is not valid for this repository. Milestones in this repository: [Next, v0.1.x, v0.2.x]. Use `/milestone clear` to clear the milestone.
/milestone v0.2.x
Some thoughts on options:
In CAPA it seems that deletes are blocking (it looked like it didn't reconcile any other AWSMachines while the delete was ongoing), which meant we ran into this often on `replicas: 12` / `maxSurge: 100%`-style rolling deployments. Turning up the concurrency has largely worked around this case, though.
What's proving a little stickier is AWS-side limits, etc., that have indefinite resolution timelines: if I ask for more `m5.4xlarge`s than my quota says I can have, there's currently a 10-minute clock for me to 1) notice, 2) file the limit increase request, and 3) have the limit increase approved and applied to my account. When that window closes, I manually "retry" the machine by deleting it and letting the deployment re-create it.
@ncdc IMO it would be swell if capbk could notice the machine was still provisioning and keep pushing out the token expiry timestamp; I think that's the only option where I'm not racing AWS quotas against a clock somewhere.
@sethp-nr that's definitely an option too
> @ncdc IMO it would be swell if capbk could notice the machine was still provisioning and keep pushing out the token expiry timestamp; I think that's the only option where I'm not racing AWS quotas against a clock somewhere.
I definitely like the idea of this approach as well, since it helps reduce the edge cases.
But I definitely think we could benefit from:
- Delay cloning the bootstrap and infra templates until the machine reconciler is ready to work on that machine (this would mean moving the cloning from machineset reconciler to machine reconciler)
Pushing out the token expiration in CABPK seems like a good approach to me; it definitely sounds like a less invasive change than moving the cloning logic from one controller to another.
I think the combination approach of both is probably best. From a security standpoint we want to limit the lifetime of the bootstrap tokens as much as possible.
I'm not sure I understand the "cloning delay" option – in this case, the long pole is the infrastructure provider. How would the Machine controller be signaled by the AWSMachine controller that it was ready to work?
@sethp-nr it's less that the capi machine controller gets signaled by the infra machine controller and more that the capi machine controller clones from the bootstrap and infra templates when it sees a machine "in this state". The state being "machine came from a machine set and the templates haven't been cloned yet". We'd need to work out the exact details, but this is the hand-waving version 😄
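The hand-waving version above can be sketched as a state check. The struct and field names here are hypothetical simplifications; the real objects are Machine/MachineSet CRDs with typed object references:

```go
package main

import "fmt"

// Machine is a hypothetical, simplified view of a capi Machine for the
// "came from a machine set and the templates haven't been cloned yet" state.
type Machine struct {
	OwnedByMachineSet bool
	BootstrapRef      string // empty until the bootstrap template is cloned
	InfraRef          string // empty until the infra template is cloned
}

// needsTemplateClone is the condition the capi machine controller would
// check before cloning from the bootstrap and infra templates.
func needsTemplateClone(m Machine) bool {
	return m.OwnedByMachineSet && m.BootstrapRef == "" && m.InfraRef == ""
}

func main() {
	pending := Machine{OwnedByMachineSet: true}
	cloned := Machine{OwnedByMachineSet: true, BootstrapRef: "kc-1", InfraRef: "am-1"}

	fmt.Println(needsTemplateClone(pending)) // true: clone now
	fmt.Println(needsTemplateClone(cloned))  // false: already cloned
}
```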
Oh, I think I understand. Right now the MachineSet controller is generating the bootstrap configs "all at once" even if the Machine controller is bogged down somewhere and can't get to them.
That said, I'm not sure if it helps my case very much – the Machine controller is fine, I think, so I would expect it to reconcile the bootstrap config at about the same time the MachineSet currently does. I'm pretty sure I saw the Machine go through lots of reconcile cycles looking to set a NodeRef before the AWSMachine got its "quota exceeded" error.
We're getting bit by this with some frequency ("rapid iteration" yay) so I was planning to pick up attempting to refresh the token TTL – it sounds like that's at least a part of the desired state.
/assign /lifecycle active
Should I transfer this to the CABPK repo in that case?
Oh, sure – or I could file an issue over there to assign/mark as active if we want to leave the discussion on template cloning open.
I'll move this one and we can open a new one here if we want to discuss moving cloning around.
/priority important-soon /milestone v0.1.x
/lifecycle active
/kind bug
What steps did you take and what happened:
`kubeadm join` fails on new nodes because the bootstrap token has already expired by the time the machine finishes provisioning.
What did you expect to happen:
Successful `kubeadm join`.
Anything else you would like to add:
This problem is mitigated when the concurrency flags are set high, but we may want to have a mechanism for extending the life of a token while machines are still provisioning.
When the templates are duplicated by the MachineSet controller, you basically get all the KubeadmConfigs instantly (meaning each token is already created and ticking away), but if Machine reconciliation isn't fast enough (EC2 slow, or concurrency set to 1, etc.), many of those tokens could be expired by the time the nodes attempt to join.
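For reference, kubeadm bootstrap tokens live as secrets of type `bootstrap.kubernetes.io/token` in `kube-system`, with an explicit `expiration` field; extending a token's life amounts to updating that timestamp. The token ID, secret value, and timestamp below are illustrative:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: bootstrap-token-abcdef   # "abcdef" is an illustrative token ID
  namespace: kube-system
type: bootstrap.kubernetes.io/token
stringData:
  token-id: abcdef
  token-secret: 0123456789abcdef          # illustrative value
  expiration: "2019-09-01T12:10:00Z"      # RFC 3339; pushing this out extends the token
  usage-bootstrap-authentication: "true"
  usage-bootstrap-signing: "true"
```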
cc @sethp-nr
Environment:
- Cluster API components: CAPI / CAPA / CABPK
- Kubernetes version (use `kubectl version`): 1.15.3
- OS (e.g. from `/etc/os-release`): Ubuntu 18.04