Open · richardcase opened this issue 2 years ago
@richardcase: This issue is currently awaiting triage.
If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
There were some discussions around this before. This might be worth discussing in the cluster-api office hours, given that all providers are affected by this.
Good idea. I will add an agenda item for this.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
From triage 12/2022: Let's add this to the agenda for the next office hours. Core CAPI MachineDeployment does not support multiple failure domains; please see https://github.com/kubernetes-sigs/cluster-api/issues/3358. We'll hold off on applying the /triage label until then.
For reference: the Oracle and MicroVM infrastructure providers do distribute machines in one MachineDeployment across multiple failure domains. (Links to be added here)
Discussed in the 6th Jan 2023 office hours.
As discussed in the CAPA office hours, Indeed had several CAPA workload clusters (self-managed, non-EKS) spanning all AZs in us-east-2 on 28 July 2022 during the outage. Our clusters are configured with a machine deployment in each AZ, and the cluster autoscaler is configured to autoscale those machine deployments using the clusterapi provider. We also configure the cluster autoscaler and all of the CAPI/CAPA controllers to use leader election and to run 3 replicas of each.
What we observed was that when power to AZ1 was lost, roughly 10 minutes later (I believe the 10 minutes comes from the 5-minute delay for the nodes to be marked unready due to missing kubelet heartbeats, plus the 5-minute pod-eviction-timeout of the kube-controller-manager, but I'm not 100% certain) pods were recreated by the Kubernetes scheduler without any outside interaction and sat in the Pending state. The cluster autoscaler scaled up the machine deployments, and as soon as the new machines joined the cluster, the workloads scheduled and continued to perform normally, despite the control plane being in a degraded state. No human intervention was required during the outage or for the cluster to recover after AZ1 was restored.
Below are two sets of graphs from one of those clusters, which show the control plane becoming degraded (2/3 available) and then the pods being scheduled and created. The pods are scheduled in three "waves" as machines join the cluster and allow more pods to schedule.
I can provide more specific details on how the MDs were configured if that's useful.
So I wonder whether, instead of implementing this feature, documentation on how to correctly configure CAPA clusters to sustain an AZ outage would be more desirable.
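To make the setup described above more concrete, here is a rough Go sketch (not Indeed's actual configuration) of "one MachineDeployment per AZ, each annotated for the cluster-autoscaler clusterapi provider", written against the cluster-api v1beta1 Go types. The function name, namespace, and min/max sizes are invented for illustration, and the Bootstrap and InfrastructureRef fields a real MachineDeployment needs are omitted:

```go
package example

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// machineDeploymentsPerAZ is an illustrative helper (not part of CAPI/CAPA):
// it builds one MachineDeployment per availability zone, each pinned to a
// single failure domain and annotated so the cluster-autoscaler clusterapi
// provider will scale it. Names, namespace, and size bounds are placeholders.
func machineDeploymentsPerAZ(clusterName string, azs []string) []clusterv1.MachineDeployment {
	mds := make([]clusterv1.MachineDeployment, 0, len(azs))
	for _, az := range azs {
		mds = append(mds, clusterv1.MachineDeployment{
			ObjectMeta: metav1.ObjectMeta{
				Name:      fmt.Sprintf("%s-workers-%s", clusterName, az),
				Namespace: "default",
				Annotations: map[string]string{
					// Autoscaling bounds read by the cluster-autoscaler clusterapi provider.
					"cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size": "1",
					"cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size": "10",
				},
			},
			Spec: clusterv1.MachineDeploymentSpec{
				ClusterName: clusterName,
				Replicas:    ptr.To[int32](1),
				Template: clusterv1.MachineTemplateSpec{
					Spec: clusterv1.MachineSpec{
						ClusterName:   clusterName,
						FailureDomain: ptr.To(az), // pin this MachineDeployment to one AZ
					},
				},
			},
		})
	}
	return mds
}
```

Calling machineDeploymentsPerAZ("prod", []string{"us-east-2a", "us-east-2b", "us-east-2c"}) would yield three MachineDeployments, one per AZ, which the autoscaler can then grow independently when pods go Pending.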
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/reopen
/remove-lifecycle rotten
@richardcase: Reopened this issue.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/reopen
/remove-lifecycle rotten
@richardcase: Reopened this issue.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/kind feature
Describe the solution you'd like
Currently, CAPI will spread control plane machines across the reported failure domains (i.e. availability zones). It doesn't do this for worker machines: machines in a machine deployment (or standalone machines).
Current advice is to create separate machine deployments and manually assign an AZ (via FailureDomain) to each of the machine deployments, to ensure that you have worker machines in different AZs.
It would be better, when creating machines (if no failure domain is specified on the Machine), to use the failure domains on the Cluster and create the machine in the failure domain with the fewest machines already. CAPI has some functions we could potentially use. Something like this (see the sketch below):
Anything else you would like to add:
We need to investigate if this is feasible, or if it is something that should be upstream in machine deployments.
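The snippet originally referenced by "Something like this:" is not reproduced here. As a rough, self-contained sketch of the proposed behaviour (simplified stand-in types, not the actual CAPI failure-domain helpers), the selection step could look roughly like:

```go
package failuredomain

// pickFewest is a minimal sketch of the balancing idea described above: given
// the failure domains reported on the Cluster and the failure domain each
// existing Machine is placed in, return the domain that currently has the
// fewest machines. The types are deliberately simplified (plain strings and
// maps), not the real CAPI API.
//
// domains        - failure domains reported on the Cluster (e.g. AZ names)
// machineDomains - machine name -> failure domain of existing worker machines
func pickFewest(domains []string, machineDomains map[string]string) string {
	if len(domains) == 0 {
		return "" // nothing to choose from; caller falls back to default placement
	}

	// Count existing machines per reported failure domain.
	counts := make(map[string]int, len(domains))
	for _, d := range domains {
		counts[d] = 0
	}
	for _, d := range machineDomains {
		if _, known := counts[d]; known {
			counts[d]++
		}
	}

	// Pick the least-populated domain (ties broken by slice order).
	best := domains[0]
	for _, d := range domains[1:] {
		if counts[d] < counts[best] {
			best = d
		}
	}
	return best
}
```

This mirrors, for worker machines, what is already done today when spreading control plane machines across the reported failure domains.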
Environment:
- Kubernetes version (use kubectl version):
- OS (e.g. from /etc/os-release):