AzureMachinePool UX: stays in Updating state until CNI is installed

kubernetes-sigs / cluster-api-provider-azure

Cluster API implementation for Microsoft Azure

https://capz.sigs.k8s.io/

Apache License 2.0

295 stars 425 forks

AzureMachinePool UX: stays in Updating state until CNI is installed #2722

Open CecileRobertMichon opened 2 years ago

CecileRobertMichon commented 2 years ago

/kind bug

[Before submitting an issue, have you checked the Troubleshooting Guide?]

What steps did you take and what happened: [A clear and concise description of what the bug is.]

Create a cluster with "machinepool" flavor following quickstart instructions:

export WORKER_MACHINE_COUNT=1
clusterctl generate cluster test-mp --flavor machinepool | kubectl apply -f -

Notice that the VMSS becomes ready and the MachinePoolMachines are in Succeeded state, but the AzureMachinePool status is stuck in Updating:

➜  cluster-api-provider-azure git:(main) kubectl get azuremachinepool                              
NAME           REPLICAS   READY   STATE
test-mp-mp-0                      Updating
➜  cluster-api-provider-azure git:(main) kubectl get azuremachinepoolmachines
NAME             VERSION   READY   STATE
test-mp-mp-0-0   v1.24.5           Succeeded

This repros with v1.5.1.

What's interesting is that this does not seem to reproduce in our e2e tests, which are testing release-1.5: https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-azure#capz-periodic-e2e-full-v1beta1 (I double-checked that the test waits for the MachinePool ready replicas to equal the spec replicas, which would time out in the scenario above).
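
For anyone reproducing this by hand, the readiness check described above can be approximated from the command line. This is only a minimal sketch, not the e2e test's actual code: the helper names `replicas_ready` and `wait_for_replicas` are hypothetical, and it assumes the MachinePool exposes `.spec.replicas` and `.status.readyReplicas` and that `kubectl` points at the management cluster.

```shell
# replicas_ready SPEC READY
# Succeeds only when both values are set and equal. An empty READY value
# (as in the blank READY column in the output above) counts as not ready.
replicas_ready() {
  spec="$1"
  ready="$2"
  [ -n "$spec" ] && [ -n "$ready" ] && [ "$spec" -eq "$ready" ]
}

# wait_for_replicas NAME TIMEOUT_SECONDS
# Polls the MachinePool every 10 seconds until ready replicas match spec
# replicas, or returns 1 once the timeout has passed.
wait_for_replicas() {
  name="$1"
  deadline=$(( $(date +%s) + $2 ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    spec=$(kubectl get machinepool "$name" -o jsonpath='{.spec.replicas}')
    ready=$(kubectl get machinepool "$name" -o jsonpath='{.status.readyReplicas}')
    if replicas_ready "$spec" "$ready"; then
      return 0
    fi
    sleep 10
  done
  return 1
}
```

Something like `wait_for_replicas test-mp-mp-0 600` would then be expected to time out in the stuck case, assuming the MachinePool shares the AzureMachinePool's name.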

What did you expect to happen:

Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

Environment:

cluster-api-provider-azure version: v1.5.1
Kubernetes version: (use kubectl version):
OS (e.g. from /etc/os-release):

mboersma commented 2 years ago

/assign

mboersma commented 2 years ago

I tried this with main and make tilt-up + the machinepool flavor from Tilt, and it behaved correctly within a few minutes:

% k get azuremachinepool
NAME                     REPLICAS   READY   STATE
machinepool-27094-mp-0   2          true    Succeeded
% k get azuremachinepoolmachines
NAME                       VERSION   READY   STATE
machinepool-27094-mp-0-0   v1.23.9   true    Succeeded
machinepool-27094-mp-0-1   v1.23.9   true    Succeeded

I'll try again specifically with v1.5.1 and the quickstart route.

mboersma commented 2 years ago

I can repro by following the quick start:

% clusterctl init --infrastructure azure
Fetching providers
Installing cert-manager Version="v1.9.1"
Waiting for cert-manager to be available...
Installing Provider="cluster-api" Version="v1.2.3" TargetNamespace="capi-system"
Installing Provider="bootstrap-kubeadm" Version="v1.2.3" TargetNamespace="capi-kubeadm-bootstrap-system"
Installing Provider="control-plane-kubeadm" Version="v1.2.3" TargetNamespace="capi-kubeadm-control-plane-system"
Installing Provider="infrastructure-azure" Version="v1.5.2" TargetNamespace="capz-system"
...
% k get azuremachinepool        
NAME           REPLICAS   READY   STATE
test-mp-mp-0                      Updating
% k get azuremachinepoolmachines
NAME             VERSION   READY   STATE
test-mp-mp-0-0   v1.24.5           Succeeded

Edit: I think this failed because I hadn't followed through with installing Calico CNI to the workload cluster. In further testing, that seems to be the key.

CecileRobertMichon commented 2 years ago

Have you tried with tilt + the v1.5.1 tag? Just to know whether this is a tilt vs. clusterctl difference or a v1.5.1 vs. main branch difference.

mboersma commented 2 years ago

Machinepool works just fine using make tilt-up in CAPZ with the v1.5.1 tag. Seems to be a clusterctl- or Quick Start-related issue, rather than a change in our code.

mboersma commented 2 years ago

The template generated by clusterctl generate cluster test-mp --flavor machinepool is basically identical to that generated by clicking the "machinepool" link in CAPZ Tilt. I just wanted to rule that out as a difference. I'll use the "known working" cluster template for further testing regardless.

mboersma commented 2 years ago

I'm seeing this behavior (AzureMachinePoolMachines come up but the AzureMachinePool stays stuck at "updating") if I don't install Calico as recommended for Azure in the Quick Start. Once I install the manifest and Calico starts running, both AMP resource types soon move to READY=true and STATE=Succeeded.

Maybe there's a more informative status we could apply to an AMP in this case?
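
To make the suggestion concrete, here is a rough sketch of the kind of decision such a status could encode. It is purely illustrative: `amp_status_hint` and its message text are hypothetical and do not correspond to any existing CAPZ state, condition, or controller code. The inputs are the AzureMachinePool STATE, the number of provisioned AzureMachinePoolMachines, and the number of Ready nodes in the workload cluster.

```shell
# amp_status_hint STATE MACHINES READY_NODES
# Hypothetical helper: prints STATE unchanged, except when the pool is
# Updating with provisioned machines but no Ready nodes, in which case it
# appends a hint that a CNI may be missing.
amp_status_hint() {
  state="$1"
  machines="$2"
  ready_nodes="$3"
  if [ "$state" = "Updating" ] && [ "$machines" -gt 0 ] && [ "$ready_nodes" -eq 0 ]; then
    echo "$state (machines provisioned, but no nodes are Ready; is a CNI installed?)"
  else
    echo "$state"
  fi
}
```

With the values from the repro above (`amp_status_hint Updating 1 0`), this prints the hint instead of a bare Updating.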

primeroz commented 1 year ago

This is my experience too: without a working CNI, the nodes never become ready, so the AMP gets stuck.
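
A quick way to confirm this is to look at node readiness in the workload cluster. The sketch below is one possible check, not an official troubleshooting step: `not_ready_nodes` is a hypothetical helper that filters the output of `kubectl get nodes --no-headers`, and the cluster name `test-mp` is taken from the repro above.

```shell
# not_ready_nodes
# Reads `kubectl get nodes --no-headers` output on stdin and prints the name
# of every node whose STATUS does not start with "Ready" (so a status such as
# "Ready,SchedulingDisabled" still counts as Ready).
not_ready_nodes() {
  awk 'NF >= 2 { split($2, s, ","); if (s[1] != "Ready") print $1 }'
}

# Usage against the workload cluster (run manually):
#   clusterctl get kubeconfig test-mp > test-mp.kubeconfig
#   kubectl --kubeconfig test-mp.kubeconfig get nodes --no-headers | not_ready_nodes
```

If the machine pool's nodes show up in that list before a CNI is installed and drop out of it afterwards, that matches the behavior described in this thread.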

dtzar commented 1 year ago

@mboersma - will this be fixed, or is it already fixed by any of your PRs? People shouldn't have to install Calico to make it work (as opposed to, e.g., Azure CNI), and if we require a CNI provider (even if not Calico), we definitely should document this.

CecileRobertMichon commented 1 year ago

/milestone v1.8

k8s-ci-robot commented 1 year ago

@CecileRobertMichon: The provided milestone is not valid for this repository. Milestones in this repository: [next, v1.9]

Use /milestone clear to clear the milestone.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/2722#issuecomment-1464462270):

>/milestone v1.8

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

mboersma commented 1 year ago

/milestone v1.9

mboersma commented 1 year ago

/milestone v1.11

willie-yao commented 1 year ago

/milestone next

k8s-ci-robot commented 1 year ago

@willie-yao: You must be a member of the kubernetes-sigs/cluster-api-provider-azure-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Cluster API Provider Azure Maintainers and have them propose you as an additional delegate for this responsibility.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/2722#issuecomment-1682590129):

>/milestone next

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

mboersma commented 1 year ago

/unassign
/milestone next

Unfortunately, I haven't made any progress on this, and I'm not likely to during this release cycle.

Jont828 commented 1 year ago

/milestone next

k8s-ci-robot commented 1 year ago

@Jont828: You must be a member of the kubernetes-sigs/cluster-api-provider-azure-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Cluster API Provider Azure Maintainers and have them propose you as an additional delegate for this responsibility.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/2722#issuecomment-1791047465):

>/milestone next

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

willie-yao commented 7 months ago

/remove-lifecycle rotten

k8s-triage-robot commented 4 months ago

/lifecycle stale

willie-yao commented 4 months ago

/remove-lifecycle stale

k8s-triage-robot commented 1 month ago

/lifecycle stale

willie-yao commented 1 month ago

/remove-lifecycle stale