Azure VM unable to come in service

saurabrana commented 1 year ago

/kind bug

[Yes, we have checked the Troubleshooting Guide.]

What steps did you take and what happened: We are trying to launch an Azure VM, but it eventually fails to come into service because the CAPZ VM extension fails.

What did you expect to happen: The VM should launch successfully and always be able to join the cluster.

Anything else you would like to add: See the attached cloudbase-init.log and Logs.zip.

Environment:

cluster-api-provider-azure version: v1.8.0
Kubernetes version: 1.25.7
OS: Windows, build 10.0.20348

hermesimi commented 11 months ago

We are facing a similar issue. Did you find any solution?

CecileRobertMichon commented 11 months ago

Hi @saurabrana, have you tried with a newer version of CAPZ? Could you please share repro steps (what do your cluster and machines look like)?

@hermesimi the VM extension failing just means that the k8s node join failed, which could be for a variety of reasons. Are you also seeing this on Windows? What k8s version? What CAPZ version?

hermesimi commented 11 months ago

Hey there @CecileRobertMichon. We did not test Windows. We updated all versions last week just to make sure.

NAME                    NAMESPACE                           TYPE                     CURRENT VERSION   NEXT VERSION
bootstrap-kubeadm       capi-kubeadm-bootstrap-system       BootstrapProvider        v1.5.2            Already up to date
control-plane-kubeadm   capi-kubeadm-control-plane-system   ControlPlaneProvider     v1.5.2            Already up to date
cluster-api             capi-system                         CoreProvider             v1.5.2            Already up to date
infrastructure-azure    capz-system                         InfrastructureProvider   v1.11.1           Already up to date

The same thing happened again today. The VMSS serial logs showed:

[FAILED] Failed to start Execute cloud user/final scripts.
[  338.420057] cloud-init[1554]: [2023-09-26 17:03:23] Cloud-init v. 23.2.2-0ubuntu0~22.04.1 running 'modules:final' at Tue, 26 Sep 2023 17:03:23 +0000. Up 33.74 seconds.
See 'systemctl status cloud-final.service' for details.
[  338.420195] cloud-init[1554]: [2023-09-26 17:03:25] [preflight] Running pre-flight checks
[  OK  ] Reached target Cloud-init target.
[  338.420593] cloud-init[1554]: [2023-09-26 17:08:27] error execution phase preflight: couldn't validate the identity of the API Server: could not find a JWS signature in the cluster-info ConfigMap for token ID "e9xgm3"
[  338.421017] cloud-init[1554]: [2023-09-26 17:08:27] To see the stack trace of this error execute with --v=5 or higher
[  338.421413] cloud-init[1554]: [2023-09-26 17:08:27] 2023-09-26 17:08:27,867 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
[  338.421840] cloud-init[1554]: [2023-09-26 17:08:27] 2023-09-26 17:08:27,867 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
[  338.422256] cloud-init[1554]: [2023-09-26 17:08:27] Cloud-init v. 23.2.2-0ubuntu0~22.04.1 finished at Tue, 26 Sep 2023 17:08:27 +0000. Datasource DataSourceAzure [seed=/dev/sr0].  Up 338.38 seconds
2023-09-26T17:18:15.965353Z INFO Daemon Agent WALinuxAgent-2.9.1.1 launched with command 'python3 -u bin/WALinuxAgent-2.9.1.1-py3.8.egg -run-exthandlers' is successfully running

Any pointers?

CecileRobertMichon commented 11 months ago

Are those VMs getting created as part of the original cluster creation or is this a scaling event outside of CAPZ by any chance (autoscaler, Azure portal, etc)?

Can you please share repro steps?

error execution phase preflight: couldn't validate the identity of the API Server: could not find a JWS signature in the cluster-info ConfigMap for token ID "e9xgm3"

Seems like the bootstrap token isn't valid.
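
For anyone hitting the same error: kubeadm's token-based discovery looks in the cluster-info ConfigMap (namespace kube-public) for a key named jws-kubeconfig-<token ID>. That signature is only present while the matching bootstrap token exists and has not expired, so a missing key often points at an expired or deleted token. Below is a minimal sketch of that check, assuming you have saved the output of `kubectl -n kube-public get configmap cluster-info -o json` as a string. The helper names and the sample data are hypothetical and not part of CAPZ or kubeadm.

```python
import json
from typing import List

# Key prefix kubeadm uses for per-token signatures in cluster-info.
JWS_PREFIX = "jws-kubeconfig-"


def signed_token_ids(configmap_json: str) -> List[str]:
    """List the token IDs that currently have a JWS signature in cluster-info."""
    data = json.loads(configmap_json).get("data") or {}
    return sorted(key[len(JWS_PREFIX):] for key in data if key.startswith(JWS_PREFIX))


def has_jws_signature(configmap_json: str, token_id: str) -> bool:
    """Return True if cluster-info carries a signature for the given token ID."""
    return token_id in signed_token_ids(configmap_json)


# Hypothetical sample: a ConfigMap signed for token "abcdef" only.
sample = json.dumps({
    "data": {
        "kubeconfig": "...",
        "jws-kubeconfig-abcdef": "eyJ...",
    }
})

print(has_jws_signature(sample, "e9xgm3"))  # the token ID from the log above
print(signed_token_ids(sample))
```

If the token ID from the failing node is not in the list, the node was handed a token that the control plane no longer recognizes, which would be consistent with the preflight error quoted above.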

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 4 months ago

/lifecycle stale

k8s-triage-robot commented 3 months ago

/lifecycle rotten

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 2 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/3903#issuecomment-2197803658).

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
