kubernetes-sigs / image-builder

Tools for building Kubernetes disk images
https://image-builder.sigs.k8s.io/
Apache License 2.0

Cloud-init fails for ubuntu 20.04 base AMI and Cloud-init version '23.3.1-0ubuntu1~20.04.1' #1333

Closed. Opened by supershal; closed 1 month ago.

supershal commented 9 months ago

What steps did you take and what happened:

The latest cloud-init version, 23.3.1-0ubuntu1~20.04.1, which ships with the base AMI for Ubuntu 20.04, is unable to run the boothook (https://cloudinit.readthedocs.io/en/latest/explanation/format.html#cloud-boothook) provided by CAPA (https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/0bf78b04b305a77aec37a68c107102231faa7a16/pkg/cloud/services/secretsmanager/secret_fetch_script.go#L20). As a result, the CAPA VMs do not initialize as expected.

Steps to reproduce:

  1. Create an AMI using image-builder:

    make build-ami-ubuntu-2004

  2. Create a CAPA cluster from the AMI built in step 1, following the instructions at https://cluster-api-aws.sigs.k8s.io/getting-started.html

  3. Check the logs at /var/log/cloud-init-output.log

What did you expect to happen: Cloud-init runs successfully on the VM.

Anything else you would like to add: Log from cloud-init:

[2023-10-24 18:53:21] 2023-10-24 18:53:21,892 - util.py[WARNING]: failed stage init
[2023-10-24 18:53:21] failed run of stage init
[2023-10-24 18:53:21] ------------------------------------------------------------
[2023-10-24 18:53:21] Traceback (most recent call last):
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/url_helper.py", line 78, in read_file_or_url
[2023-10-24 18:53:21]     with open(file_path, "rb") as fp:
[2023-10-24 18:53:21] FileNotFoundError: [Errno 2] No such file or directory: '/etc/secret-userdata.txt'
[2023-10-24 18:53:21]
[2023-10-24 18:53:21] The above exception was the direct cause of the following exception:
[2023-10-24 18:53:21]
[2023-10-24 18:53:21] Traceback (most recent call last):
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 238, in _do_include
[2023-10-24 18:53:21]     resp = read_file_or_url(
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/url_helper.py", line 84, in read_file_or_url
[2023-10-24 18:53:21]     raise UrlError(cause=e, code=code, headers=None, url=url) from e
[2023-10-24 18:53:21] cloudinit.url_helper.UrlError: [Errno 2] No such file or directory: '/etc/secret-userdata.txt'
[2023-10-24 18:53:21]
[2023-10-24 18:53:21] The above exception was the direct cause of the following exception:
[2023-10-24 18:53:21]
[2023-10-24 18:53:21] Traceback (most recent call last):
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 766, in status_wrapper
[2023-10-24 18:53:21]     ret = functor(name, args)
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 453, in main_init
[2023-10-24 18:53:21]     init.update()
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 484, in update
[2023-10-24 18:53:21]     self._store_processeddata(self.datasource.get_userdata(), "userdata")
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 599, in get_userdata
[2023-10-24 18:53:21]     self.userdata = self.ud_proc.process(self.get_userdata_raw())
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 88, in process
[2023-10-24 18:53:21]     self._process_msg(convert_string(blob), accumulating_msg)
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 159, in _process_msg
[2023-10-24 18:53:21]     self._do_include(payload, append_msg)
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 264, in _do_include
[2023-10-24 18:53:21]     _handle_error(message, urle)
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 72, in _handle_error
[2023-10-24 18:53:21]     raise RuntimeError(error_message) from source_exception
[2023-10-24 18:53:21] RuntimeError: [Errno 2] No such file or directory: '/etc/secret-userdata.txt' for url: file:///etc/secret-userdata.txt
[2023-10-24 18:53:21] ------------------------------------------------------------
[2023-10-24 18:53:40] Cloud-init v. 23.3.1-0ubuntu1~20.04.1 running 'modules:config' at Tue, 24 Oct 2023 18:53:37 +0000. Up 42.69 seconds.
[2023-10-24 18:53:40] Cloud-init v. 23.3.1-0ubuntu1~20.04.1 running 'modules:final' at Tue, 24 Oct 2023 18:53:40 +0000. Up 46.25 seconds.
[2023-10-24 18:53:40] Cloud-init v. 23.3.1-0ubuntu1~20.04.1 finished at Tue, 24 Oct 2023 18:53:40 +0000. Datasource DataSourceEc2Local.  Up 46.42 second
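
For readers triaging the same failure, the two signatures in the log above (a failed stage and an include file that could not be read) can be pulled out of the log text programmatically. The following is a minimal Python sketch and is not part of image-builder or cloud-init. The function name and both regular expressions are assumptions derived only from this excerpt, so other cloud-init versions may word these messages differently.

```python
import re

# Both patterns are assumptions based on the log excerpt in this issue.
FAILED_STAGE = re.compile(r"failed run of stage (\S+)")
MISSING_INCLUDE = re.compile(
    r"RuntimeError: \[Errno 2\] No such file or directory: '([^']+)' for url: (\S+)"
)


def summarize_cloud_init_failure(log_text):
    """Return (failed_stages, missing_includes) found in cloud-init output text.

    failed_stages is a list of stage names; missing_includes is a list of
    (path, url) tuples for include targets that did not exist.
    """
    stages = FAILED_STAGE.findall(log_text)
    missing = MISSING_INCLUDE.findall(log_text)
    return stages, missing
```

To use it, read /var/log/cloud-init-output.log on the affected VM into a string and pass it to summarize_cloud_init_failure. A non-empty second list naming /etc/secret-userdata.txt would match the symptom reported here.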

Environment:

Project (Image Builder for Cluster API:

Additional info for Image Builder for Cluster API related issues:

/kind bug

supershal commented 9 months ago

We were able to downgrade cloud-init to 23.2.1-0ubuntu0~20.04.2 and create a cluster successfully (see https://github.com/mesosphere/konvoy-image-builder/pull/938). cc: @voor @cnmcavoy

We are still not sure of the root cause, or of which change in cloud-init caused this issue.

supershal commented 9 months ago

I was able to provide the following override file to image-builder and build an AMI that runs the CAPA cloud-init script successfully. pin-cloud-init-override.json:

{
    "ansible_extra_vars": "pinned_debs=\"cloud-init=23.1.2-0ubuntu0~20.04.2\""
}

I built the image using the following image-builder Makefile target: make build-ami-ubuntu-2004 PACKER_VAR_FILES=pin-cloud-init-override.json

We now have to investigate which changes in 23.3.1-0ubuntu1~20.04.1 broke the CAPA cloud-init script.
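
If the pinned version needs to change between builds, the override file shown in this comment can be generated instead of edited by hand. This is a small Python sketch under that assumption. The helper names pin_override and render_pin_override are hypothetical, and the only image-builder inputs it relies on are the ansible_extra_vars key and the pinned_debs variable already used above.

```python
import json


def pin_override(package, version):
    """Build var-file content that pins one Debian package via pinned_debs."""
    # Mirrors the structure of pin-cloud-init-override.json from this comment.
    return {"ansible_extra_vars": f'pinned_debs="{package}={version}"'}


def render_pin_override(package, version):
    """Serialize the override as JSON text, ready to be written to a file."""
    return json.dumps(pin_override(package, version), indent=4)
```

Writing the output of render_pin_override("cloud-init", "23.1.2-0ubuntu0~20.04.2") to pin-cloud-init-override.json should yield a file equivalent to the one above, which can then be passed through PACKER_VAR_FILES.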

voor commented 8 months ago

Moving over some comments from slack so they're not lost in the sands of time:

- name: Downgrade cloud init.
  apt:
    deb: http://launchpadlibrarian.net/679992659/cloud-init_23.2.2-0ubuntu0~20.04.1_all.deb
    state: present
    force: true

- name: Pin cloud init to prevent version issues.
  dpkg_selections:
    name: "{{ item }}"
    selection: hold
  loop:
    - cloud-init

dlipovetsky commented 8 months ago

For image-builder users who have hit this bug and are reading this issue:

We believe the root cause is in cloud-init, and would like to fix it there (see https://github.com/canonical/cloud-init/issues/4572). We prefer this to the alternative, which is to "pin" an older, known-good cloud-init version in image-builder itself.

For now, if you use image-builder to create an Ubuntu 20.04 AMI, please use the workaround described in https://github.com/kubernetes-sigs/image-builder/issues/1333#issuecomment-1782093042.

dlipovetsky commented 6 months ago

This might be related to https://github.com/kubernetes-sigs/image-builder/pull/406 which historically caused issues with CAPA.

@supershal and I found that the feature override mechanism used in #406 does not work in recent versions of cloud-init on Ubuntu 20.04. This mechanism was removed from cloud-init in https://github.com/canonical/cloud-init/pull/4228.

Patching cloud-init is the officially documented mechanism now:

Currently used upstream values for feature flags are set in cloudinit/features.py. Overrides to these values should be patched directly (e.g., via quilt patch) by downstreams.

I guess modifying the cloud-init Python module to set ERROR_ON_USER_DATA_FAILURE = False is something image-builder can do for now. But once Ubuntu 20.04 is EOL, the feature flag itself will be removed.
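
As an illustration of what such a modification could look like, here is a minimal Python sketch that rewrites the flag in the text of cloudinit/features.py. It is not an image-builder feature or an official cloud-init tool. The function name is hypothetical, and it assumes the flag appears as a plain top-level assignment of the form ERROR_ON_USER_DATA_FAILURE = True. The install location is also an assumption: the traceback in this issue suggests /usr/lib/python3/dist-packages/cloudinit/features.py on Ubuntu 20.04.

```python
import re

# Assumes the flag is a top-level assignment at the start of a line.
FLAG_LINE = re.compile(
    r"^(ERROR_ON_USER_DATA_FAILURE\s*=\s*)True\b", re.MULTILINE
)


def disable_user_data_failure(features_source):
    """Return (patched_source, changed) with the flag set to False.

    changed is False when no matching assignment was found, for example
    because the flag is already False or has been removed upstream.
    """
    patched, count = FLAG_LINE.subn(r"\g<1>False", features_source)
    return patched, count > 0
```

A build step could read features.py, call disable_user_data_failure on its contents, and write the result back only when changed is True. Failing the build when changed is False would surface the case where a later cloud-init release has dropped the flag.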

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

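This bot triages issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage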

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 month ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/image-builder/issues/1333#issuecomment-2169061721):

>The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
>This bot triages issues according to the following rules:
>- After 90d of inactivity, `lifecycle/stale` is applied
>- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
>- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
>You can:
>- Reopen this issue with `/reopen`
>- Mark this issue as fresh with `/remove-lifecycle rotten`
>- Offer to help out with [Issue Triage][1]
>
>Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
>/close not-planned
>
>[1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.