vmware / photon

Minimal Linux container host
https://vmware.github.io/photon

cloud-init: write_files: does not run on instance creation (init runs twice in logs?) #1110

Open prologic opened 3 years ago

prologic commented 3 years ago

Hello Team,

We have a very simple Packer template and some Terraform that creates VM resources from the resulting template. We have noticed some strange behaviour when it comes to using cloud-init; specifically, write_files does not run on initial instance creation.

Our Packer template is nothing fancy and mostly just consists of:

{
  "variables": {
    "iso_url": "https://packages.vmware.com/photon/3.0/Rev3/iso/photon-minimal-3.0-a383732.iso",
    "iso_checksum": "c2883a42e402a2330d9c39b4d1e071cf9b3b5898",
    "vm-cpu-num": "1",
    "vm-disk-size": "25600",
    "vm-mem-size": "1024",
    "vm-name": "packer-devops-photon-3",
    "vm_notes": "Photon OS 3.x - {{isotime \"2006-01-02\"}}",
    "vsphere-cluster": "",
    "vsphere-datacenter": "",
    "vsphere-datastore": "",
    "vsphere-network": "",
    "vsphere-server": "",
    "vsphere-host": "",
    "vsphere-user": "{{ env `VSPHERE_USERNAME` }}",
    "vsphere-password": "{{ env `VSPHERE_PASSWORD` }}"
  },
  "builders": [
    {
      "CPUs": 1,
      "RAM": 1024,
      "RAM_reserve_all": false,
      "boot_command": [
        "<esc><wait>",
        "vmlinuz initrd=initrd.img root=/dev/ram0 loglevel=3 ks=http://x.x.x.x/repository/ks/kickstart.json photon.media=cdrom",
        "<enter>"
      ],
      "shutdown_command": "shutdown -h now",
      "cluster": "{{user `vsphere-cluster`}}",
      "convert_to_template": true,
      "datacenter": "{{user `vsphere-datacenter`}}",
      "datastore": "{{user `vsphere-datastore`}}",
      "host": "{{user `vsphere-host`}}",
      "iso_target_path": "[xxx] packer_cache/",
      "disk_controller_type": "pvscsi",
      "guest_os_type": "vmwarePhoton64Guest",
      "iso_url": "{{user `iso_url`}}",
      "iso_checksum": "{{user `iso_checksum`}}",
      "network_adapters": [
        {
          "network": "{{user `vsphere-network`}}",
          "network_card": "vmxnet3"
        }
      ],
      "notes": "{{user `vm_notes`}}",
      "password": "{{user `vsphere-password`}}",
      "ssh_username": "root",
      "ssh_password": "***",
      "storage": [
        {
          "disk_size": 25000,
          "disk_thin_provisioned": true
        }
      ],
      "type": "vsphere-iso",
      "username": "{{user `vsphere-user`}}",
      "vcenter_server": "{{user `vsphere-server`}}",
      "vm_name": "{{user `vm-name`}}",
      "folder": "aom/templates"
    }
  ],
  "provisioners": [
    {
      "type": "shell",
      "scripts": [
        "./scripts/update.sh",
        "./scripts/setup.sh",
        "./scripts/cleanup.sh"
      ]
    },
    {
      "type": "shell",
      "inline": "echo 'Template build complete ({{isotime \"2006-01-02 03:04:05\"}})!'"
    }
  ]
}

The provisioning scripts don't do much either:

./scripts/update.sh:

#!/bin/sh

set -e

# Update
# tdnf -y update

./scripts/setup.sh:

#!/bin/sh

set -e

systemctl start docker
systemctl enable docker
docker plugin install grafana/loki-docker-driver:latest --alias loki --grant-all-permissions

And finally:

./scripts/cleanup.sh:

#!/bin/sh

set -ex

# Clean up TDNF cache
tdnf clean all

# Clean up logs
cat /dev/null > /var/log/wtmp 2> /dev/null
#logrotate -f /etc/logrotate.conf 2> /dev/null
find /var/log -type f -delete
rm -rf /var/log/journal/*
rm -f /var/lib/dhcp/*

# Cleanup cloud-init
systemctl stop cloud-init
cloud-init clean

# Cleanup root account
unset HISTFILE && history -c && rm -fr /root/.bash_history

# Cleanup system config
echo > /etc/machine-id
echo > /etc/hostname

# Zero out the free space to save space in the final image, blocking 'til
# written; otherwise the disk image won't be zeroed, and/or Packer will try to
# kill the box while the disk is still full, and that's bad. The dd will run
# 'til failure, so (due to the 'set -e' above) ignore that failure. Also,
# really make certain that both the zeros and the file removal really sync; the
# extra sleep 1 and sync shouldn't be necessary, but...
#dd if=/dev/zero of=/EMPTY bs=1M || true
sync
sleep 1
sync
#rm -f /EMPTY
#sync
#sleep 1
#sync

Note that we run cloud-init clean during the cleanup phase. The result gets packaged, uploaded to vSphere, and we boot new instances from it. But this is where cloud-init falls over.
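
(A quick, hedged sanity check that could go at the end of cleanup.sh; the paths assume the default cloud-init state layout under /var/lib/cloud and are not taken from the Photon image itself:)

# After `cloud-init clean` the per-instance state should be gone, so the next
# boot is treated as a brand-new instance; warn if anything survived the clean.
if [ -d /var/lib/cloud/instances ]; then
    echo "WARNING: per-instance cloud-init state still present" >&2
fi
if [ -e /var/lib/cloud/data/instance-id ]; then
    echo "WARNING: cached instance-id still present" >&2
fi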

Attached is the cloud-init.log from one of our instances where we see this behavior.

Once the instance is up and running (with the half-run cloud-init config), if we shell into it and run cloud-init clean && reboot, it correctly applies all of the cloud-init config. But this is not ideal, as cloud-init doesn't run correctly in the first place on initial creation/boot.

Thanks! cloud-init.log

prologic commented 3 years ago

@ssahani BTW, just saw your label. Just to let you know, however, that we've also tried v4.0-beta and it exhibits the same behavior.

ssahani commented 3 years ago

/cc @keerthanakalyan @sshedi

sshedi commented 3 years ago

Hi @prologic,

Looks like there is some contention going on in your setup. cloud-init clean is being done before cloud-init finished running all stages.

Also, please attach your cloud config file from /etc/cloud/cloud.cfg and the userdata file you are providing to cloud-init.

FYI, cloud-init clean won't clear the log files, so upon next run newer logs will just get appended to older ones.

You can use cloud-init clean -ls (more info at: https://cloudinit.readthedocs.io/en/latest/topics/cli.html)
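
(For reference, a short sketch of the difference, based on the documented flags:)

# Removes the /var/lib/cloud state only; /var/log/cloud-init*.log is kept, so
# the next run appends to the old log:
cloud-init clean

# Also removes the logs (-l/--logs) and any local seed data (-s/--seed):
cloud-init clean -ls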

prologic commented 3 years ago

Looks like there is some contention going on in your setup. cloud-init clean is being done before cloud-init finished running all stages.

Interesting. We do however run systemctl stop cloud-init before running cloud-init clean. Surely this can't be the cause?

For example, I put a sleep 300 in the shell provisioning scripts, which run after the initial system is installed to disk from the ISO via kickstart, the machine is rebooted, and SSH becomes available. Same behavior.

Either I'm doing something incredibly wrong here, or something else is going on.
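
(One hedged way to rule the ordering question in or out without a fixed sleep, assuming the cloud-init build on Photon supports the status subcommand, is to block until all stages report done before stopping and cleaning:)

# Block until cloud-init reports that every stage has finished, then clean.
cloud-init status --wait
cloud-init status --long   # record the per-stage detail in the Packer build log
systemctl stop cloud-init
cloud-init clean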

prologic commented 3 years ago

FWIW I can confirm that the Packer build and provisioning shell scripts are running correctly. I did something like this (greatly simplified):

touch /root/BUILT_BY_PACKER
sleep 300
...
systemctl stop cloud-init
cloud-init clean -ls

Booting these images, I can see the timestamp of /root/BUILT_BY_PACKER is a full ~8-9 minutes earlier. The entries in /var/log/cloud-init.log all correctly match the boot-up time of the instances. What's weird is that, yes, you are right: I can see that it's running things twice for some strange reason. But I'm not sure that's anything I've caused :/
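
(A few checks that might help pin down the apparent double run on a booted instance; the grep pattern assumes the usual "Cloud-init v. ... running '...'" log lines and is a guess rather than something taken from the attached log:)

# Each boot stage logs one "running 'init-local' / 'init' / 'modules:...'" line;
# duplicates here would show the stages executing twice:
grep "running '" /var/log/cloud-init.log

# More than one directory here would suggest cloud-init saw two different
# instance-ids (i.e. the datasource was detected twice):
ls /var/lib/cloud/instances/
cat /var/lib/cloud/data/instance-id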

sshedi commented 3 years ago

Can you give the cloud init log after using sleep 300 before clean? Also your userdata and cloud.cfg as I mentioned earlier?

prologic commented 3 years ago

Can you give the cloud init log after using sleep 300 before clean? Also your userdata and cloud.cfg as I mentioned earlier?

Here you go:

user-data:

#cloud-config
preserve_hostname: false
hostname: dm1-test2
fqdn: dm1-test2.xxx

users:
  - name: root
    lock_passwd: false
    hashed_passwd: xxx
    ssh-authorized-keys:
     - ssh-rsa xxx
    shell: /bin/bash

write_files:
  - path: /etc/docker/daemon.json
    content: |
      {
          "debug" : false,
          "log-driver": "loki",
          "log-opts": {
              "loki-url": "http://127.0.0.1:3100/loki/api/v1/push"
          }
      }

final_message: "System initialisation finished after $UPTIME seconds"

cloud-init.log.1.txt cloud.cfg.txt

prologic commented 3 years ago

Not sure if this helps at all, but we've switched over to using Ubuntu 20.04 just to get something going in our vSphere environment with cloud-init support, and ran into a number of very different problems there, which my colleague managed to get to the bottom of with the following fixes in our shell "cleanup" provisioner:

echo "cleanup cloud-init"
curl -sSL https://raw.githubusercontent.com/vmware/cloud-init-vmware-guestinfo/master/install.sh | sudo sh -
sudo systemctl stop cloud-init
sudo rm -rf /var/lib/cloud/ /run/cloud-init
sudo cloud-init clean
sudo vmware-rpctool 'info-set guestinfo.metadata .'
sudo vmware-rpctool 'info-set guestinfo.userdata .'

# Fixes the "Virtual machine customization failed" during terraform deployment
echo "vSphere fix https://kb.vmware.com/s/article/56409"
sudo sed -i 's/D \/tmp 1777 root root -/#D \/tmp 1777 root root -/g' /usr/lib/tmpfiles.d/tmp.conf
sudo sed -i '/After=systemd-remount-fs.service systemd-tmpfiles-setup.service systemd-modules-load.service/a After=dbus.service' /lib/systemd/system/open-vm-tools.service

# Fixes cloud-init from not grabbing vsphere userdata
echo "vSphere fix https://kb.vmware.com/s/article/80934"
sudo rm -rf /etc/cloud/cloud.cfg.d/subiquity-disable-cloudinit-networking.cfg
sudo rm -rf /etc/cloud/cloud.cfg.d/99-installer.cfg

echo "cleanup system config"
sudo sh -c '> /etc/machine-id'
sudo sh -c '> /etc/hostname'

None of these workarounds seem relevant here for VMware Photon OS specifically, I don't think (I did try them, just to rule them out)...

sshedi commented 3 years ago

I have a question for you:

Does the /etc/docker directory exist upon first boot? I think not. Upon doing systemctl start docker, /etc/docker gets created. Give it a try.

sshedi commented 3 years ago

Okay, my previous observation is not the cause. cloud-init will ensure the directory is present before writing the file, creating it if necessary.

I used your sample yaml config and ran it in my environment and things seem to be fine.

2021-02-24 08:37:13,175 - helpers.py[DEBUG]: Running config-write-files using lock (<FileLock using file '/var/lib/cloud/instances/f9652e42-6aa8-31d5-0813-36e147db5630/sem/config_write_files'>)
2021-02-24 08:37:13,175 - util.py[DEBUG]: Writing to /etc/docker/daemon.json - wb: [644] 136 bytes
2021-02-24 08:37:13,175 - util.py[DEBUG]: Changing the ownership of /etc/docker/daemon.json to 0:0
2021-02-24 08:37:13,175 - handlers.py[DEBUG]: finish: init-network/config-write-files: SUCCESS: config-write-files ran successfully

My question now is: how are you providing your userdata YAML config file to cloud-init? I mean, how is cloud-init consuming your YAML input file?

prologic commented 3 years ago

Via VMware GuestInfo.
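
(For completeness, roughly how userdata usually reaches the GuestInfo datasource; the key names and the base64 encoding hint below are the commonly used ones and are an assumption here, not something confirmed from this particular setup. user-data.yml is a placeholder filename:)

# Base64-encode the cloud-config and publish it, plus its encoding, via guestinfo:
vmware-rpctool "info-set guestinfo.userdata $(base64 -w0 user-data.yml)"
vmware-rpctool "info-set guestinfo.userdata.encoding base64"

# Verify what the guest actually sees:
vmware-rpctool 'info-get guestinfo.userdata' | base64 -d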

sshedi commented 3 years ago

I did the following:

$ cloud-init clean
$ systemctl stop cloud-init
$ vmware-rpctool "info-set guestinfo.userdata $(cat config.yml)"
$ systemctl restart cloud-init*

One more hunch is, what if tdnf -y update is causing contention while cloud-init is in the middle of something?

prologic commented 3 years ago

One more hunch is, what if tdnf -y update is causing contention while cloud-init is in the middle of something?

I've already ruled that out. In my tests I have this commented out:

update.sh:

#!/bin/sh

set -e

# Touch a file
touch /root/BUILT_BY_PACKER

# Wait for the system to settle
sleep 300

# Update
# tdnf -y update

prologic commented 3 years ago

Also, FWIW, the user-data being pushed to VMware GuestInfo via Terraform when creating these instances is correct. I can see it once the VM boots up:

root@dm1-test2 [ ~ ]# vmware-rpctool 'info-get guestinfo.userdata' | base64 -d
#cloud-config
preserve_hostname: false
hostname: dm1-test2
fqdn: dm1-test2.xxx

users:
  - name: root
    lock_passwd: false
    hashed_passwd: xxx
    ssh-authorized-keys:
     - ssh-rsa xxx
    shell: /bin/bash

write_files:
  - path: /etc/docker/daemon.json
    content: |
      {
          "debug" : false,
          "log-driver": "loki",
          "log-opts": {
              "loki-url": "http://127.0.0.1:3100/loki/api/v1/push"
          }
      }

final_message: "System initialisation finished after $UPTIME seconds"

The problem is this pesky write_files that doesn't run correctly, and as pointed out earlier in this issue, there are hints that some part of cloud-init's init process is running twice or something.

I don't understand why this doesn't work.
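
(One more hedged check that might narrow it down: cloud-init records a per-instance semaphore for each config module it runs, so on a booted instance you could look for the write_files marker; the paths assume the default layout visible in the log snippet earlier in this thread:)

# If config-write-files ran for this instance, its semaphore should be listed:
ls /var/lib/cloud/instances/*/sem/ | grep -i write_files

# And the log should show the module being invoked (or skipped):
grep -i write_files /var/log/cloud-init.log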

sshedi commented 3 years ago

Okay, I will give it one more shot. Please give the complete log using cloud-init collect-logs and attach the tarball here.

And do this on a freshly set-up instance.
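
(For reference, a minimal sketch of gathering that tarball; by default collect-logs writes cloud-init.tar.gz into the current directory:)

# Bundle the cloud-init logs and run data into a tarball for attaching here:
cloud-init collect-logs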

arunmk commented 2 years ago

Hi folks, I am seeing a similar issue where some of the write_files sections are sporadically not executed. A cloud-init clean and reboot fixes it.

This is, however, on Ubuntu 20.04. How do we know if the issues are related?

craig-fields commented 10 months ago

I had a problem with the same symptoms a couple of years ago - write_files appears to do nothing (when going from Ubuntu 16, where these scripts I inherited used to work, to Ubuntu 20, where they no longer do).

I tracked this down to the fact that I was writing files into the /opt directory, which for me is actually a mountpoint.

As of Ubuntu 20, the files would be written into the /opt directory before the partition was mounted, then the partition would be mounted, hiding the files that were written.

Then my command section, running after the mount, could not find the files. Once you know the problem of course, there are workarounds.

I do not know in this context if perhaps /etc/docker is actually a mountpoint as well.
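
(A quick way to test that hypothesis on an affected instance; a sketch using standard util-linux tools, not something taken from this thread:)

# Show which filesystem /etc/docker actually resolves to, and whether it (or a
# parent such as /etc) is a separate mountpoint that could hide earlier writes:
findmnt --target /etc/docker
mountpoint -q /etc/docker && echo "/etc/docker is its own mountpoint"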