prologic opened this issue 3 years ago
@ssahani BTW, just saw your label. Just to let you know, however, that we've also tried v4.0-beta and it exhibits the same behavior.
/cc @keerthanakalyan @sshedi
Hi @prologic,
Looks like there is some contention going on in your setup: cloud-init clean is being run before cloud-init has finished running all stages.
Also, please attach your cloud config file (/etc/cloud/cloud.cfg) and the userdata file you are providing to cloud-init.
FYI, cloud-init clean won't clear the log files, so on the next run newer logs will just get appended to the older ones.
You can use cloud-init clean -ls to clear those as well.
(more info at: https://cloudinit.readthedocs.io/en/latest/topics/cli.html)
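In other words, roughly this (a minimal sketch; -l also removes the log files and -s the seed directory, per the CLI docs above):
# Remove the /var/lib/cloud artifacts but keep the logs (default behaviour)
cloud-init clean
# Also remove /var/log/cloud-init*.log and the seed directory
cloud-init clean -ls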
Looks like there is some contention going on in your setup: cloud-init clean is being run before cloud-init has finished running all stages.
Interesting. We do however run systemctl stop cloud-init before running cloud-init clean (the exact ordering is sketched below). Surely this can't be the cause?
I've tried adding delays too: for example, putting a sleep 300 in the shell provisioning scripts that get run after the initial system is installed to disk from the ISO via kickstart, the machine is rebooted, and SSH becomes available. Same behavior.
Either I'm doing something incredibly wrong here, or something else is going on.
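For reference, the exact ordering we use in the cleanup step is just the two commands below; the commented-out cloud-init status --wait line is only an idea for ruling out the race completely (assuming that subcommand blocks until all stages have finished, as documented):
# Current ordering in the cleanup provisioner
systemctl stop cloud-init
cloud-init clean
# Possible alternative to rule the race out entirely:
# cloud-init status --wait   # blocks until cloud-init reports done or error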
FWIW I can confirm that the Packer build and provisioning shell scripts are running correctly. I did something like this (greatly simplified):
touch /root/BUILT_BY_PACKER
sleep 300
...
systemctl stop cloud-init
cloud-init clean -ls
Booting these images, I can see the timestamp of /root/BUILT_BY_PACKER is from a full ~8-9 minutes earlier, while the entries in /var/log/cloud-init.log all correctly show the boot-up time of the instances. What's weird is that you are right: I can see that it's running things twice for some strange reason. But I'm not sure that's anything I've caused :/
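For what it's worth, the duplication is easy to count from the log itself; a rough check (the exact banner text varies between cloud-init versions, so treat the grep pattern as approximate):
# Count how many times each stage banner appears in the log; on a single
# boot each stage should normally show up once.
grep -o "running '[^']*'" /var/log/cloud-init.log | sort | uniq -c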
Can you give the cloud-init log after using sleep 300 before clean? Also your userdata and cloud.cfg, as I mentioned earlier?
Can you give the cloud-init log after using sleep 300 before clean? Also your userdata and cloud.cfg, as I mentioned earlier?
Here you go:
user-data:
#cloud-config
preserve_hostname: false
hostname: dm1-test2
fqdn: dm1-test2.xxx
users:
  - name: root
    lock_passwd: false
    hashed_passwd: xxx
    ssh-authorized-keys:
      - ssh-rsa xxx
    shell: /bin/bash
write_files:
  - path: /etc/docker/daemon.json
    content: |
      {
        "debug" : false,
        "log-driver": "loki",
        "log-opts": {
          "loki-url": "http://127.0.0.1:3100/loki/api/v1/push"
        }
      }
final_message: "System initialisation finished after $UPTIME seconds"
Not sure if this helps at all, but we've switched over to using Ubuntu 20.04 just to get something going in our vSphere environment with cloud-init support, and ran into a number of very different problems there, which my colleague managed to get to the bottom of with the following fixes in our shell "cleanup" provisioner:
echo "cleanup cloud-init"
curl -sSL https://raw.githubusercontent.com/vmware/cloud-init-vmware-guestinfo/master/install.sh | sudo sh -
sudo systemctl stop cloud-init
sudo rm -rf /var/lib/cloud/ /run/cloud-init
sudo cloud-init clean
sudo vmware-rpctool 'info-set guestinfo.metadata .'
sudo vmware-rpctool 'info-set guestinfo.userdata .'
# Fixes the "Virtual machine customization failed" during terraform deployment
echo "vSphere fix https://kb.vmware.com/s/article/56409"
sudo sed -i 's/D \/tmp 1777 root root -/#D \/tmp 1777 root root -/g' /usr/lib/tmpfiles.d/tmp.conf
sudo sed -i '/After=systemd-remount-fs.service systemd-tmpfiles-setup.service systemd-modules-load.service/a After=dbus.service' /lib/systemd/system/open-vm-tools.service
# Fixes cloud-init from not grabbing vsphere userdata
echo "vSphere fix https://kb.vmware.com/s/article/80934"
sudo rm -rf /etc/cloud/cloud.cfg.d/subiquity-disable-cloudinit-networking.cfg
sudo rm -rf /etc/cloud/cloud.cfg.d/99-installer.cfg
echo "cleanup system config"
sudo sh -c '> /etc/machine-id'
sudo sh -c '> /etc/hostname'
None of these "work-arounds" seem relevant here for VMware Photon OS specifically, though (I did try them just to rule them out)...
I have a question for you: does the /etc/docker directory exist upon first boot? I think not.
Upon doing systemctl start docker, /etc/docker gets created. Give it a try.
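A quick way to check on a freshly booted instance, before docker has ever been started:
# Does the directory already exist at first boot?
ls -ld /etc/docker 2>/dev/null || echo "/etc/docker does not exist yet"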
Okay, my previous observation is not the cause: cloud-init ensures the directory is present before writing the file, creating it if necessary.
I used your sample yaml config and ran it in my environment and things seem to be fine.
2021-02-24 08:37:13,175 - helpers.py[DEBUG]: Running config-write-files using lock (<FileLock using file '/var/lib/cloud/instances/f9652e42-6aa8-31d5-0813-36e147db5630/sem/config_write_files'>)
2021-02-24 08:37:13,175 - util.py[DEBUG]: Writing to /etc/docker/daemon.json - wb: [644] 136 bytes
2021-02-24 08:37:13,175 - util.py[DEBUG]: Changing the ownership of /etc/docker/daemon.json to 0:0
2021-02-24 08:37:13,175 - handlers.py[DEBUG]: finish: init-network/config-write-files: SUCCESS: config-write-files ran successfully
My question now is: how are you providing your userdata yaml config file to cloud-init? I mean, how is cloud-init consuming your yaml input file?
Via VMware Guest Info.
I did the following:
$ cloud-init clean
$ systemctl stop cloud-init
$ vmware-rpctool "info-set guestinfo.userdata $(cat config.yml)"
$ systemctl restart cloud-init
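If the payload is pushed base64-encoded instead of as raw YAML, the guestinfo datasource also needs to be told about the encoding; the guestinfo.userdata.encoding key below is an assumption about the datasource version in use, so double-check it against your datasource's docs:
# Hedged variant: push the user-data base64-encoded and declare the encoding
vmware-rpctool "info-set guestinfo.userdata $(base64 -w0 config.yml)"
vmware-rpctool "info-set guestinfo.userdata.encoding base64"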
One more hunch is, what if tdnf -y update is causing contention while cloud-init is in the middle of something?
One more hunch is, what if tdnf -y update is causing contention while cloud-init is in the middle of something?
I've already ruled that out. In my tests I have this commented out:
update.sh:
#!/bin/sh
set -e
# Touch a file
touch /root/BUILT_BY_PACKER
# Wait for the system to settle
sleep 300
# Update
# tdnf -y update
Also FWIW, the user-data being pushed to VMware Guestinfo via Terraform when creating these instances is correct. I can see it once the VM boots up:
root@dm1-test2 [ ~ ]# vmware-rpctool 'info-get guestinfo.userdata' | base64 -d
#cloud-config
preserve_hostname: false
hostname: dm1-test2
fqdn: dm1-test2.xxx
users:
  - name: root
    lock_passwd: false
    hashed_passwd: xxx
    ssh-authorized-keys:
      - ssh-rsa xxx
    shell: /bin/bash
write_files:
  - path: /etc/docker/daemon.json
    content: |
      {
        "debug" : false,
        "log-driver": "loki",
        "log-opts": {
          "loki-url": "http://127.0.0.1:3100/loki/api/v1/push"
        }
      }
final_message: "System initialisation finished after $UPTIME seconds"
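For completeness, the same content can also be cross-checked on the instance against what cloud-init itself recorded (a rough check, assuming the standard /var/lib/cloud layout; the query subcommand needs a reasonably recent cloud-init and root):
# What cloud-init actually stored for the current instance
cat /var/lib/cloud/instance/user-data.txt
# Or, on versions that ship the query subcommand:
cloud-init query userdata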
The problem is this pesky write_files that doesn't work correctly, and as pointed out earlier in this issue there are hints that some part of cloud-init's init process is running twice or something. I don't understand why this doesn't work.
Okay, I will give it one more shot. Please give the complete log using cloud-init collect-logs and attach the tarball here. And do this on a freshly set-up instance.
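Roughly this, on the affected instance (by default it should drop a cloud-init.tar.gz in the current directory; --include-userdata pulls in the user-data too, if your version supports that flag):
# Gather logs, config and run data into a single tarball for attaching here
cloud-init collect-logs --include-userdata
ls cloud-init.tar.gz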
Hi folks, I am seeing a similar issue where some of the write_files sections sporadically are not executed. A cloud-init clean and reboot fixes it.
This is however on Ubuntu 20.04. How do we know if the issues are related?
I had a problem with the same symptoms a couple of years ago - write_files appears to do nothing (when going from Ubuntu 16, where these scripts I inherited used to work, to Ubuntu 20, where they no longer do).
I tracked this down to the fact that I was writing files into the /opt directory, which is actually for me a mountpoint.
As of Ubuntu 20, the files would be written into the /opt directory before the partition was mounted, then the partition would be mounted, hiding the files that were written.
Then my command section, running after the mount, could not find the files. Once you know the problem of course, there are workarounds.
I do not know in this context if perhaps /etc/docker is actually a mountpoint as well.
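A quick way to rule that out on the affected instance (both commands are from util-linux, so they should be present on most distros):
# Is /etc/docker itself a mount point?
mountpoint /etc/docker
# And which filesystem actually backs that path
findmnt -T /etc/docker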
Hello Team,
We have a very simple Packer template and then some Terraform that creates VM resources using this template. We have noticed some strange behaviour when it comes to using cloud-init. Specifically:
- write_files: doesn't work correctly at all on first instance creation.
- init seems to (at least according to the cloud-init.log) run twice?
Our Packer template is nothing fancy and mostly just consists of:
The provisioning scripts don't do much either:
./scripts/update.sh:
./scripts/setup.sh:
And finally:
./scripts/cleanup.sh:
Note the cloud-init clean we run during the cleanup phase. This gets packaged and uploaded to vSphere and we boot from this. But this is where cloud-init falls over.
Attached is the cloud-init.log from one of our instances where we see this behavior.
Once the instance is up and running (with the half-run cloud-init config), if we shell into it and run cloud-init clean && reboot, it correctly runs all of the cloud-init config. But this is not ideal, as it doesn't run correctly in the first place on initial creation/boot.
Thanks!
cloud-init.log