rancher/os

Tiny Linux distro that runs the entire OS as Docker containers
https://rancher.com/docs/os/v1.x/en/
Apache License 2.0

mdadm failing on v1.5.1 #2718

Open · filoozom opened 5 years ago

filoozom commented 5 years ago

RancherOS Version: v1.5.1

Where are you running RancherOS? baremetal

Hi,

So I just upgraded from v1.3.0 to v1.5.1 and manually replaced grub with syslinux, since syslinux now seems to be the default and ros os upgrade doesn't change the grub configuration.

I have a fairly simple setup, with a USB drive for RANCHER_BOOT and a mdadm RAID 1 with a single ext4 partition for RANCHER_STATE. The issue is that I can't get mdadm to work with this new version.

I've got rancher.state.mdadm_scan rancher.state.dev=LABEL=RANCHER_STATE rancher.state.wait in the kernel parameters, and this is /var/log/boot/ros-bootstrap.log when I start in debug mode:

time="2019-03-18T22:51:18Z" level=debug msg="START: [ros-bootstrap] in /"
time="2019-03-18T22:51:18Z" level=debug msg=bootstrapAction
time="2019-03-18T22:51:20Z" level=debug msg="bootstrapAction: loadingConfig"
time="2019-03-18T22:51:20Z" level=debug msg="bootstrapAction: Rngd(true)"
time="2019-03-18T22:51:20Z" level=debug msg="bootstrapAction: MdadmScan(true)"
time="2019-03-18T22:51:20Z" level=error msg="Failed to run mdadm scan: exit status 1"
time="2019-03-18T22:51:20Z" level=debug msg="bootstrapAction: cryptsetup(false)"
time="2019-03-18T22:51:20Z" level=debug msg="bootstrapAction: LvmScan(false)"
time="2019-03-18T22:51:20Z" level=debug msg="bootstrapAction: stateScript()"
time="2019-03-18T22:51:20Z" level=debug msg="bootstrapAction: RunCommandSequence([])"
time="2019-03-18T22:52:06Z" level=debug msg="bootstrapAction: udev settle2"

Apparently, mdadm fails with exit status 1, although it works flawlessly once I get access to the console with mdadm --assemble --scan, which detects the drives without any issues.

Did something change with v1.5.1 or syslinux?

I saw #2065 but it doesn't seem related.
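
For reference, these kernel parameters come from the syslinux configuration on the boot partition. Mine looks roughly like this (the global.cfg path is from memory, so take it with a grain of salt):

$ sudo cat /boot/global.cfg   # path may differ depending on how syslinux was installed
APPEND rancher.state.mdadm_scan rancher.state.dev=LABEL=RANCHER_STATE rancher.state.wait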

niusmallnan commented 5 years ago

@filoozom We also run mdadm --assemble --scan in the bootstrap process. Can you check syslog or messages? Any error output there would be helpful.
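
To get more than the bare exit status, it could also help to run the same scan verbosely from the console once you have access, for example (just a diagnostic suggestion):

$ sudo mdadm --assemble --scan --verbose; echo "exit status: $?"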

filoozom commented 5 years ago

@niusmallnan To be honest, there's not a whole lot in any of those log files. I only see the drive initialisation:

Mar 19 00:25:16 rancher kernel: [   15.723215] sd 8:0:1:0: Attached scsi generic sg5 type 0
Mar 19 00:25:16 rancher kernel: [   15.726806] sd 8:0:1:0: [sdf] 488397168 512-byte logical blocks: (250 GB/233 GiB)
Mar 19 00:25:16 rancher kernel: [   15.728086] sd 8:0:1:0: [sdf] Write Protect is off
Mar 19 00:25:16 rancher kernel: [   15.728088] sd 8:0:1:0: [sdf] Mode Sense: 7f 00 10 08
Mar 19 00:25:16 rancher kernel: [   15.728633] sd 8:0:1:0: [sdf] Write cache: enabled, read cache: enabled, supports DPO and FUA
Mar 19 00:25:16 rancher kernel: [   15.743987] sd 8:0:1:0: [sdf] Attached SCSI disk

and

Mar 19 00:25:16 rancher kernel: [   15.724007] sd 8:0:6:0: Attached scsi generic sg10 type 0
Mar 19 00:25:16 rancher kernel: [   15.728394] sd 8:0:6:0: [sdk] 488397168 512-byte logical blocks: (250 GB/233 GiB)
Mar 19 00:25:16 rancher kernel: [   15.729659] sd 8:0:6:0: [sdk] Write Protect is off
Mar 19 00:25:16 rancher kernel: [   15.729661] sd 8:0:6:0: [sdk] Mode Sense: 7f 00 10 08
Mar 19 00:25:16 rancher kernel: [   15.730198] sd 8:0:6:0: [sdk] Write cache: enabled, read cache: enabled, supports DPO and FUA
Mar 19 00:25:16 rancher kernel: [   15.744177] sd 8:0:6:0: [sdk] Attached SCSI disk

(where I removed the 10 other, irrelevant drives), after which there is nothing for 45 seconds, then:

Mar 19 00:25:16 rancher kernel: [   59.467272] Initializing XFRM netlink socket
Mar 19 00:25:16 rancher kernel: [   59.478280] Netfilter messages via NETLINK v0.30.
Mar 19 00:25:16 rancher kernel: [   59.480995] ctnetlink v0.93: registering with nfnetlink.
Mar 19 00:25:16 rancher kernel: [   59.689144] IPv6: ADDRCONF(NETDEV_UP): docker-sys: link is not ready
Mar 19 00:25:16 rancher acpid: starting up with netlink and the input layer
Mar 19 00:25:16 rancher acpid: 2 rules loaded
Mar 19 00:25:16 rancher acpid: waiting for events: event logging is off
Mar 19 00:25:17 rancher dhcpcd[1609]: dhcpcd-6.11.5 starting
Mar 19 00:25:17 rancher dhcpcd[1610]: sending commands to master dhcpcd process
Mar 19 00:25:17 rancher dhcpcd[1610]: send OK

This is while running in debug mode; nothing for mdadm except the exit status, as far as I can tell. Then, immediately after I get access to the console:

$ sudo mdadm --assemble --scan
mdadm: /dev/md/1 has been started with 2 drives.

$ cat /proc/mdstat 
Personalities : [raid1] 
md1 : active raid1 sdf1[0] sdk1[1]
      244066432 blocks super 1.2 [2/2] [UU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

unused devices: <none>
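
Next time I will also dump the member superblocks before assembling, e.g.:

$ sudo mdadm --examine /dev/sdf1 /dev/sdk1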

kingsd041 commented 5 years ago

@filoozom I didn't find any key information in the logs, so I tried to reproduce the issue following your steps. I used sda1 as the boot partition, and md127 is labeled RANCHER_STATE. I could not reproduce the problem; I'm not sure if I missed any key information:

[root@rancher ~]# ros -v
version v1.3.0 from os image rancher/os:v1.3.0

[root@rancher ~]# cat /proc/cmdline
BOOT_IMAGE=../vmlinuz-4.9.80-rancher printk.devkmsg=on rancher.state.dev=LABEL=RANCHER_STATE rancher.state.wait panic=10 console=tty0 rancher.password=rancher rancher.state.mdadm_scan  initrd=../initrd-v1.3.0

[root@rancher ~]# blkid
/dev/sdb2: UUID="5916e90a-fe34-30a7-bb5a-b8323625524a" UUID_SUB="bfa16880-866a-b91a-bca8-d423eb88bd3f" LABEL="rancher:1" TYPE="linux_raid_member" PARTUUID="0ddf97d8-02"
/dev/sr0: UUID="2018-03-28-03-53-35-00" LABEL="RancherOS" TYPE="iso9660" PTUUID="4e28b21f" PTTYPE="dos"
/dev/sda1: LABEL="RANCHER_BOOT" UUID="1a82eae2-eddc-44b5-bec2-d4d1539fa348" TYPE="ext4" PARTUUID="be2a3c54-01"
/dev/sda2: UUID="5916e90a-fe34-30a7-bb5a-b8323625524a" UUID_SUB="316858f7-fad3-446d-ecf6-83cdae00f581" LABEL="rancher:1" TYPE="linux_raid_member" PARTUUID="be2a3c54-02"
/dev/md127: LABEL="RANCHER_STATE" UUID="7817994a-1356-481c-8d6e-21e3b24e2572" TYPE="ext4"

[root@rancher ~]# fdisk -l
Disk /dev/sdb: 8 GiB, 8589934592 bytes, 16777216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x0ddf97d8

Device     Boot Start     End Sectors Size Id Type
/dev/sdb2        2048 6293503 6291456   3G 83 Linux

Disk /dev/sda: 8 GiB, 8589934592 bytes, 16777216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xbe2a3c54

Device     Boot   Start     End Sectors Size Id Type
/dev/sda1  *       2048 2099199 2097152   1G 83 Linux
/dev/sda2       2099200 8390655 6291456   3G 83 Linux

Disk /dev/md127: 3 GiB, 3221159936 bytes, 6291328 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

[root@rancher ~]# df -h
Filesystem                Size      Used Available Use% Mounted on
overlay                   2.9G    415.3M      2.3G  15% /
tmpfs                   970.9M         0    970.9M   0% /dev
tmpfs                  1001.0M         0   1001.0M   0% /sys/fs/cgroup
/dev/md127                2.9G    415.3M      2.3G  15% /opt
/dev/md127                2.9G    415.3M      2.3G  15% /mnt
/dev/md127                2.9G    415.3M      2.3G  15% /media
/dev/md127                2.9G    415.3M      2.3G  15% /home
none                   1001.0M    788.0K   1000.2M   0% /run
/dev/md127                2.9G    415.3M      2.3G  15% /etc/resolv.conf
none                   1001.0M    788.0K   1000.2M   0% /var/run
/dev/md127                2.9G    415.3M      2.3G  15% /usr/lib/firmware
/dev/md127                2.9G    415.3M      2.3G  15% /var/log
/dev/md127                2.9G    415.3M      2.3G  15% /usr/lib/modules
/dev/md127                2.9G    415.3M      2.3G  15% /etc/docker
/dev/md127                2.9G    415.3M      2.3G  15% /usr/sbin/iptables
/dev/md127                2.9G    415.3M      2.3G  15% /etc/logrotate.d
devtmpfs                970.9M         0    970.9M   0% /host/dev
shm                      64.0M         0     64.0M   0% /host/dev/shm
/dev/md127                2.9G    415.3M      2.3G  15% /etc/selinux
/dev/md127                2.9G    415.3M      2.3G  15% /etc/hosts
/dev/md127                2.9G    415.3M      2.3G  15% /etc/hostname
shm                      64.0M         0     64.0M   0% /dev/shm
/dev/md127                2.9G    415.3M      2.3G  15% /var/lib/rancher
/dev/md127                2.9G    415.3M      2.3G  15% /usr/bin/ros
/dev/md127                2.9G    415.3M      2.3G  15% /usr/bin/system-docker
/dev/md127                2.9G    415.3M      2.3G  15% /usr/share/ros
/dev/md127                2.9G    415.3M      2.3G  15% /var/lib/kubelet
/dev/md127                2.9G    415.3M      2.3G  15% /var/lib/docker
/dev/md127                2.9G    415.3M      2.3G  15% /usr/bin/system-docker-runc
/dev/md127                2.9G    415.3M      2.3G  15% /var/lib/rancher/cache
/dev/md127                2.9G    415.3M      2.3G  15% /var/lib/rancher/conf
/dev/md127                2.9G    415.3M      2.3G  15% /etc/ssl/certs/ca-certificates.crt.rancher
devtmpfs                970.9M         0    970.9M   0% /dev
shm                      64.0M         0     64.0M   0% /dev/shm
/dev/md127                2.9G    415.3M      2.3G  15% /var/lib/docker/plugins
/dev/md127                2.9G    415.3M      2.3G  15% /var/lib/docker/overlay2

Upgrade from v1.3.0 to v1.5.1:

[root@rancher ~]# ros os upgrade --upgrade-console -i rancher/os:v1.5.1
Upgrading to rancher/os:v1.5.1
Continue [y/N]: y
Pulling os-upgrade (rancher/os:v1.5.1)...
v1.5.1: Pulling from rancher/os
6c40cc604d8e: Pull complete
805953092b62: Pull complete
ec15c1060150: Pull complete
a1e83515706e: Pull complete
161392459fc0: Pull complete
1e50067d8472: Pull complete
f8779bd433a6: Pull complete
cd088b01d832: Pull complete
538d3376e53a: Pull complete
8f262ad165b3: Pull complete
Digest: sha256:2a3220b2493d683b353dc68505508c21b471a4bc818c905dafbb7a11aa54b1e7
Status: Downloaded newer image for rancher/os:v1.5.1
os-upgrade_1 | Installing from :v1.5.1
Continue with reboot [y/N]: y
INFO[0008] Rebooting
INFO[0008] Setting reboot timeout to 60 (rancher.shutdown_timeout set to 60)
[            ] reboot:info: Setting reboot timeout to 60 (rancher.shutdown_timeout set to 60)
[            ] reboot:info: Stopping /docker : 34954e860683
[            ] reboot:info: Stopping /ntp : fa094662401b
[            ] reboot:info: Stopping /network : a6650affbda4
[            ] reboot:info: Stopping /udev : d01af77c7b11
[            ] reboot:info: Stopping /system-cron : 02017bfbe6b0
[            ] reboot:info: Stopping /syslog : 25fa536a5441
[            ] reboot:info: Stopping /acpid : cf4518a80b29
[            ] reboot:info: Console Stopping [/console] : b54756727e8a
Connection to 192.168.99.121 closed by remote host.
Connection to 192.168.99.121 closed.

Waiting for the restart to succeed, then reconnecting to RancherOS:

[root@rancher ~]# ros -v
version v1.5.1 from os image rancher/os:v1.5.1

[root@rancher ~]# cat /proc/mdstat
Personalities : [raid1]
md127 : active raid1 sda2[0] sdb2[1]
      3145664 blocks super 1.0 [2/2] [UU]

unused devices: <none>

[root@rancher ~]# blkid
/dev/sda1: LABEL="RANCHER_BOOT" UUID="1a82eae2-eddc-44b5-bec2-d4d1539fa348" TYPE="ext4" PARTUUID="be2a3c54-01"
/dev/sda2: UUID="5916e90a-fe34-30a7-bb5a-b8323625524a" UUID_SUB="316858f7-fad3-446d-ecf6-83cdae00f581" LABEL="rancher:1" TYPE="linux_raid_member" PARTUUID="be2a3c54-02"
/dev/sr0: UUID="2018-03-28-03-53-35-00" LABEL="RancherOS" TYPE="iso9660" PTUUID="4e28b21f" PTTYPE="dos"
/dev/sdb2: UUID="5916e90a-fe34-30a7-bb5a-b8323625524a" UUID_SUB="bfa16880-866a-b91a-bca8-d423eb88bd3f" LABEL="rancher:1" TYPE="linux_raid_member" PARTUUID="0ddf97d8-02"
/dev/md127: LABEL="RANCHER_STATE" UUID="7817994a-1356-481c-8d6e-21e3b24e2572" TYPE="ext4"

[root@rancher ~]# fdisk -l
Disk /dev/sda: 8 GiB, 8589934592 bytes, 16777216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xbe2a3c54

Device     Boot   Start     End Sectors Size Id Type
/dev/sda1  *       2048 2099199 2097152   1G 83 Linux
/dev/sda2       2099200 8390655 6291456   3G 83 Linux

Disk /dev/sdb: 8 GiB, 8589934592 bytes, 16777216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x0ddf97d8

Device     Boot Start     End Sectors Size Id Type
/dev/sdb2        2048 6293503 6291456   3G 83 Linux

Disk /dev/md127: 3 GiB, 3221159936 bytes, 6291328 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
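
For anyone trying to reproduce this, an equivalent array can be created with something like the following (a sketch; the metadata version is inferred from the mdstat output above, and device names will differ):

[root@rancher ~]# mdadm --create /dev/md127 --level=1 --raid-devices=2 --metadata=1.0 /dev/sda2 /dev/sdb2   # metadata 1.0 per "super 1.0" above
[root@rancher ~]# mkfs.ext4 -L RANCHER_STATE /dev/md127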

filoozom commented 5 years ago

@kingsd041 Might this be related to the fact that I didn't use --upgrade-console? I was using the Alpine console, though I don't know which version.

Also, are you using grub or syslinux in your example? I had to switch from grub to syslinux manually, as it would otherwise still boot back into v1.3.0.

What information would be helpful to you? I think I basically have a clean v1.5.1 copy now, as it doesn't load any cloud-init data because it can't assemble the md RAID for some obscure reason. I couldn't find anything interesting in the logs either. I guess I should try a clean installation, or go back to v1.3.0?

kingsd041 commented 5 years ago

@filoozom
--upgrade-console just upgrades the console to the latest version; it has nothing to do with this issue.

In the example, I used syslinux. In v1.3.0, syslinux is already used by default, so I am curious why your RancherOS uses grub.

If possible, I'd recommend reinstalling v1.5.1 to verify this issue.

filoozom commented 5 years ago

I'm guessing that v1.3.0 was already an upgrade from an older version that used grub? I honestly can't remember; those servers have been sitting there for some time now.

I think it was related to the fact that the installer can't (or couldn't) install on RAID, so I did something along the lines of https://medium.com/@sthulb/rancher-os-raid-fc1128385de6.

I'll try a clean install as soon as I can!

kingsd041 commented 5 years ago

If it's syslinux, you should also refer to https://forums.rancher.com/t/installation-rancher-os-no-grub-install/10225

filoozom commented 5 years ago

It's been a while since I've had time to look at this again, but today I tried with v1.5.2. It's basically the same setup:

[root@rancher ~]# ros -v
version v1.5.2 from os image rancher/os:v1.5.2

[root@rancher ~]# cat /proc/cmdline 
BOOT_IMAGE=../vmlinuz-4.14.122-rancher rancher.state.mdadm_scan rancher.state.dev=LABEL=RANCHER_STATE rancher.state.wait console=tty0 initrd=../initrd-v1.5.2 rancher.autologin=tty1 rancher.autologin=ttyS0

[root@rancher ~]# blkid
/dev/sdf1: UUID="af61fda8-58a9-c40b-f337-7b5f25b5a2b5" UUID_SUB="92f1412c-a600-04aa-73da-465fca6f3b5d" LABEL="rancher:1" TYPE="linux_raid_member" PARTLABEL="RANCHER_STATE" PARTUUID="3950e825-85a1-4acb-b14e-5fb7ed229992"
/dev/sdm1: LABEL="RANCHER_BOOT" UUID="95cdf244-8191-4154-880c-0274f06e5b98" TYPE="ext4" PARTLABEL="RANCHER_BOOT" PARTUUID="37c8ce3b-eb3c-480c-876e-39ee93fd05ed"
/dev/sdk1: UUID="af61fda8-58a9-c40b-f337-7b5f25b5a2b5" UUID_SUB="955b1ccb-6a7d-5902-643b-b19291ac6c28" LABEL="rancher:1" TYPE="linux_raid_member" PARTLABEL="RANCHER_STATE" PARTUUID="e7321419-6980-4083-8e77-99b0c97a933a"
/dev/md1: LABEL="RANCHER_STATE" UUID="5ba5f121-20d5-450e-af3c-85ca754598eb" TYPE="ext4"

[root@rancher ~]# fdisk -l
Disk /dev/sdf: 232.9 GiB, 250059350016 bytes, 488397168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 3F7B64FB-BB8B-F74C-89C3-E58DF4B0F13F

Device     Start       End   Sectors   Size Type
/dev/sdf1   2048 488397134 488395087 232.9G Linux filesystem

Disk /dev/sdm: 7.5 GiB, 8011120640 bytes, 15646720 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 34985674-D305-42FA-B6A1-4C8F68AF8D6C

Device     Start      End  Sectors  Size Type
/dev/sdm1   2048 15646686 15644639  7.5G Linux filesystem

Disk /dev/sdk: 232.9 GiB, 250059350016 bytes, 488397168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 3F7B64FB-BB8B-F74C-89C3-E58DF4B0F13F

Device     Start       End   Sectors   Size Type
/dev/sdk1   2048 488397134 488395087 232.9G Linux filesystem

Disk /dev/md1: 232.8 GiB, 249924026368 bytes, 488132864 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x20ac7dda

Device     Boot      Start        End    Sectors   Size Id Type
/dev/md1p1      3224498923 3657370551  432871629 206.4G  7 HPFS/NTFS/exFAT
/dev/md1p2      3272020941 5225480974 1953460034 931.5G 16 Hidden FAT16
/dev/md1p3               0          0          0     0B 6f unknown
/dev/md1p4        50200576  974536369  924335794 440.8G  0 Empty

Partition table entries are not in disk order.

[root@rancher ~]# cat /proc/mdstat
Personalities :
unused devices: <none>

Nothing in /proc/mdstat. I forgot to check fdisk -l and blkid before mdadm --assemble --scan, but I assume that's why /dev/md1 appears in the output above. Once I run mdadm --assemble --scan:

[root@rancher ~]# cat /proc/mdstat 
Personalities : [raid1] 
md1 : active raid1 sdf1[0] sdk1[1]
      244066432 blocks super 1.2 [2/2] [UU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

unused devices: <none>

As for logs:

[root@rancher ~]# cat /var/log/syslog | grep mdadm
Jun 22 21:41:45 rancher kernel: [    0.000000] Command line: BOOT_IMAGE=../vmlinuz-4.14.122-rancher rancher.state.mdadm_scan rancher.state.dev=LABEL=RANCHER_STATE rancher.state.wait console=tty0 initrd=../initrd-v1.5.2 rancher.autologin=tty1 rancher.autologin=ttyS0
Jun 22 21:41:45 rancher kernel: [    0.000000] Kernel command line: BOOT_IMAGE=../vmlinuz-4.14.122-rancher rancher.state.mdadm_scan rancher.state.dev=LABEL=RANCHER_STATE rancher.state.wait console=tty0 initrd=../initrd-v1.5.2 rancher.autologin=tty1 rancher.autologin=ttyS0
Jun 22 21:42:42 rancher sudo:  rancher : TTY=tty1 ; PWD=/home/rancher ; USER=root ; COMMAND=/sbin/mdadm --assemble --scan
[root@rancher ~]# cat /var/log/messages | grep mdadm
Jun 22 21:41:45 rancher kernel: [    0.000000] Command line: BOOT_IMAGE=../vmlinuz-4.14.122-rancher rancher.state.mdadm_scan rancher.state.dev=LABEL=RANCHER_STATE rancher.state.wait console=tty0 initrd=../initrd-v1.5.2 rancher.autologin=tty1 rancher.autologin=ttyS0
Jun 22 21:41:45 rancher kernel: [    0.000000] Kernel command line: BOOT_IMAGE=../vmlinuz-4.14.122-rancher rancher.state.mdadm_scan rancher.state.dev=LABEL=RANCHER_STATE rancher.state.wait console=tty0 initrd=../initrd-v1.5.2 rancher.autologin=tty1 rancher.autologin=ttyS0

And in /var/log/boot/ros-bootstrap.log, still the same exit status 1 error:

time="2019-06-22T22:08:38Z" level=debug msg="START: [ros-bootstrap] in /" 
time="2019-06-22T22:08:38Z" level=debug msg=bootstrapAction 
time="2019-06-22T22:08:40Z" level=debug msg="bootstrapAction: loadingConfig" 
time="2019-06-22T22:08:40Z" level=debug msg="bootstrapAction: Rngd(true)" 
time="2019-06-22T22:08:40Z" level=debug msg="bootstrapAction: MdadmScan(true)" 
time="2019-06-22T22:08:40Z" level=error msg="Failed to run mdadm scan: exit status 1" 
time="2019-06-22T22:08:40Z" level=debug msg="bootstrapAction: cryptsetup(false)" 
time="2019-06-22T22:08:40Z" level=debug msg="bootstrapAction: LvmScan(false)" 
time="2019-06-22T22:08:40Z" level=debug msg="bootstrapAction: stateScript()" 
time="2019-06-22T22:08:40Z" level=debug msg="bootstrapAction: RunCommandSequence([])" 
time="2019-06-22T22:09:18Z" level=debug msg="bootstrapAction: udev settle2"

So I guess it's not related to the upgrade process but to the newer versions, although it 100% worked previously. Am I missing something? Could it be that RancherOS doesn't wait long enough for the drives to be picked up before running mdadm, given that I've got a bunch of drives? No hardware or BIOS configuration has changed between those versions, so I really have no clue what's going on here.
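
If it is a timing issue, waiting for udev to settle and then retrying the scan should confirm it; something like this, run as early as possible (hypothetical, with guessed timeouts):

$ sudo udevadm settle --timeout=30
$ for i in $(seq 1 30); do sudo mdadm --assemble --scan && break; sleep 2; done   # retry count and delay are guesses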

Also, manually running rancher/os-bootstrap:v1.5.2 works too... I'm at a loss on this one. Could it be that it starts too soon for some reason?

[root@rancher ~]# docker run --net none --privileged -v /dev:/host/dev -v /lib/modules:/lib/modules -v /lib/firmware:/lib/firmware -v /usr/bin/ros:/usr/bin/ros:ro -v /usr/bin/ros:/usr/bin/ros-bootstrap:ro -v /usr/share/ros:/usr/share/ros:ro -v /var/lib/rancher:/var/lib/rancher:ro -v /var/log:/var/log rancher/os-bootstrap:v1.5.2 ros-bootstrap
Unable to find image 'rancher/os-bootstrap:v1.5.2' locally
v1.5.2: Pulling from rancher/os-bootstrap
d0c25b2fa8e7: Pull complete 
89831ee9b353: Pull complete 
878fe0c82308: Pull complete 
c0a0dd48df33: Pull complete 
Digest: sha256:61611f7157226a07f71586f5369233657ced0031ed54d889d782e7166b132f9e
Status: Downloaded newer image for rancher/os-bootstrap:v1.5.2
[            ] ros-bootstrap:debug: bootstrapAction: Rngd(true)
[            ] ros-bootstrap:debug: bootstrapAction: MdadmScan(true)
mdadm: /dev/md/rancher:1 has been started with 2 drives.
[            ] ros-bootstrap:debug: bootstrapAction: cryptsetup(false)
[            ] ros-bootstrap:debug: bootstrapAction: LvmScan(false)
[            ] ros-bootstrap:debug: bootstrapAction: stateScript()
[            ] ros-bootstrap:debug: bootstrapAction: RunCommandSequence([])
[            ] ros-bootstrap:debug: bootstrapAction: udev settle2

Just tried on v1.3.0 and actually it doesn't work either...