jetsonhacks / rootOnNVMe

Switch the rootfs to a NVMe SSD on the Jetson Xavier NX and Jetson AGX Xavier
MIT License
393 stars 145 forks

Cold reboot loses SSD info[BUG] #4

Open sjyi opened 4 years ago

sjyi commented 4 years ago

Describe the issue
After copying the rootfs to the SSD and enabling booting from the SSD, I reboot. It works fine as long as I only warm boot, that is, as long as I don't unplug the power. When I unplug the power and then cold boot, the system doesn't see the SSD directories. I have to warm boot (reboot) once more to see them.

What version of L4T/JetPack
L4T/JetPack version: 4.4

Which Jetson
Jetson: Xavier NX

To Reproduce
Steps to reproduce the behavior:

Set up the SSD. Clone the rootOnNVMe repository, copy the rootfs with ./copy-rootfs-ssd.sh, then enable booting from the SSD with ./setup-service.sh.

Reboot to enable SSD.

Now you have access to SSD.

At this point, you can shut down and unplug the power, or simply unplug the power.

When the power is reconnected, the SSD directories are not seen. You have to reboot again to see them.
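One way to confirm which device ended up as the root filesystem after each boot is to read the mounts table. The helper below is a minimal sketch, not part of rootOnNVMe: the function name `root_source` is hypothetical, and it assumes the mounts table lists the real device node (on some systems the root entry may show up as `/dev/root` instead).

```shell
#!/bin/sh
# Hypothetical helper: print the source device of the last filesystem
# mounted at "/" in a mounts table (defaults to /proc/mounts).
root_source() {
    awk '$2 == "/" { src = $1 } END { if (src != "") print src }' "${1:-/proc/mounts}"
}

# Usage (on the Jetson, after a warm boot and again after a cold boot):
#   root_source
# If the switch to the SSD worked, this would be expected to print
# /dev/nvme0n1p1; if it prints /dev/mmcblk0p1, the system is still on eMMC.
```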

smyeungx commented 3 years ago

Dear Contributors,

Thanks for creating such a wonderful package. We also encounter this cold boot issue on both the NX and the AGX. We modified setssdroot.sh slightly to show when systemd starts the setssdroot.sh script.

NORMAL STARTUP or REBOOT
During a normal startup or reboot, the device /dev/nvme0n1p1 appears very soon after the nvme driver enables the device:

    [ 2.491446] pcie_pme 0005:00:00.0:pcie001: service driver pcie_pme loaded
    [ 2.491510] aer 0005:00:00.0:pcie002: service driver aer loaded
    [ 2.491950] nvme nvme0: pci function 0005:01:00.0
    [ 2.492023] nvme 0005:01:00.0: enabling device (0000 -> 0002)
    [ 2.501732] tegra-cbb 14040000.cv-noc: noc_secure_irq = 89, noc_nonsecureirq = 88>
    [ 2.506497] tegra194-isp5 14800000.isp: initialized
    [ 2.514111] tegra194-vi5 15c10000.vi: using default number of vi channels, 36
    [ 2.518419] tegra194-vi5 15c10000.vi: initialized
    [ 2.522866] tegra194-vi5 15c10000.vi: subdev 15a00000.nvcsi--2 bound
    [ 2.522944] tegra194-vi5 15c10000.vi: subdev 15a00000.nvcsi--1 bound
    [ 2.523609] tegra186-cam-rtcpu bc00000.rtcpu: Trace buffer configured at IOVA=0xbff00000
    [ 2.601813] nvme0n1: p1 p2
    [ 2.606426] tegra-ivc ivc-bc00000.rtcpu: region 0: iova=0xbfee0000-0xbfefffff size=131072
    [ 2.607071] tegra-ivc ivc-bc00000.rtcpu:echo@0: echo: ver=0 grp=1 RX[16x64]=0x1000-0x1480 TX[16x64]=0x1480-0x1900

After the device is detected, systemd launches setssdroot.service, which invokes setssdroot.sh once the requirement ConditionPathExists=/dev/nvme0n1p1 in the service file is fulfilled. systemd appears to run the service quite early, as soon as the EXT4-fs is remounted, as expected (and as indicated in the service file):

    [ 3.464040] EXT4-fs (mmcblk0p1): re-mounted. Opts: (null)
    ....
    [ 3.895975] setssdroot: remount rootfs to nvme0n1p1 <-- added logging code to dmesg
    [ 3.980035] [EXT4 FS bs=4096, gc=3249, bpg=32768, ipg=8192, mo=e882c818, mo2=0002]
    [ 3.992243] EXT4-fs (nvme0n1p1): recovery complete
    [ 4.019774] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: debug,errors=continue,discard
    [ 4.060660] setssdroot: exit remount rootfs <-- added logging code to dmesg

COLD REBOOT
During a cold boot of L4T on the Jetson AGX, however, for some reason (such as file system recovery on a partition that was not cleanly unmounted after a failure or an accidental power-off), the nvme0n1p1 partition is usually detected only at a relatively late stage after the nvme driver enables the device:

    [ 2.511762] nvme nvme0: pci function 0005:01:00.0
    [ 2.512342] nvme 0005:01:00.0: enabling device (0000 -> 0002)
    ...
    [ 4.008445] hid-generic 0003:17EF:60EE.0005: hidraw4: USB HID v1.11 Device [Lenovo TrackPoint Keyboard II] on usb-3610000.xhci-2.4.4.2/input2
    [ 4.450262] nvme0n1: p1 p2
    [ 4.685762] random: crng init done

Therefore, the device /dev/nvme0n1p1 appears only after systemd has evaluated setssdroot.service, so the requirement ConditionPathExists=/dev/nvme0n1p1 is not met and the service is never executed in this case.
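The ordering described here can be checked against a saved kernel log. The following is a minimal sketch under stated assumptions: the function names `first_ts` and `partition_ready_in_time` are hypothetical, and it treats the `EXT4-fs (mmcblk0p1): re-mounted` line as a rough marker for the point at which systemd would evaluate the service, which is an approximation rather than something systemd reports.

```shell
#!/bin/sh
# Hypothetical helper: print the timestamp (in seconds) of the first log
# line on stdin that contains the fixed string given as $1.
first_ts() {
    awk -v pat="$1" '
        index($0, pat) && match($0, /[0-9]+\.[0-9]+/) {
            print substr($0, RSTART, RLENGTH)
            exit
        }'
}

# Hypothetical helper: exit 0 if the NVMe partition table was logged
# before the eMMC rootfs remount, 1 if it was logged after, and 2 if
# either line is missing from the log on stdin.
partition_ready_in_time() {
    log=$(cat)
    nvme=$(printf '%s\n' "$log" | first_ts 'nvme0n1: p1')
    remount=$(printf '%s\n' "$log" | first_ts 'EXT4-fs (mmcblk0p1): re-mounted')
    [ -n "$nvme" ] && [ -n "$remount" ] || return 2
    awk -v a="$nvme" -v b="$remount" 'BEGIN { exit !(a < b) }'
}

# Usage:
#   dmesg | partition_ready_in_time && echo "partition seen before remount"
```

With the timestamps quoted in this thread, the normal boot (partition at 2.601813, remount at 3.464040) passes the check, while a cold boot in which the partition shows up after the remount does not.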

PROPOSED SOLUTION
We tried the following modification to setssdroot.service, so that the service is started only after the device /dev/nvme0n1p1 appears:

    [Unit]
    Description=Change rootfs to SSD in M.2 key M slot (nvme0n1p1)
    DefaultDependencies=no
    Conflicts=shutdown.target
    # systemctl list-units --type=mount
    After=systemd-remount-fs.service dev-nvme0n1p1.device
    Before=local-fs-pre.target local-fs.target shutdown.target
    Wants=local-fs-pre.target dev-nvme0n1p1.device
    ConditionPathExists=/dev/nvme0n1p1
    ConditionPathExists=/etc/setssdroot.conf
    ConditionVirtualization=!container

    [Service]
    Type=oneshot
    RemainAfterExit=yes
    ExecStart=/sbin/setssdroot.sh

    [Install]
    WantedBy=default.target
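A quick way to confirm that an installed copy of the unit file actually carries the device dependency is to look for the device unit on both the After= and Wants= lines. This is a minimal sketch: the function name `unit_waits_for_device` is hypothetical, the unit file path in the usage comment is an assumption about where the file is installed, and the check is a plain text match rather than a query of systemd's loaded state.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if both the After= and the Wants=
# lines of the unit file ($1) mention the device unit ($2).
unit_waits_for_device() {
    unit_file=$1
    device_unit=$2
    grep -E '^After=' "$unit_file" | grep -qF -- "$device_unit" &&
        grep -E '^Wants=' "$unit_file" | grep -qF -- "$device_unit"
}

# Usage (the path is an assumption, adjust it to the actual location):
#   unit_waits_for_device /etc/systemd/system/setssdroot.service dev-nvme0n1p1.device \
#       && echo "unit is ordered after the NVMe partition"
```

The device unit name follows systemd's path escaping, so `systemd-escape -p --suffix=device /dev/nvme0n1p1` should yield `dev-nvme0n1p1.device`.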

We also modified EXT4_OPT in setssdroot.sh to use errors=continue:

    #!/bin/sh
    # Runs at startup, switches rootfs to the SSD on nvme0 (M.2 Key M slot)
    NVME_DRIVE="/dev/nvme0n1p1"
    CHROOT_PATH="/nvmeroot"

    INITBIN=/lib/systemd/systemd
    EXT4_OPT="-o defaults -o debug -o errors=continue -o discard"

    echo "setssdroot: mount and switch rootfs to nvme0n1p1" | tee /dev/kmsg

    modprobe ext4

    mkdir -p ${CHROOT_PATH}
    mount -t ext4 ${EXT4_OPT} ${NVME_DRIVE} ${CHROOT_PATH}

    cd ${CHROOT_PATH}
    /bin/systemctl --no-block switch-root ${CHROOT_PATH}

    echo "setssdroot: exit mount and switch rootfs" | tee /dev/kmsg
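For comparison, a different way to handle a late-appearing partition is to poll for the device node inside the script before mounting. This is not what the comment above proposes (it relies on systemd's device dependency instead); it is only a sketch of the alternative. The function name `wait_for_path` and the 10 second default are assumptions, and it would only help if the script is started at all, which the ConditionPathExists=/dev/nvme0n1p1 check can prevent.

```shell
#!/bin/sh
# Hypothetical helper: wait until a path exists, polling once per second.
# $1 is the path, $2 the timeout in seconds (default 10, an arbitrary choice).
# Returns 0 once the path exists, 1 if the timeout is reached first.
wait_for_path() {
    path=$1
    timeout=${2:-10}
    elapsed=0
    while [ ! -e "$path" ]; do
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Usage inside a boot script, before the mount step:
#   wait_for_path /dev/nvme0n1p1 10 || exit 1
```

Polling adds its own delay and a hard-coded timeout, which is one reason ordering the unit after the device unit may be the cleaner choice.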

The above approach may delay the boot process by 1-2 s during file system recovery, but we cold booted more than 20 times and it works nicely on both the NX and the AGX. Please check whether this approach helps resolve the issue.

    $ dmesg | grep -E 'setssd|EXT4-fs|rootfs|nvme'
    [ 0.973739] Trying to unpack rootfs image as initramfs...
    [ 2.513857] nvme nvme0: pci function 0005:01:00.0
    [ 2.513983] nvme 0005:01:00.0: enabling device (0000 -> 0002)
    [ 2.786245] EXT4-fs (mmcblk0p1): recovery complete
    [ 2.786966] EXT4-fs (mmcblk0p1): mounted filesystem with ordered data mode. Opts: (null)
    [ 2.810175] Switching from initrd to actual rootfs
    [ 3.475874] EXT4-fs (mmcblk0p1): re-mounted. Opts: (null)
    [ 4.012793] nvme0n1: p1 p2
    [ 4.841735] setssdroot: remount rootfs to nvme0n1p1
    [ 5.064657] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039474
    [ 5.064881] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17040079
    [ 5.064943] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17040131
    [ 5.065021] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17040106
    [ 5.065143] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039479
    [ 5.065241] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17040092
    [ 5.065398] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039565
    [ 5.065444] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039530
    [ 5.065491] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17040073
    [ 5.065540] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039621
    [ 5.065584] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039772
    [ 5.065652] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039488
    [ 5.065715] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039492
    [ 5.065776] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039527
    [ 5.065821] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 18087972
    [ 5.065886] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039505
    [ 5.065933] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039503
    [ 5.065982] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039504
    [ 5.066024] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039480
    [ 5.066065] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039433
    [ 5.066107] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17040067
    [ 5.066142] EXT4-fs (nvme0n1p1): 21 orphan inodes deleted
    [ 5.066144] EXT4-fs (nvme0n1p1): recovery complete
    [ 5.108166] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: debug,errors=continue,discard
    [ 5.127778] setssdroot: exit remount rootfs
    [ 15.280425] EXT4-fs (mmcblk0p1): mounted filesystem with ordered data mode. Opts: (null)

Best, Simon

Redox15 commented 1 year ago

I had to make this change to the script in order to boot from the SSD. In my case, though, it never boots from the SSD without the change. However, I ran into a new issue (#28). Does your system have CUDA installed?