rockstor / rockstor-core

Linux/BTRFS based Network Attached Storage (NAS)
http://rockstor.com/docs/contribute_section.html
GNU General Public License v3.0

BTRFS Raid (2+ disk) root unable to boot. #1980

Closed: flukejones closed this issue 9 months ago

flukejones commented 5 years ago

The issue comes from installing with a btrfs root on two volumes: the install completes, but the system is unable to boot, failing with open_ctree failed and dropping to an initrd shell.

The fix is to append kernel boot parameters such as

rootflags=device=/dev/sda2,device=/dev/sdb2,rootfstype=btrfs

or

rootflags=device=/dev/disk/by-partuuid/6dc5624c-2d54-4726-b2fa-a7a988d337a4,device=/dev/disk/by-partuuid/b57f2240-fa2e-4516-9049-603d2c5029b5,rootfstype=btrfs

to the grub entry or GRUB_CMDLINE_LINUX in /etc/default/grub. This really needs to be done at install time.

(I also had to remove the existing rootflags=).

---edit--- It seems that all that is required is for btrfs device scan to be run in the initrd on boot.
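For reference, a minimal sketch of that manual recovery from the initrd emergency shell (assuming dracut's /sysroot mount point, the sda2/sdb2 members and the subvol=root layout used elsewhere in this post):

# register all btrfs member devices with the kernel, then retry the root mount
btrfs device scan
mount -o subvol=root /dev/sda2 /sysroot
# exit the emergency shell; dracut continues the boot once /sysroot is mounted
exit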

Steps to fix new install

NOTE: I have not tried this with /boot on the root partition, only on a separate partition. I imagine booting from a btrfs raid setup may not actually work. In fact, the Oracle website also recommends the partitioning below.

Partitions required

Single partitions

Preferably on a single disk, but you can put these on both disks of a two-disk system; you will then need to mirror the partition setup to keep things consistent.

1) /boot/efi, EFI type, can be small (50 MB)
2) /boot, at least 200 MB; this is the minimum size for two kernels

Raid partitions

1) a swap partition, preferably set up as raid 1 through the GUI
2) the remainder of the space can simply be selected as BTRFS type and given the / mount point

An alternative is to have / sized and separate from the remaining disk space - you seem to need only 3 GiB.
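Once the install completes (see the steps below), a quick sanity check that both members really landed in one filesystem - a sketch, assuming the installer has the target mounted at /mnt/sysimage:

# should list both partitions as devices of the same filesystem UUID
btrfs filesystem show /mnt/sysimage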

Once the install is done, don't reboot yet. Ctrl+Alt+F2 to a terminal then:

1) chroot /mnt/sysimage
2) lsblk to see which partitions are mounted - the partition at / will have a matching partition on the other disk; btrfs doesn't mount both, but uses them both internally
3) blkid will show you the matching partition of the btrfs raid - the two partitions will have the same UUID
4) blkid | grep /dev/sd[part of / ] >> /etc/default/grub
5) blkid | grep /dev/sd[matching of / ] >> /etc/default/grub (these two steps are just to make editing the grub config easier)
6) vi /etc/default/grub

You will have lines similar to

/dev/sda2: LABEL="rockstor" UUID="54dcbbf3-1fe0-4b56-befc-8b275120c872" UUID_SUB="8a191f4c-7a12-4d33-ab9c-156ad72598ec" TYPE="btrfs" PARTUUID="0c6cc0a6-d19a-4884-938a-41516ebb4f74"
/dev/sdb2: LABEL="rockstor" UUID="54dcbbf3-1fe0-4b56-befc-8b275120c872" UUID_SUB="ab092784-25cc-474a-8967-52316c790b06" TYPE="btrfs" PARTUUID="a5516130-605f-4d70-83aa-5c159380df77" 

at the end of the file for reference. Change the existing GRUB_CMDLINE_LINUX= to GRUB_CMDLINE_LINUX_DEFAULT= and add a new GRUB_CMDLINE_LINUX= below it. This line will need to reference the PARTUUIDs of the partitions. This is where your ace ViM skills will come in handy:

GRUB_CMDLINE_LINUX="rootflags=device=/dev/disk/by-partuuid/0c6cc0a6-d19a-4884-938a-41516ebb4f74,device=/dev/disk/by-partuuid/a5516130-605f-4d70-83aa-5c159380df77,subvol=root rootfstype=btrfs"

And don't forget to delete the reference lines at the end before you save.

7) grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
8) reboot
9) Enjoy a nice, fast btrfs-powered raided NAS.
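A quick way to confirm the regenerated config actually picked up the new flags (same grub.cfg path as in step 7):

# print the first line carrying the new rootflags
grep -m1 rootflags /boot/efi/EFI/centos/grub.cfg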

Better way to fix boot

As above, install, but don't reboot. Get to the console, chroot, and then:

1) rm /opt/rockstor/conf/64-btrfs.rules
2) cp /usr/lib/udev/rules.d/64-btrfs.rules /etc/udev/rules.d/
3) dracut -fv
4) reboot
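To verify the rule made it into the rebuilt initramfs, something like the following should work (lsinitrd ships with dracut; the image path is an assumption and may differ on your system):

# list the initramfs contents and look for the btrfs udev rules
lsinitrd /boot/initramfs-$(uname -r).img | grep 64-btrfs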

flukejones commented 5 years ago

Maybe it doesn't hurt to also add this to fstab, but I haven't needed it so far.
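For anyone wanting to try it, a sketch of such an fstab line, reusing the UUID, devices and subvolume from the blkid output and GRUB_CMDLINE_LINUX above (untested; the device= options pin both members at mount time):

# /etc/fstab - multi-device btrfs root with both members listed explicitly
UUID=54dcbbf3-1fe0-4b56-befc-8b275120c872  /  btrfs  defaults,subvol=root,device=/dev/sda2,device=/dev/sdb2  0 0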

phillxnet commented 5 years ago

@Luke-Nukem Nice find, and thanks for sharing your findings; however, at the level of the Rockstor code it's a little more complicated, as we don't support (read: don't know about) a system pool with more than a single member. There are likely to be deep bugs as a result of this configuration re Rockstor's presentation / management of the consequent pool (mainly the subvols/shares within). It would be great to get this sorted in time, but it is not currently a priority, especially given the mid/longer-term plans re moving to openSUSE, who incidentally (from memory) also do not support multi-disk btrfs root. A complication here is, I think, that currently our CentOS base has a separate /boot to aid with grub's shortcomings re btrfs raid (multi-dev) /, whereas openSUSE does not have this multi partition/fs kludge.

So it's all a bit chicken-and-egg currently; the short of it is that Rockstor doesn't support multi-dev root btrfs as of yet and is unlikely to until our upstream does.

Linking to the related issue by @greghensley: "Conflicting udev rules causing failure to boot with root on multi-device btrfs" #1329 re udev rule findings, which in turn links to the following forum thread re full-disk btrfs and grub issues: https://forum.rockstor.com/t/install-to-full-disk-btrfs/1170/7

Missing here is a link for my 'from memory' claim that openSUSE does not support multi-dev btrfs root. I had a look but can't find it right now. If anyone finds this link, please paste it into this issue.

Well done on digging into this, by the way, and please leave this issue open as a reference / collection point for this feature's relevant aspects, as it would be a really nice feature to support in the future. It also looks like upstream grub is gaining additional btrfs support going forward, but all of this is really in the domain of our upstream linux base for the time being. Then we can extend our currently naive treatment of our root disk partition as a special case and move it over to using our btrfs-in-partition redirect role capability (which has a current TODO'ed block bug that I'm about to work on) that is used to support btrfs in partition for all other device cases.

Thanks again for sharing your findings on this one, but for the time being I think it's best we focus on more common use cases that are still in need of attention / fixes. All in good time hopefully.

flukejones commented 5 years ago

@phillxnet I went ahead and edited the first post to contain further instructions so that others may find them easily. I think certain types of raid aren't supported by grub (like 5/6?), and I'm not 100% sure grub can boot a kernel from a btrfs managed raid either, but I suspect it can.

There are also people out there that just convert the whole disks to btrfs too, that is, /dev/sda + /dev/sdb etc, not the partitions.

So far the NAS has been pretty solid with root btrfs raid. Except for a recent reboot where a disk went AWOL for some reason and my shares didn't get mounted. It may end up being easier to use whole-disk btrfs in the end, with a swap file. We'll see.

Regarding "Conflicting udev rules causing failure to boot with root on multi-device btrfs" #1329 Yeah.... that fixed the need to put the devices in the rootflags. We need to remove that ancient rule (will do PR).

phillxnet commented 5 years ago

@Luke-Nukem Nice additions to the original post. Re "... you seem to need only 3 GiB.": I would like to point out that this deviates from what is currently a recognised 'legit' Rockstor install, i.e. from the minimum system requirements: http://rockstor.com/docs/quickstart.html#minimum-system-requirements

"8GB hard disk space for the OS" etc

A major design driver in Rockstor, as I (and I think the majority of other contributors) see it, is ease of use / simplicity. The above scenario is pretty unfriendly, but it is a nice set of info for those wanting to tinker with far more advanced installs than are recommended.

Also, for others viewing this issue and interested in multi-device root, we have the following in the docs: Mirroring Rockstor OS using Linux Raid: http://rockstor.com/docs/mdraid-mirror/boot_drive_howto.html This is another very convoluted way to achieve a raid (this time mdraid) root, which also gives swap and /boot raid capability. And with that configuration the Rockstor code does work as intended; it is not currently aware of multi-device btrfs raid.

I'm thinking now that this issue has grown beyond an actionable issue and really represents an experimental tale of stretching Rockstor's current aims. This is of course welcome, but I think it belongs more in the forum as a topic of discussion re experimental installs. Plus we have greater facilities and more eyes there. It has, however, led to the link to #1329, so it has served a development purpose.

@Luke-Nukem Please consider transferring your progressively better 'how-to' write-up to a forum post of its own. We could also quote the discussion had so far here in that thread so we have the existing context. I think it would serve as a better location for discussion and design aims / changes etc. That way we keep issues as focused as possible. We can always open a fresh issue that exactly specifies a required task or observed problem, and there is always the option to link back to a forum thread, as we have done already in this issue.

Well done on getting this far along with what is quite a challenging setup by the way.

Also, re full-disk btrfs root, you will find in my prior reference ("... which in turn links to the following forum thread re full disk btrfs and grub issues: https://forum.rockstor.com/t/install-to-full-disk-btrfs/1170/7") two linux-btrfs mailing list discussion references that address the difficulties there (grub limitations); the short answer is that they still suggested the requirement for a non-btrfs /boot.

"I think certain types of raid aren't supported by grub (like 5/6?)" Yes but that has also seen some development in upstream of late, see the following linux-btrfs mailing list thread: https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg80917.html

As always in tech, and especially in computing, we are on shifting sands, so your experimentation with this config is most welcome. But do please bear in mind that I am proposing CentOS as a legacy OS for the future of Rockstor. Plus the treatment of btrfs in openSUSE is as a first-class citizen; quite the opposite of its status in RedHat / CentOS.

Let's move this discussion / experimentation reporting to the forum and see if we can't get more voices.

Well done again on your progress here.

flukejones commented 5 years ago

> Also, for others viewing this issue and interested in multi-device root, we have the following in the docs: Mirroring Rockstor OS using Linux Raid: http://rockstor.com/docs/mdraid-mirror/boot_drive_howto.html This is another very convoluted way to achieve a raid (this time mdraid) root, which also gives swap and /boot raid capability. And with that configuration the Rockstor code does work as intended; it is not currently aware of multi-device btrfs raid.

Given that the installer allows the creation of such setups, I think you really need to support the different ways of doing it. This will also be true when the rebase on openSUSE is done, unless you don't use their installer.


I decided to get my Asustor NAS up again to see how it was laying out the disks/volumes.

(screenshots: Asustor ADM disk / volume layout, 2018-11-01)

sda         8:0    0   1.8T  0 disk  
├─sda1      8:1    0   255M  0 part  
├─sda2      8:2    0     2G  0 part  
│ └─md125   9:125  0     2G  0 raid1 /volume0
├─sda3      8:3    0     2G  0 part  
│ └─md127   9:127  0     2G  0 raid1 swap
└─sda4      8:4    0   1.8T  0 part  
  └─md126   9:126  0   1.8T  0 raid1 /volume1
sdb         8:16   0   1.8T  0 disk  
├─sdb1      8:17   0   255M  0 part  
├─sdb2      8:18   0     2G  0 part  
│ └─md125   9:125  0     2G  0 raid1 /volume0
├─sdb3      8:19   0     2G  0 part  
│ └─md127   9:127  0     2G  0 raid1 swap
└─sdb4      8:20   0   1.8T  0 part  
  └─md126   9:126  0   1.8T  0 raid1 /volume1

This is incidentally the layout I used for RockStor. It also does something similar to RockStor for the handling of shares:

/dev/md1 on /volume1 type ext4 (rw,relatime,data=ordered,jqfmt=vfsv1,usrjquota=aquota.user,grpjquota=aquota.group)
/dev/md1 on /share/home type ext4 (rw,relatime,data=ordered,jqfmt=vfsv1,usrjquota=aquota.user,grpjquota=aquota.group)
/dev/md1 on /share/Public type ext4 (rw,relatime,data=ordered,jqfmt=vfsv1,usrjquota=aquota.user,grpjquota=aquota.group)
/dev/md1 on /share/Web type ext4 (rw,relatime,data=ordered,jqfmt=vfsv1,usrjquota=aquota.user,grpjquota=aquota.group)
/dev/md1 on /share/certs type ext4 (rw,relatime,data=ordered,jqfmt=vfsv1,usrjquota=aquota.user,grpjquota=aquota.group)
/dev/md1 on /share/Media type ext4 (rw,relatime,data=ordered,jqfmt=vfsv1,usrjquota=aquota.user,grpjquota=aquota.group)

The end result is similar to RockStor, with disks, pools, and shares. But the internal handling is rather different, I think - the raid choice made during initialisation determines how it creates all partitions, which I imagine simplifies handling a lot. Speaking of initialisation, the Asustor ADM OS is interesting: it boots a minimal initrd from a USB DOM which it runs from RAM, with a lighttpd server for the GUI; it uses this to run a proprietary init program which then sets up the disks with partitions, then unpacks a full system into the partition mounted at /volume0. This thing runs super light on RAM, at around 150 MB initialised with a docker daemon running.

It seems to track the partition layout, and uses this for mirroring and pool setup. But the key thing is that it uses the same raid type for each partition. This should still be fine in Rockstor with a setup of:

sda         8:0    0   1.8T  0 disk  
├─sda1      8:1    0   6G  0 btrfs / (-m raid1 -d raid0)  
├─sda2      8:2    0     2G  0 part  
│ └─md127   9:127  0     2G  0 raid1 swap
└─sda3      8:4    0   1.8T  0 btrfs /mnt2/pools (-m raid1 -d raid0)    
sdb         8:16   0   1.8T  0 disk  
├─sdb1      8:18   0     6G  0 btrfs / (-m raid1 -d raid0)
├─sdb2      8:19   0     2G  0 part  
│ └─md127   9:127  0     2G  0 raid1 swap
└─sdb3      8:20   0   1.8T  0 btrfs /mnt2/pools (-m raid1 -d raid0)  

where Rockstor would enforce both the part1 and part2 arrangement; then any extra partitions or disks added would be formatted with mkfs.btrfs -m raid1 -d raid0 /dev/sda3 /dev/sdb3, enforced. Given that btrfs handles the raiding of disks internally when added like that (I think the metadata and data setup shown is the default), it should be well safe.
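As a sketch of that enforcement under the layout above (device names hypothetical):

# metadata mirrored, data striped across the two large partitions
mkfs.btrfs -m raid1 -d raid0 /dev/sda3 /dev/sdb3
mount /dev/sda3 /mnt2/pools

# later: grow the pool with a third disk and spread existing data onto it
btrfs device add /dev/sdc3 /mnt2/pools
btrfs balance start /mnt2/pools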

Perhaps we need the ability to present a choice at install:

1) install system to single disk
2) install system to raid, boot from single disk
3) install and boot from raid

and do automatic partitioning and formatting with a small set of inputs such as swap size.

cmurf commented 3 years ago

Specifically, these should be in the initrd:

/usr/lib/udev/rules.d/64-btrfs.rules
/usr/lib/udev/rules.d/64-btrfs-dm.rules

At least the first one, provided by systemd, needs to be present in order to make sure udev doesn't inform systemd that this file system is ready for mounting until all of its devices have shown up. If there's even a small delay, the mount fails when not all devices are ready; hence this rule.
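For context, the heart of that rule is the 'btrfs ready' udev builtin; roughly paraphrased below (abridged from memory, not the verbatim file - check /usr/lib/udev/rules.d/64-btrfs.rules on your system):

SUBSYSTEM!="block", GOTO="btrfs_end"
ACTION=="remove", GOTO="btrfs_end"
ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"

# ask the kernel whether all member devices of this filesystem are present
IMPORT{builtin}="btrfs ready $devnode"

# if not, mark the device as not yet ready for systemd to mount
ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"

LABEL="btrfs_end"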

Of course, this same rule means there's an indefinite wait in the case of device failure, and user intervention is then required to boot degraded. There's no automatic degraded mount yet for btrfs volumes; it needs simultaneous work: (a) update the initramfs to have a countdown timer, similar to mdadm's degraded assemble after 300s or whatever the timeout is, and (b) an updated udev rule, or maybe just removing it.
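For the record, today's manual intervention looks roughly like this (a sketch; the device name is assumed, and the surviving devices must still satisfy the raid profile):

# one-off, from the grub menu: append to the kernel command line
rootflags=degraded

# or, from the initrd emergency shell
mount -o degraded /dev/sda2 /sysroot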

The second rule is part of btrfs-progs and is needed for Btrfs on dm-crypt and/or LVM.

Hooverdan96 commented 9 months ago

@phillxnet I assume this one is still relevant after the OpenSUSE transition?

phillxnet commented 9 months ago

@Hooverdan96 I'll be closing this: although there is a wealth of info here, it's now old, and we are now, as you say, openSUSE based and using kiwi-ng to build our installer. Any modification re root drive setup would really have to be done upstream. So this is now out of scope for us, at least for the time being. Plus this has received no other community interest. We just don't have the contribution person-power/interest to consider this currently.