openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Very slow GRUB #8049

Closed kanavis closed 5 years ago

kanavis commented 5 years ago

System information

Type                  Version/Name
Distribution Name     Debian
Distribution Version  Stretch 9.5
Linux Kernel          4.9.0-8-amd64
Architecture          amd64
ZFS Version           0.6.5.9-5
SPL Version           0.6.5.9-1

Describe the problem you're observing

I'm using grub-pc 2.02~beta3-5 (BIOS boot) with 10 drives in a RAID10 ZFS root. After the BIOS probes the drives, there are about two minutes of black screen before the GRUB menu appears. The same issue exists on another server with older ZFS modules and an older GRUB version, but that one has only 4 drives and the delay is about 30 seconds.

Describe how to reproduce the problem

Set up a ZFS root on a large RAID array with BIOS GRUB. UEFI wasn't tested yet. UPD: same results with UEFI GRUB.

bunder2015 commented 5 years ago

I was able to load this page by chance as github is currently experiencing issues. If you see this message and can't see your bug posting, I assume it will appear when github repairs their backend.

djkazic commented 5 years ago

Have you set the grub options to not be splash? Do you have a console dump to see what's holding up the boot? It could be something systemd related.

kanavis commented 5 years ago

> Have you set the grub options to not be splash? Do you have a console dump to see what's holding up the boot? It could be something systemd related.

The GRUB command line is empty. But as I understand it, the options you mention are kernel and init options, and I experience the long black screen BEFORE GRUB loads, right after the BIOS starts the boot code. If I enable debug=all in grub.cfg, I see a huge number of zfs lines: read GPT header, verify, free/malloc, free/malloc, then read GPT header again. I'll post a log of this output soon.

djkazic commented 5 years ago

Thanks, please include that -- it would be very helpful.

kanavis commented 5 years ago

Here are the GRUB debug logs: before (filtered.log) and after (filtered.after.log).

The GRUB menu appeared. After it there is also a very long delay, about 30 seconds, before the kernel starts.

kanavis commented 5 years ago

When I use the same configuration without ZFS, everything works instantly.

djkazic commented 5 years ago

What does update-grub output?

kanavis commented 5 years ago

> What does update-grub output?

Generating grub configuration file ...
Found linux image: /boot/vmlinuz-4.9.0-8-amd64
Found initrd image: /boot/initrd.img-4.9.0-8-amd64
Adding boot menu entry for EFI firmware configuration
done

djkazic commented 5 years ago

Any reason you're using the beta version of grub-pc? I think that could be the culprit.

kanavis commented 5 years ago

> Any reason you're using the beta version of grub-pc? I think that could be the culprit.

It's the version from the Debian Stretch repos: https://packages.debian.org/ru/stretch/grub-pc

kanavis commented 5 years ago

> if you're using EFI you can just pop your kernel and initrd into the ESP, get rid of grub and never have to deal with its problems again. the problem is how incredibly unlikely it is for this project to get any changes upstreamed to grub.

Never done this before but it looks like a solution. Are there any manuals for this you recommend? I'm not sure if there are any caveats regarding ZFS.

djkazic commented 5 years ago

This is good reading on the subject: https://wiki.archlinux.org/index.php/EFISTUB
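
For readers who want a starting point, here is a rough sketch of what registering an EFISTUB boot entry might look like. Everything specific in it is an assumption, not something taken from this thread: the ESP disk and partition, the root dataset name (rpool/ROOT/debian), the file names on the ESP, and the helper name efistub_entry_cmd are all hypothetical. It also assumes the kernel was built with EFI stub support and that the kernel and initrd have already been copied to the root of the ESP. The function only prints the efibootmgr command so it can be reviewed before anything is written to the firmware.

```shell
#!/bin/sh
# Sketch only: all device, path, and dataset names are placeholders.
# Assumed preparation (run as root, adjust paths to your system):
#   cp /boot/vmlinuz-4.9.0-8-amd64    /boot/efi/vmlinuz
#   cp /boot/initrd.img-4.9.0-8-amd64 /boot/efi/initrd.img

# Print (do not run) an efibootmgr command that registers the kernel
# on the ESP as a firmware boot entry.
#   $1 = disk that holds the ESP, $2 = ESP partition number,
#   $3 = ZFS root dataset passed to the initramfs via root=ZFS=...
efistub_entry_cmd() {
    disk="$1"
    part="$2"
    rootds="$3"
    # The initrd path uses a backslash because the firmware resolves it on the ESP.
    printf '%s\n' "efibootmgr --create --disk $disk --part $part --label 'Debian EFISTUB' --loader /vmlinuz --unicode 'root=ZFS=$rootds initrd=\\initrd.img'"
}
```

Calling `efistub_entry_cmd /dev/sda 1 rpool/ROOT/debian` prints the command line; check it against the linked wiki page and your own layout before running it as root. The kernel and initrd on the ESP have to be copied again after every kernel update.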

kanavis commented 5 years ago

Yep, that works great: only ~5-10 sec from starting the EFI code until the first line of kernel output.

ReimuHakurei commented 5 years ago

On my NAS (consumer-grade HW [FX-8350 / AM3+ mobo] with a 16 disk storage pool and 1 disk rpool), with ZFS 0.8.0-rc3, it takes about 20 seconds or maybe less to get past GRUB.

On another system (Dell PowerEdge R415 w/ dual Opterons), the exact same configuration takes multiple minutes to load GRUB.

GRUB here is stock Ubuntu 18.04 GRUB, and ZFS is 0.8.0-rc3.

The boot process on the R415 looks something like this:

Cursor stops blinking when GRUB is first started.

Roughly 80sec passes, GRUB displays "error: unknown device 1836281449".

Roughly 80sec passes, GRUB displays "error: unknown device 1836281449".

Roughly 80sec passes, GRUB displays "error: unknown device 1836281449".

GRUB then rapidly displays a bunch of errors about compression stuff not supported, then boots.

The same kind of messages appear on the system that boots quickly, but with only a few seconds between the "unknown device" messages. It almost seems like it's waiting on something, then giving up.

djkazic commented 5 years ago

@ReimuHakurei the "compression not supported" messages are normal, I see those too. What does your /etc/default/grub config look like on the slow booting machine?

djkazic commented 5 years ago

I'd also check your fstab to see if there's cruft there to clean out. I've had hanging issues with grub device detection due to that in the past.

ReimuHakurei commented 5 years ago

acingram@lessar:~$ cat /etc/default/grub
# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
#   info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
#GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=0
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT=""
GRUB_CMDLINE_LINUX="net.ifnames=0 biosdevname=0"

# Uncomment to enable BadRAM filtering, modify to suit your needs
# This works with Linux (no patch required) and with any kernel that obtains
# the memory map information from GRUB (GNU Mach, kernel of FreeBSD ...)
#GRUB_BADRAM="0x01234567,0xfefefefe,0x89abcdef,0xefefefef"

# Uncomment to disable graphical terminal (grub-pc only)
GRUB_TERMINAL=console

# The resolution used on graphical terminal
# note that you can use only modes which your graphic card supports via VBE
# you can see them in real GRUB with the command `vbeinfo'
#GRUB_GFXMODE=640x480

# Uncomment if you don't want GRUB to pass "root=UUID=xxx" parameter to Linux
#GRUB_DISABLE_LINUX_UUID=true

# Uncomment to disable generation of recovery mode menu entries
#GRUB_DISABLE_RECOVERY="true"

# Uncomment to get a beep at grub start
#GRUB_INIT_TUNE="480 440 1"
acingram@lessar:~$ cat /etc/fstab
# UNCONFIGURED FSTAB FOR BASE SYSTEM

fstab is empty on both systems; the fast booting one has GRUB_TIMEOUT=10 and a blank GRUB_CMDLINE_LINUX but this should make no difference.

djkazic commented 5 years ago

Are you using ZFS as a boot pool? Are you using LVM or anything?

ReimuHakurei commented 5 years ago

Yes, I am using a ZFS root.

Both systems have a similar setup. There is a single SSD as rpool, then a second pool named storage. The storage pool is encrypted.

The fast booting system has 2x 8 disk raidz2 vdevs for storage, while the slow one has 2x mirror vdevs.

On both systems, I am using the new systemd mount generator with ZFS.

Both systems are set up for BIOS boot, not EFI.

jwittlincohen commented 5 years ago

This sounds like an issue I had in the past (#6264). Fabian-Gruenbichler provided the following suggestion that resolved my issue:

"This could potentially be a result of the RESUME (as in, "resume from suspend-to-disk") support in initramfs-tools, which was added pretty late in the Stretch release cycle. Try adding a new line containing "RESUME=none" to the end of /etc/initramfs-tools/initramfs.conf , followed up by running "update-initramfs -u". See "man initramfs.conf" for details."
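
The suggestion above can be sketched as a small shell helper. This is a minimal sketch, not a tested procedure: the function name set_resume_none is made up for illustration, and it only appends the line when an identical one is not already present, so running it twice does not duplicate the setting.

```shell
#!/bin/sh
# Append RESUME=none to an initramfs-tools config file, unless that exact
# line is already there. The function name is hypothetical.
#   $1 = path to the config file, e.g. /etc/initramfs-tools/initramfs.conf
set_resume_none() {
    conf="$1"
    if grep -q '^RESUME=none$' "$conf"; then
        return 0    # already set, nothing to do
    fi
    # Leading newline guards against a file that lacks a final newline.
    printf '\nRESUME=none\n' >> "$conf"
}

# Intended use, as root (not executed here):
#   set_resume_none /etc/initramfs-tools/initramfs.conf
#   update-initramfs -u
```

If the file already contains a different RESUME= line, this sketch leaves it in place and adds the new one after it; editing the existing line by hand is the cleaner choice in that case.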

ReimuHakurei commented 5 years ago

This actually occurs well before initramfs has a chance to do anything, but did remind me of one difference:

The fast system has 12GB of RAM and a swap zvol.

The slow system has 80GB of RAM but no swap.

I'll give it a try when I get home nonetheless.

ReimuHakurei commented 5 years ago

They're both using AMD Piledriver CPUs (same die, different binning/part) with similar disk controllers and chipsets.

Anyway, RESUME=none did fix my problem. Thanks!