openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Disks never go to sleep #16054

[Open] multi opened 5 months ago

multi commented 5 months ago

System information

Type                  Version/Name
Distribution Name     Arch
Distribution Version  -
Kernel Version        6.7.11-hardened1-1-hardened
Architecture          x86_64
OpenZFS Version       2.2.99.r398.g39be46f43f

Describe the problem you're observing

That's a follow up from https://github.com/openzfs/zfs/issues/16050#issuecomment-2034156879

After updating the kernel to 6.7.11-hardened and zfs to 39be46f, I got a kernel null-pointer deref.

I issued a scrub command, and after that (plus a few reboots and kernel/zfs boot combos), two of the disks (mostly; sometimes all) never go to sleep.

Last 2 days

[graph: disk activity, last 2 days]

Last 7 days

[graph: disk activity, last 7 days]

Tried 6.7.9-hardened + 8f2f6cd + zfs_vdev_disk_classic=0, just to confirm whether all disks go to sleep as they should (that was working until yesterday) - and no. The same two disks (from a raidz2 of 6) never go to sleep :/
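
For reference, a sketch of how that tunable can be checked and pinned (the modprobe.d path is just the conventional location; adjust for your setup and regenerate the initramfs if zfs loads from there):

    # read the value of the currently loaded module
    cat /sys/module/zfs/parameters/zfs_vdev_disk_classic

    # pin it at module load time, e.g. in /etc/modprobe.d/zfs.conf
    options zfs zfs_vdev_disk_classic=0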

robn commented 5 months ago

@multi thanks for reporting this, greatly appreciated!

To confirm, with zfs_vdev_disk_classic=1 (or on 2.2.x) it works the way you'd expect?

Tell me more about "doesn't go to sleep". How do you normally tell whether the disk is asleep or not (actual command)? Is there some program or job that runs to spin down disks, or do they go to sleep when they're idle?
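
For instance, hdparm can query the drive's power state without waking it (the device name is just an example):

    hdparm -C /dev/sda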

Do you have any metrics showing IO to the disks during those periods? iostat -yxd 1 and zpool iostat -vl 1 are the kind of thing I'd like to see.
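
If it's easier, capturing a minute of both to files would work too (the counts and filenames here are just suggestions):

    iostat -yxd 1 60 > iostat.txt
    zpool iostat -vl 1 60 > zpool-iostat.txt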

Putting it all together, my guess is that something is holding the drive open and/or actively issuing or waiting for IO in those periods, such that the disk doesn't sleep. But I don't have much of a mental model for what might cause that, so if I can understand what you've got happening, and maybe reproduce it, I can dig deeper.
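
If something does have the device open, a sketch like this might narrow it down (device name is an example; lsof only shows processes holding the node itself, and the inflight counters show requests the block layer still has outstanding):

    # anything holding the device node open?
    lsof /dev/sde

    # reads/writes currently in flight for the device
    cat /sys/block/sde/inflight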

Thanks!

multi commented 5 months ago

@robn I thank you!

To confirm, with zfs_vdev_disk_classic=1 (or on 2.2.x) it works the way you'd expect?

No, none of the kernel/zfs/zfs_vdev_disk_classic combos works as "expected" at the moment. Until yesterday, I was running 6.7.9-hardened + 8f2f6cd with nothing in the kernel args for zfs_vdev_disk_classic, and the disks' power mode was behaving as it normally should.

Tell me more about "doesn't go to sleep". How do you normally tell whether the disk is asleep or not (actual command)?

smartctl --info --health --attributes --tolerance=verypermissive -n standby --format=brief /dev/sda

It started showing Device is in ACTIVE or IDLE mode for two of the disks, instead of, e.g., Device is in IDLE_A mode.

Also, hddtemp /dev/sd{e..f} started showing the temperature instead of drive is sleeping.
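
To watch them over time, a minimal polling sketch around that same smartctl check (the device range and interval are assumptions):

    # log each disk's reported power mode every 5 minutes
    while sleep 300; do
      for d in /dev/sd{a..f}; do
        echo "$(date -Is) $d: $(smartctl -n standby --info "$d" | grep -i mode)"
      done
    done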

Do you have any metrics showing IO to the disks during those periods? iostat -yxd 1 and zpool iostat -vl 1 are the kind of thing I'd like to see.

That's from telegraf/diskio (sde + sdf are the two disks that stay active; sd{a..d} are fine):

[graph: telegraf/diskio]

That's from telegraf/zpool_influxdb:

[graph: telegraf/zpool_influxdb]

I've tried running zpool iostat -vl 1; it shows some values only on the first interval, then everything is zeros. iostat -yxd 1 also shows zeros.

multi commented 5 months ago

sdf just fell asleep... I'll keep you informed about what's going on with the last remaining active one - sde.

robn commented 5 months ago

That looks like scrub traffic. Given that it's the start of the month, could it just be a monthly scrub task? With the crashes, maybe it restarted or got delayed a few times?

multi commented 5 months ago

I started a scrub manually yesterday (because of https://github.com/openzfs/zfs/issues/16050).

zpool status -v shows scan: scrub repaired 0B in 07:29:15 with 0 errors on Tue Apr 2 14:19:32 2024 - if a scrub were running at the moment, it would show a different message, right?
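
For comparison, a pool with a scrub still running reports something along these lines under scan: (exact fields vary by version):

    scan: scrub in progress since <date>
        <scanned> scanned at <rate>, <issued> issued at <rate>, 0B repaired, <pct>% done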

Given its start-of-month, could it just be a monthly scrub task?

No, it's not a scheduled one

With crashes, maybe it restarted or got delayed a few times?

For sure, it restarted a few times yesterday.

multi commented 5 months ago

And sde fell asleep...

I'll boot kernel 6.7.11-hardened again with zfs master + the patch from https://github.com/openzfs/zfs/commit/1c22ed4549e6dd9e8251420ed495a6f1979884ea to see if that changes anything in the disks' power-mode behaviour.
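
To confirm which build is actually loaded after the reboot:

    cat /sys/module/zfs/version
    zfs version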

multi commented 5 months ago

Rebooted; all disks are sleeping now (as they should be). So maybe it's not related to the changes here: https://github.com/openzfs/zfs/compare/8f2f6cd...39be46f

Not sure if it was a delayed scrub (but there are no signs of it in zpool status -v). Feel free to close this issue if you don't have any other ideas/questions. I'll reopen it if the odd behaviour shows up again :)