Closed by ypid 6 years ago
Hi, thanks for your feedback.
Did you actually try it, to see how it behaves? You can also create a multi-disk btrfs filesystem with some file-backed loop devices; no real extra hardware is needed for that.
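For example, something along these lines should do (a sketch; file names, sizes and the mount point are assumptions):

truncate -s 5G disk0.img disk1.img
losetup -f --show disk0.img    # prints the loop device it picked, e.g. /dev/loop0
losetup -f --show disk1.img    # e.g. /dev/loop1
mkfs.btrfs -m raid1 -d raid1 /dev/loop0 /dev/loop1
mkdir -p /mnt/test
mount /dev/loop0 /mnt/test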
If a disk dies and starts causing read and write errors, then the check will surely let you know. Isn't that enough already?
If you reset the error counters and willfully mount the filesystem in a degraded state with a disk missing, then I'd rather have the check report OK and be able to notify you of disk errors on a second failing disk, or alert you because you're running out of disk space on the degraded filesystem.
Can you describe what else you would expect?
Do you have any additional feedback, ypid? If not, I'm going to close this one soon.
Closing because of a major lack of activity and because I think there's no actual problem.
@knorrie Thanks for your responses and sorry for my total absence/lack of time.
I now found the time to test this. I created a btrfs raid1 backed by two loop devices, unmounted it, detached one of the loop devices, and then mounted the filesystem again in degraded mode.
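Roughly like this, continuing from a loop device setup like the one sketched above (device names and the mount point are assumptions):

umount /mnt/test
losetup -d /dev/loop1                     # detach one of the backing loop devices
mount -o degraded /dev/loop0 /mnt/test

After that, btrfs fi show shows the following: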
Label: none uuid: 23e3d482-e2c1-4ee7-905f-7cd48ddc4630
Total devices 2 FS bytes used 512.00KiB
devid 1 size 4.88GiB used 2.05GiB path /dev/loop0
*** Some devices missing
But the check still reports OK:
[<btrfs.ctree.DevItem object at 0x7163d4b16438>, <btrfs.ctree.DevItem object at 0x7163d4b164e0>]
devid 1 uuid e703e807-0879-4298-85cc-4acb16be8f47 bytes_used 2197815296 total_bytes 5242880000 path /dev/loop0
devid 1 write_errs 0 read_errs 0 flush_errs 0 generation_errs 0 corruption_errs 0
devid 2 uuid 107aca8f-afdb-46fa-8158-b79b9f3795a8 bytes_used 1358954496 total_bytes 5242880000 path /dev/loop1
devid 2 write_errs 0 read_errs 0 flush_errs 0 generation_errs 0 corruption_errs 0
BTRFS OK
Total size: 9.77GiB, Allocated: 3.31GiB (37%), Wasted: 800.00MiB (Reclaimable: 800.00MiB), Used: 640.00KiB (0%), 2 device(s)
Note that I added the output of fs.dev_info(device.devid) in the hope of being able to catch this condition in the code, but I did not find any way to check for it. Any idea?
(I miss the old days when one could just check /proc/mdstat. Just kidding.)
Like I described earlier, you're deliberately doing a "clean" degraded mount, which produces a very different situation from a dying disk that is starting to show read and write errors.
Depends. I have already had the case where a disk was physically detached from a raid array, and I would definitely want to get alerted for that. Currently, the check does not alert on "Some devices missing", regardless of how that situation came about. I retested with a healthily mounted btrfs raid1 and then hard-removed one of the "disks" while it was mounted:
losetup -d /dev/loop1
rm /dev/loop1
Doing that does not remove the disk from the running btrfs filesystem. You can check with e.g. lsof that the kernel still happily has the block device open, and you'll indeed see that activity on the filesystem does not make the error counters go up, because there is no error. Everything is still working.
losetup -d only removes the possibility of opening the file again as a block device. rm /dev/loop1 prevents you from opening that path again, but, as with removing any file in Linux, it has no effect on whoever already has it open.
btrfs fi show does not use the "online" functionality of the filesystem; it directly reads block devices. It just looks around at the block devices you have and tries to find all devices that make up a btrfs filesystem, regardless of whether they're mounted or not.
So the "devices are missing" only tells you that you won't be able to mount it again after you umount it, but it's useless information in this test case.
That also means that my initial suggestion to use loop devices to simulate this was a bit useless. ;-] But yes, I also didn't test that myself until today.
However, I think you can still simulate this by getting device mapper involved and figuring out how to use the error target to make a device fail.
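An untested sketch of that idea (device and target names, sizes and the mount point are assumptions): put a device-mapper linear target on top of one of the loop devices, build the filesystem on top of it, and later swap the table for the error target, so that all I/O to that device starts failing while the filesystem stays mounted:

SECTORS=$(blockdev --getsz /dev/loop1)
dmsetup create flaky --table "0 $SECTORS linear /dev/loop1 0"
mkfs.btrfs -m raid1 -d raid1 /dev/loop0 /dev/mapper/flaky
mount /dev/loop0 /mnt/test

# later: make every read and write to the second device fail
dmsetup suspend flaky
dmsetup load flaky --table "0 $SECTORS error"
dmsetup resume flaky

The failing reads and writes should then show up in the per-device error counters the check already looks at.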
Thanks for your feedback! Right, the rm /dev/loop1 felt a bit weird and I should have checked lsof, which of course still shows /dev/loop1 as being used.
So I now retested this with a physical USB flash drive directly attached to a host system and created a btrfs raid1 on it together with one loop device. I created a file on the filesystem and then physically unplugged the USB drive. The result was errors being logged instantly by the kernel and the write_io_errs value increasing, which is checked for by the plugin and thus resulted in the critical state I wanted to see.
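For reference, the same counters can also be read directly with btrfs-progs (mount point assumed):

btrfs device stats /mnt/test

It prints the write_io_errs, read_io_errs, flush_io_errs, corruption_errs and generation_errs counters for every device of the filesystem.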
I then tested the same again with a newly formatted raid1, with the only difference that I ran sync and waited a few minutes after creating the test file. Now, when I unplugged the USB drive, the kernel only noticed the USB device being gone; the check still reported OK. Reading the test file also did not show any checksum issues. As far as I remember from a talk about btrfs, this is all expected, because only the good device was accessed. Still, for monitoring it would be nice to know this instantly. I then wrote a second test file and got the write_io_errs value increasing. I am sure that a scrub, which I run regularly, will also detect this.
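For completeness, the manual way to kick off such a scrub (mount point assumed):

btrfs scrub start -Bd /mnt/test    # -B: wait for it to finish, -d: print per-device statistics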
So, as you said, in practice it works, and it works for me. I appreciate your hint about using device mapper to test for this. I might look into it, but as the check already works, this is not urgent for me. I put it on my long-term todo list. Let's see.
Good!
When reading from RAID1 chunks, there's a 50% chance (pid % 2) that it will try to read from the missing disk (well, if you don't know the pid of the process; otherwise it's a 0% or 100% chance...).
As soon as the filesystem does any write, it'll hit all the disks when it writes the superblocks, and it will report the write errors. So only if the filesystem is doing absolutely nothing at all might there be a delay before the monitoring sees errors.
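A quick way to avoid that delay during a test is to force a write and the resulting superblock update yourself (paths assumed):

touch /mnt/test/poke
sync    # the transaction commit writes superblocks to all devices, so a dead one shows write errors right away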
But yes, it depends on the feedback loop of errors after reading or writing. Btrfs currently does not have a separate monitoring daemon which continuously watches kernel events for disks disappearing or other relevant things.
Have fun, Knorrie
Hi,
When reading check_btrfs, I am unsure how the check behaves if one disk is absent or goes offline in a RAID setup. Could RAID support be added?