kdave / btrfs-progs

Development of userspace BTRFS tools
GNU General Public License v2.0

badblocks support on user space #499

Open axet opened 2 years ago

axet commented 2 years ago

Hello!

It would be nice to have badblocks support in btrfs. It has already been described here:

My microSD card on a Raspberry Pi is failing. Some LBA sectors constantly corrupt data without causing any IO errors:

[/dev/mmcblk0p2].write_io_errs    0
[/dev/mmcblk0p2].read_io_errs     0
[/dev/mmcblk0p2].flush_io_errs    0
[/dev/mmcblk0p2].corruption_errs  263165505
[/dev/mmcblk0p2].generation_errs  0

After I moved to the DUP data profile, all the errors are gone, probably because they have been masked by the btrfs kernel driver and automatically corrected.

Since badblocks support is not yet implemented in the kernel driver, I have an idea for how to implement it in user space.

Btrfs is known to support swap files, and swap files have to stick to specific sectors. We could use this feature to cover bad blocks with badblock swap files (sized from 512 bytes up to 1 MB) located on top of the bad sectors and stored inside a /badblocks directory on the filesystem.

Another solution is to use dm-linear to cover bad blocks, but it is very heavy and destructive.

EDIT: this drive reports no badblocks:

root@rpi:/home/axet# badblocks -s /dev/mmcblk0p2
Checking for bad blocks (read-only test): done

To clean up all the corruption dmesg spam (corruption_errs 263165505) I only had to remove a few files:

rm /var/log/daemon.log
rm /var/log/syslog.1
rm /var/log/syslog
rm /var/log/kern.log
rm /var/log/messages

I guess btrfs could be smarter and mark corrupted sectors to prevent this corruption spam.

Forza-tng commented 2 years ago

IMHO this would be a welcome feature.

Some thoughts:

The use cases are many.

kdave commented 2 years ago

This has been discussed already, I'm not sure where, I will try to find it. The general advice is to stop using such hardware, as the number of bad sectors will only grow, but there was some use case for tracking "unallocatable" blocks.

axet commented 2 years ago

@kdave for production, yes. But what if you have a 2TB home hard drive with a few bad blocks? I'm using one. It started failing about 5 years ago, and it keeps working. I use it for my multimedia storage. Replacing such a drive is quite expensive; I'd like to keep using it until 90% of it is covered in bad blocks. It could work for another 20 years or so. I hope btrfs can help me.

axet commented 5 months ago

I want to share a bit more of my experience with btrfs on a failing drive with a few bad blocks.

While I was using btrfs on my failing 2TB drive, sometimes all IO operations were blocked because of errors, which made the whole drive unusable. For example, I was unable to check an already downloaded torrent for corruption, and I got weird "Bad port" errors in the torrent client. I guess this happens because btrfs does not recover properly from failing sectors and random IO errors and blocks all read operations completely in some btrfs-specific way.

I guess making btrfs more reliable on bad drives is a good thing. Not every drive / HDD / SSD handles media IO errors correctly, and making the FS unusable because of those errors is not best practice. Even a working HDD with correct SMART can eventually run out of spare blocks, and then we face the same issues as with devices whose SMART does not work properly.

ext4 can dynamically mark bad blocks using the 'fsck -c' command, excluding sectors that cause read errors from further use. This also keeps the drive from reallocating those sectors, preserving the "spare" space for replacing critical sectors such as the "MBR" or the first/last 1 MB.

Zygo commented 5 months ago

btrfs does not recover properly from failing sectors and random IO errors and blocks all read operations completely in some btrfs-specific way.

Are you sure it's btrfs and not the drive? A lot of consumer HDD firmware is configured to retry read IOs for as long as 2 minutes per block. As drives age, they are not able to track spindle rotation as well as a new drive, and the symptom is slower data rates in specific regions of the drive, sometimes thousands of times slower (not to mention that if the media has decayed to the point of UNC sectors, that decay may be ongoing). Technically the data may still be readable, but the drive is useless due to the degrading IO speed.

Check with iostat -x 10, watch for big numbers in the r_await column to see if this is happening.

You could hack up badblocks (if someone hasn't done it already) to measure IO times and mark any sector with a long access time as bad (similar to what dd_rescue does) but you'll have to keep updating the bad blocks map over time as more of the media degrades beyond the drive's ability to read.
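A minimal sketch of that timing idea in Python, not a modification of badblocks itself; the device path, chunk size, and slowness threshold are illustrative placeholders:

    import mmap
    import os
    import time

    DEV = "/dev/sdX"               # placeholder device; point this at the drive to test
    CHUNK = 1 << 20                # read in 1 MiB chunks (a real tool could go per-sector)
    SLOW_NS = 100_000_000          # flag anything slower than 100 ms per chunk

    fd = os.open(DEV, os.O_RDONLY | os.O_DIRECT)   # O_DIRECT bypasses the page cache
    buf = mmap.mmap(-1, CHUNK)                     # page-aligned buffer, required by O_DIRECT
    size = os.lseek(fd, 0, os.SEEK_END)

    offset = 0
    while offset < size:
        n = min(CHUNK, size - offset)
        t0 = time.monotonic_ns()
        try:
            got = os.preadv(fd, [memoryview(buf)[:n]], offset)
        except OSError:
            got = -1                               # hard read error: definitely a bad region
        dt = time.monotonic_ns() - t0
        if got <= 0 or dt > SLOW_NS:
            print(f"slow or unreadable region at byte offset {offset}: {dt} ns")
        offset += n
    os.close(fd)

This only narrows things down to chunk-sized regions; mapping individual sectors would need a second, finer-grained pass over the slow chunks, and the resulting list still goes stale as the media keeps degrading.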

Zygo commented 5 months ago

For SD cards (and some low-end SSDs), bad block maps are entirely useless.

Bad block maps are based on LBAs to locate media defects, but with every write, a SD device will continually relocate LBAs to different parts of the media. If the media is failing, data that is mapped to the failing media locations will be corrupted. It does not matter where the filesystem directs writes, since it has no control over which parts of the underlying media will be used to store new data, or where existing data may be relocated by some future write operation regardless of the write's location. With a failing SD card, any write can result in the corruption of any data on the device.

Media faults are always moving to new LBAs on a SD card and the number of defects typically increases over time, so if scrub isn't detecting defects today, it's not looking at the parts of the media where the defects are (i.e. the defects have been relocated to free space LBAs). Eventually some write will place some data LBA on defective media and a csum mismatch will appear somewhere in the filesystem in an endless, unwinnable game of whack-a-mole.

The day to replace your SD card (and consider switching to a different SD card vendor or model more suitable for your application) is the day the first csum error is detected in btrfs. Or use dup metadata and data profiles to automate the game of whack-a-mole for a few months until the moles inevitably win.

badblocks cannot detect SD card defects unless the SD device is capable of detecting and reporting them, but most aren't, so badblocks often reports 0 errors on failing SD cards. In the best possible case, with a destructive read-write test, badblocks might give a list of LBAs that will become out of date over time (and maybe also reduce the card's performance or lifespan even further).

badblocks support can work on HDDs that last a long time without degrading performance, and on SSDs where the firmware does not allow known bad media sectors to be mapped to new LBAs. For SD cards that can't distinguish between working and failing media sectors, there's no hope of getting badblocks to work.

axet commented 5 months ago

Since media usually reports bad IO errors via SMART (as corrupted sectors or unreadable sectors), I assume this behavior comes from BTRFS. I see a "bad port" error message, which seems btrfs-specific.

I have an old (25000 hours) HDD which has bad blocks. You can see from my 2-year-old comment that the drive keeps working. I was using BTRFS on that drive, and it turns out BTRFS is not well designed to work on failing drives, because my experience shows a lot of issues related to not handling bad blocks on a device properly. Sometimes you have to reboot to be able to write to the device; sometimes it won't let you read a directory and you have to remove it entirely and re-download. It feels buggy.

I think it is wise not to write to sectors known to be bad, regardless of whether the drive supports proper spare mapping or not.

And today I decided to move from BTRFS to ext4, since that filesystem supports bad blocks dynamically with the 'fsck -c' flag. So I can keep using this drive, using badblocks and simply ignoring bad sectors without trying to rely on SMART. ext4 is only bad in that it has no data CRC, no duplicate metadata, and no online shrinking, so I'm not happy about the decision.

An ideal drive would let you mark bad blocks manually so you can keep using the device until there are no working sectors left. With SMART commands it should be possible to set the first and last sectors (at least to keep MBR-based drives working) as well as sectors to ignore, by keeping the bad sector table on the drive itself. But right now SMART works differently and only allows recovering bad sectors while "hidden" spare space is available.

[51079.665939] ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x50000 action 0x0
[51079.665943] ata8.00: irq_stat 0x40000001
[51079.665944] ata8: SError: { PHYRdyChg CommWake }
[51079.665946] ata8.00: failed command: READ DMA EXT
[51079.665946] ata8.00: cmd 25/00:00:90:e3:34/00:0a:6e:00:00/e0 tag 16 dma 1310720 in
                        res 51/40:00:30:e4:34/00:00:6e:00:00/00 Emask 0x9 (media error)
[51079.665950] ata8.00: status: { DRDY ERR }
[51079.665950] ata8.00: error: { UNC }

https://paste.debian.net/1319240/

EDIT: I only know of one piece of software that shows per-sector read speed, but it works on DOS / Windows only, with no Linux support; it is called "Victoria HDD/SSD".

Zygo commented 5 months ago

But right now SMART works differently and only allows recovering bad sectors while "hidden" spare space is available.

There are two limits: one is the difference in space between the nominal drive size and the number of good sectors. That's set at the factory based on how many bad sectors were found after manufacturing, so some drives have hundreds of spare sectors while others have tens of thousands. The second limit is the size of the reallocation table itself. That limit is much harder to raise, as the whole table has to be kept in drive memory (or the hard drive has to include some solid state storage for the table, as some of the larger WD drives now do).

Software layers that can emulate this in the Linux host already exist. dm-linear looks like exactly what we want:

If we have a method to translate a badblocks list into a dmsetup table, we should have all we need to adapt the device to meet btrfs's error-handling requirements. That brings me back to the OP:

Another solution is to use dm-linear to cover bad blocks, but it is very heavy and destructive.

Neither "heavy" nor "destructive" seems to be true. What am I missing here?

I made a PoC to translate a badblocks list into a dmsetup table for an existing filesystem in a few minutes. It works the same way hard disk firmware remapping does, mapping each new defective block to a reserve area in the same device while leaving all other blocks unchanged. The tricky and boring bits are finding a place to store the table and keeping it up to date as defects grow.
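Not that PoC, but a minimal sketch of what such a translation could look like, assuming 512-byte sectors, a badblocks list expressed in those sectors, and a reserve pool carved out of the end of the same device (all names and sizes here are illustrative):

    # Hypothetical translator from a badblocks-style sector list to a dm-linear table.
    # Assumes 512-byte sectors and a reserve pool at the end of the same (lower) device.

    def make_table(lower_dev, dev_sectors, bad_lbas, reserve_sectors=10000):
        upper_size = dev_sectors - reserve_sectors   # the upper device is shorter
        spare = upper_size                           # next free sector in the reserve pool
        table, cursor = [], 0
        for bad in sorted(set(b for b in bad_lbas if b < upper_size)):
            if bad > cursor:                         # pass through the healthy run before it
                table.append(f"{cursor} {bad - cursor} linear {lower_dev} {cursor}")
            table.append(f"{bad} 1 linear {lower_dev} {spare}")   # remap one bad sector
            spare += 1
            cursor = bad + 1
        if cursor < upper_size:                      # pass through the healthy tail
            table.append(f"{cursor} {upper_size - cursor} linear {lower_dev} {cursor}")
        return "\n".join(table)

    # Example: two bad sectors on a ~2TB drive; feed the output to
    # "dmsetup create upperdev" and put the filesystem on /dev/mapper/upperdev.
    print(make_table("/dev/sdb", 3907029168, [12345, 9876543]))

Each emitted line is ordinary dm-linear table syntax (logical start, length, target, lower device, lower offset), so nothing on the device itself has to change; the hard part, as noted above, is persisting this table and regenerating it as new defects appear.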

I can't find an existing tool like this with a quick github search. I find a lot of the opposite: tools using dm modules to simulate bad blocks on working devices, and a lot of people asking for badblocks support in random filesystems (they don't find an existing tool like this either).

Does everyone who solves this problem roll their own tool and then not tell anyone about it? Or am I using the wrong search words?

I decided to move from BTRFS to ext4, since that filesystem supports bad blocks dynamically with the 'fsck -c' flag.

If you really need this to work at the filesystem level, ext4 is the only modern filesystem that still supports this legacy feature. xfs, zfs, btrfs, bcachefs, and ntfs (on Linux) were designed after badblocks support in filesystems became redundant, either implemented in hardware or in lower block layers like dm-linear. If your drive has stable bad blocks, and your ambitions are limited to emulating a legacy e2fsck feature, then a dm device configured to remap all the bad blocks should be able to hold a working btrfs on top, and it should be possible to insert this layer on top of an existing filesystem with at most one reboot.

Thinking aloud, some integration into the filesystem might be useful to get capabilities beyond what ext4 does, particularly if the bad blocks list is not stable and additional grown defects need to be remapped. The filesystem can provide a list of bad sectors it has discovered to the user to be added to the dm-linear map, or btrfs could do the mappings itself. That might be something the raid-stripe-tree is good for: a way around the "both metadata writes fail -> go to reboot" issue. If a pair of metadata writes fails, the raid-stripe-tree might be able to relocate the blocks to a new, less-failing location, without aborting the transaction and forcing read-only. Grossly oversimplified, we'd treat a drive with bad sectors as a weird kind of HM-SMR drive.

axet commented 5 months ago

By "heavy" I mean a lot of hand work, not performance. I guess bad word choosing. By "destructive" I mean you have to reformat upper laying fs for a new dm-liner size every time you remove a bad block or you have to keep spare sectors table somewhere else \ another drive (this comes to "heavy" / annoying).

There is no good filesystem for a degrading HDD/SSD. ext4 looks much more stable and user friendly for that purpose than btrfs.

EDIT: I refer to SMART issues here only as an example, to show that SMART is not perfect and does not solve all issues, and that even a working drive with good firmware can still have issues (running out of spare sectors, bad MBR) which CAN be addressed by the filesystem.

axet commented 5 months ago

I didn't mention that all these HDD bad-block issues I'm talking about were caused by an HDD with 32000 hours and only 24 reallocated sectors on disk. Not many bad sectors at all.

https://linux-hardware.org/?probe=3ce66e47b9&log=smartctl

After I finished the migration from btrfs to ext4 (using no additional drives; I relocated data in small chunks from one partition to another), the HDD shows 0 pending sectors and 24 reallocated sectors. The device seems fine. But my experience with how often BTRFS was showing READ-ONLY, and my torrent client's "Bad ports" errors, tells me that BTRFS does not handle bad blocks very well.

I hope it will be addressed.

Zygo commented 5 months ago

By "heavy" I mean a lot of hand work, not performance. [...] By "destructive" I mean you have to reformat upper laying fs for a new dm-liner size every time you remove a bad block or you have to keep spare sectors table somewhere else \ another drive (this comes to "heavy" / annoying).

Yeah, the obvious first step is to automate the hand work. We'd need a tool in initramfs to do this for the root device, but once that's done, it should be no harder to manage than mdadm/lvm/bcache/etc. Users have been asking for this in every filesystem forum I can find, over a period spanning decades, and one tool could add the support to every filesystem at once.

The only destruction required is to shorten the upper layer device to make a pool of remap targets at the end of the lower layer device (and of course the data in the remapped sectors is lost, but they're bad sectors so there wasn't any data in them to lose anyway). btrfs can shrink online, and ext4 can shrink offline, so making space shouldn't be a problem (other filesystems without shrinking capability might find this harder to deal with). The reserved allocation can be done in batches, say 10000 sectors per batch, so you don't need to change the upper layer size until a huge number of sectors fail.

BTRFS showing READ-ONLY

That's expected behavior when:

  1. all redundant copies of a metadata write fail (if only some copies fail, the filesystem keeps going with one of the working copies), or
  2. a write operation depends on a failing metadata page read that is uncorrectable.

ext4 can use the badblocks list to avoid placing metadata on known bad blocks, so it never hits these cases when the list of bad blocks is stable. ext4 still forces read-only or reboot if a metadata write fails on an unknown bad block, i.e. one the filesystem was not told to avoid in advance.

If the set of bad blocks is stable (as in your case), you can avoid the read-only transitions today using a dm-linear mapper to avoid presenting the bad blocks to the filesystem. That will work with every filesystem, so it also works with btrfs.

If the set of bad blocks is not stable (which is not your case), btrfs can do better than ext4.

In case 1, btrfs can continue without going read-only if it can complete the write at some other location. That's the raid-stripe-tree idea I posted above: allocate a new block for the metadata and try writing again, then if that works, add a redirection for the failed block so that parts of the tree that have already been written are still valid. That minimizes the impact on the rest of the btrfs transaction code, and leverages existing work that has been done on the stripe tree since this issue was opened in 2022.

Grossly oversimplified, raid-stripe-tree is a built-in version of dm-linear for zoned drives, and a zoned drive is a drive where every sector is bad except the one after the last written sector. btrfs should be getting closer to being able to put these together to handle grown-defect bad sectors on CMR drives too...if anyone's still using CMR drives by the time it's implemented.

For case 2, always use dup or raid1* metadata. One bad sector under a metadata page destroys the filesystem when using the other profiles. With dup metadata the filesystem can survive any bad sectors that aren't spaced exactly 512MiB apart.

axet commented 5 months ago

Interesting. A few things are missing here:

  1. btrfs supports online shrinking, which means the bad blocks list can be altered without reboots! But I'm not sure if dm-linear supports dynamic updates (maybe we need additional kernel patches for that).
  2. There is no way to report bad blocks back to the FS.
  3. Metadata write errors will be missed and not recorded in the bad blocks list, maybe only in dmesg.
  4. Can dm-linear be counted as a destructive method, since we have to reformat everything to add the dm-linear layer?

Zygo commented 5 months ago
  1. dm-linear can be reloaded (use dmsetup load upperdev instead of dmsetup create upperdev; see the sketch after this list). LVM relies on this to move live volumes around. It's mature code that has been running for decades.
  2. correct, but that's the point: it remaps bad sectors so the filesystem does not have to be aware where they are (same as the remapping function in a hard drive)
  3. a btrfs scrub will identify bad sectors that are currently part of the filesystem (not in free space) and not yet remapped. The block locations will go to dmesg. It would be nice if we could get that information out of the block layer directly somehow, to make the emulation of HDD sector remapping more complete.
  4. no reformat is necessary. dm-linear just glues together slices of existing block devices. It doesn't care about anything that is on the devices.
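A minimal sketch of point 1, reusing the hypothetical make_table() helper from the earlier comment; the device name is illustrative:

    import subprocess

    def reload_table(upper_name, table_text):
        # Load the new table into the device's inactive slot, then swap it in on
        # resume -- the same sequence LVM relies on to move live volumes around.
        subprocess.run(["dmsetup", "load", upper_name], input=table_text,
                       text=True, check=True)
        subprocess.run(["dmsetup", "suspend", upper_name], check=True)
        subprocess.run(["dmsetup", "resume", upper_name], check=True)

    # e.g. after appending newly discovered bad sectors to the list:
    # reload_table("upperdev", make_table("/dev/sdb", 3907029168, updated_bad_lbas))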

That said, it would be a good idea to put a distinct superblock on the lower device so udev can find it, and hide the lower device's existing filesystem superblocks so udev can't find them through the lower device any more. We can do that on an existing filesystem by moving only a few blocks out of the way.

The easiest way to do that: remap the lower device's superblocks as if they were bad sectors, and copy the superblocks from the lower device to the upper device. Also remap our remapper tool's superblock the same way.

After that, the lower device no longer has the filesystem superblock, but the upper device looks exactly like a shorter version of the original lower device.

https://github.com/g2p/blocks uses a similar approach to convert an existing filesystem to lvm in-place.

axet commented 5 months ago

With all of that done, it would extend HDD life! Since SMART would only need to relocate sectors from the crucial parts of the HDD (the first and last MB, or even less, MBR only), the spare table would be consumed more slowly, since the drive itself no longer needs to replace sectors.

axet commented 4 months ago

I wrote a simple Python script which shows per-sector read timings in ns. I have never seen anything like it done for Linux, for some reason. The script reads sector by sector (you need a modern CPU so the measurement does not affect the read speed), letting you scan the whole disk area and see per-sector delays in ns.

After 10 runs, my HDD died :) Be careful running it on old drives. I could see that my 2TB drive had only one bad area, at about 70% of the drive, and the rest was working fine. I guess some metallic dust got scratched off the bad area and caused the drive to fail to spin up again after the 10th test.

A normal drive should show 5000-10000 ns per sector. A failing drive can show 1.5 seconds per sector, and 3 to 60 seconds per bad block.

https://gitlab.com/axet/homebin/-/blob/debian/hddtest.py?ref_type=heads