koverstreet / bcachefs


[Linux 6.7.1] data not replicating, despite being told to replicate data. #644

Open LRFLEW opened 9 months ago

LRFLEW commented 9 months ago

I'm running NixOS with the Linux 6.7.1 kernel. For reference, I formatted my drives with the following arguments:

sudo bcachefs format \
    --label=store.hdd0 /dev/sda1 \
    --label=store.hdd1 /dev/sdb1 \
    --label=store.hdd2 /dev/sdc1 \
    --label=store.hdd3 /dev/sdd1 \
    --label=store.hdd4 /dev/sde1 \
    --label=cache.ssd0 --durability=0 --discard /dev/sdf3 \
    --metadata_replicas=2 --data_replicas=2 \
    --metadata_replicas_required=2 --data_replicas_required=2 \
    --erasure_code \
    --background_compression=zstd:5 \
    --foreground_target=cache --promote_target=cache --background_target=store \
    --fs_label=HomeRAID

I uploaded ~1TB of data to the drives and monitored bcachefs fs usage -h to see how it was replicating.

After running sync following the last of the file writes, I checked the usage and saw that the majority of the data had been written to store.hdd0 alone (~700 GB), with rebalancing slowly spreading it across the other drives. Once rebalancing completed, I checked the usage again and saw roughly 200 GB per drive. However, looking at the breakdown at the top of the usage output, most of the "user" data has only a single replica, with less than 30 MB shown as replicated for each of the 10 pairs of disks. Running bcachefs data rereplicate seems to trigger the replication to actually occur, though usage still shows about 10 MB unreplicated per drive (similar to the amount replicated before running rereplicate).
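For reference, these are the commands involved (the mount point below is just illustrative, not my actual path):

# per-device and per-replica usage breakdown, human-readable sizes
sudo bcachefs fs usage -h /mnt/HomeRAID

# walk existing extents and re-replicate any that don't meet data_replicas
sudo bcachefs data rereplicate /mnt/HomeRAID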

LRFLEW commented 9 months ago

Update: This issue has resulted in data loss for me. The drive store.hdd0 is slowly dying and is currently preventing the volume from mounting properly. Luckily, nothing of importance was on there that wasn't already on another device, so the data loss has been non-critical. However, given the format options, I would have expected the array to simply end up in a degraded state after losing a drive, so the fact that it didn't seems like a critical issue.
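For context, my expectation (based on my reading of the docs, so treat the exact invocation as an assumption) was that with store.hdd0 missing, the remaining devices would still mount with the degraded option, roughly like this:

sudo mount -t bcachefs -o degraded /dev/sdb1:/dev/sdc1:/dev/sdd1:/dev/sde1:/dev/sdf3 /mnt/HomeRAID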

All the drives in this array are older drives (10+ years old), so having a drive fail is not entirely unexpected for this system. However, the drive that failed is also the drive that, as noted above, seemed to receive the entirety of the initial writes, so it's likely that this issue caused excess wear on that one drive and contributed to it failing first.

EDIT: After rebuilding the array without the misbehaving drive, I tried zeroing the drive, expecting it to fully die during that process. However, the drive zeroed without any problems and now seems to be working OK. I've re-added it to the array (after also zeroing the other drives in sequence, just in case), and I'm going to test it more. Right now I have two theories as to what happened:

  1. This was indeed a hardware failure, but one restricted to a few specific sectors. When bcachefs tried to read from the failed sectors, it caused the drive to freeze up. Zeroing the drive then allowed the firmware to discard the failed sectors and remap them, avoiding the failure.
  2. There was no hardware failure; instead, some sort of data corruption on the partition caused the bcachefs driver to lock up. The drive as a whole seemed to have poor responsiveness (e.g. when getting SMART data), so the bcachefs driver would also have had to be preventing other parts of the kernel from properly accessing the drive (some SMART checks that might help distinguish these two theories are sketched after this list).
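The SMART checks I have in mind for telling these apart are roughly the following (device path is illustrative); a rising reallocated or pending sector count would point toward theory 1, while clean attributes and a passing self-test would point more toward theory 2:

# attribute table, including Reallocated_Sector_Ct and Current_Pending_Sector
sudo smartctl -a /dev/sda

# run a short self-test, then read back the result
sudo smartctl -t short /dev/sda
sudo smartctl -l selftest /dev/sda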

Since I've wiped the drive, there's no longer any way for me to check whether there was data corruption to test theory two. In either case, removing one of the drives should have worked, since I formatted the filesystem with redundancy, which is exactly what this issue is about. However, it's possible I ran into another bug that was actually causing the failure. It might be worth looking into, but until someone else encounters the issue and preserves the drive's data, there's probably not much that can be done.
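For what it's worth, my understanding (unverified, so consider this a sketch) is that a drive that's still readable would be pulled from the array like this, which is what I would have tried if the drive hadn't locked up:

# move all data and metadata off the device first
sudo bcachefs device evacuate /dev/sda1

# then drop it from the filesystem
sudo bcachefs device remove /dev/sda1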