LRFLEW opened this issue 9 months ago
Update: This issue has resulted in data loss for me. The drive `store.hdd0` is slowly dying, and is currently preventing the volume from mounting properly. Luckily I didn't put anything of importance on there that wasn't already on another device, so the data loss has been non-critical. However, I would have expected the array to simply end up in a degraded state after losing a drive, given the format options, so the fact that it didn't seems like a critical issue.
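What I expected to be able to do after the drive loss was bring the filesystem up degraded on the surviving members, roughly like the sketch below (device paths and mount point are placeholders, not my actual layout):

```sh
# Mount the remaining members without the failed drive; bcachefs accepts
# a colon-separated device list, and the 'degraded' mount option allows
# mounting with a missing/failed device. Paths here are placeholders.
mount -t bcachefs -o degraded /dev/sdb:/dev/sdc:/dev/sdd:/dev/sde /mnt/store
```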
All the drives in this array are older drives (10+ years old), so having a drive fail is not entirely unexpected for this system. However, the drive that failed is also the drive that, as I noted, seemed to be getting the entirety of the first writes, so it's likely that this issue caused excess wear on that one drive and is why it failed first.
EDIT: So after rebuilding the array without the misbehaving drive, I tried zeroing the drive, expecting the drive to fully die during that process. However, the drive actually zeroed without any problems, and now seems to be working ok. I've re-added it to the array (after also zeroing the other drives in sequence just in case), and I'm gonna test it more. Right now I have two theories as to what happened:
Since I've wiped the drive, there's no longer any way for me to check whether there was any data corruption, so I can't test theory two. In either case, removing one of the drives should have worked, since I specified redundancy when formatting, which is exactly what this issue is about. However, it's possible I encountered another bug that was actually causing the failure. It might be worth looking into, but until someone (else) encounters the issue and saves the drive's data, there's probably not much that can be done to investigate.
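For comparison, cleanly dropping a member from a replicated bcachefs filesystem should normally look something like this (the device path is a placeholder, and the exact arguments may differ between bcachefs-tools versions):

```sh
# Migrate all data off the device that's going away (placeholder path).
bcachefs device evacuate /dev/sdb

# Once it's empty, remove it from the filesystem.
bcachefs device remove /dev/sdb
```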
I'm running NixOS with the Linux 6.7.1 kernel. For reference, I formatted my drives with the following arguments:
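(The device paths below are placeholders and the flags/values are a reconstruction rather than a verbatim copy; the general shape was a multi-device format with per-device labels like `store.hdd0` and replication enabled.)

```sh
# Illustrative sketch only: five placeholder devices, each labeled under
# the "store" group, with 2x replication requested at format time.
bcachefs format \
    --replicas=2 \
    --label=store.hdd0 /dev/sda \
    --label=store.hdd1 /dev/sdb \
    --label=store.hdd2 /dev/sdc \
    --label=store.hdd3 /dev/sdd \
    --label=store.hdd4 /dev/sde
```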
I uploaded ~1TB of data to the drives and monitored `bcachefs fs usage -h` to see how it was replicating. After I ran `sync` following the last of the file writes, I checked the usage and saw that the majority of the data written had gone to just `store.hdd0` (showing ~700 GB), and rebalancing was slowly spreading the data around to the other drives. Once the rebalancing was complete, I checked the usage again and saw that there was ~200 GB per drive. Looking at the breakdown at the top of the usage output, the majority of the "user" data has only one replica, with only <30 MB shown as replicated for each of the 10 pairs of disks. Running `bcachefs rereplicate` seems to trigger the replication to actually occur, though usage still shows about ~10 MB unreplicated per drive (similar to the amount replicated before running rereplicate).
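For reference, the commands involved can be run like this (the mount point is a placeholder; in the bcachefs-tools version I have, rereplicate lives under the `data` subcommand):

```sh
# Show per-device usage and the replication breakdown in human-readable units.
bcachefs fs usage -h /mnt/store

# Walk existing data and create any replicas that are missing.
bcachefs data rereplicate /mnt/store
```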