sdgathman opened this issue 1 month ago
Workaround for the time being: when creating an LV, run "pvs" and select the two PVs with the most free space for the new LV. This is similar to what btrfs does, and for the VM-hosting use case it should still spread I/O across drives in parallel.
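Concretely, the workaround looks something like this (the VG/LV names, size, and device names are just examples):

```
# Show free space per PV so the two emptiest PVs can be picked by hand.
pvs -o pv_name,vg_name,pv_size,pv_free --units g

# Pin the new raid1 LV to the two chosen PVs by listing them explicitly.
lvcreate --type raid1 -m 1 -L 100G -n vm_disk1 vg0 /dev/sdb1 /dev/sdc1
```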
I've been a long-time user of mdadm "raid10" (which is not actually raid1+0) on 3 disks. For those new to "linux raid10", there have been a few articles written about it, e.g. https://serverfault.com/questions/139022/explain-mds-raid10-f2
In essence, it is raid1 with a clever segment allocation scheme. Since "raid10" is commonly understood as "raid1+0", maybe LVM could have a segtype of "raid1e" or similar rather than overloading "raid10" as mdadm does.
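For reference, the md arrangement I'm talking about is created roughly like this (device names are examples; f2 is the "far 2" layout discussed in the link above):

```
# 3-disk md "raid10", far-2 layout: every block is mirrored on two of the
# three disks, with reads striped across all three.
mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=3 \
      /dev/sda1 /dev/sdb1 /dev/sdc1
```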
Alternatively, LVM could have a configurable allocation policy for the LV which accomplishes something similar. Currently, allocating a "raid1" LV uses all available space on the first two PVs before touching the third. Is this intentional (saving the 3rd PV as a spare)?
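To see the allocation behaviour I'm describing, something like this (VG name is an example) shows which PVs and extent ranges each raid1 image actually landed on:

```
# -a includes the hidden rimage/rmeta sub-LVs of a raid1 LV.
lvs -a -o lv_name,segtype,devices,seg_pe_ranges vg0
```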
Why not just get a 4th drive?
Low-end servers come with 4 drive slots. raid1+0 on 4 drives means you can survive 1 drive failure, and the next failure has a 33% chance of destroying all data on the array (one of the 3 remaining drives now holds the only copy of some data, so losing that specific drive is fatal). mdadm raid10 on 3 drives plus a spare means you can survive 1 drive failure, and md immediately brings the spare online and starts syncing. Once the sync is done, you can survive one more drive failure with no issue. This means fewer site visits.
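The 4th slot is populated as a hot spare, roughly like this (device and array names are examples):

```
# Add the 4th drive as a hot spare; md rebuilds onto it automatically
# when one of the three active drives fails.
mdadm --add /dev/md0 /dev/sdd1
```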
Using 3 drives with a striped allocation scheme also gives higher performance than raid1 on 2 drives.
Why not just use mdadm (like I have been for decades)?
mdadm does not report which sectors (even as a first/last range) are affected by mismatch_cnt. LVM doesn't either, BUT the problem is narrowed down to one LV, which is a huge win over mdadm in that respect. Large non-ECC drive caches and flaky SSDs are more and more likely to fail to report corrupted data (plus similar issues with a non-ECC desktop). 256M or more of non-ECC DRAM carries a significant risk of bit flips from cosmic rays.
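For comparison, this is roughly how scrubbing and mismatch reporting look on each side (array, VG, and LV names are examples):

```
# md: scrub the whole array; the mismatch counter is array-wide only.
echo check > /sys/block/md0/md/sync_action
cat /sys/block/md0/md/mismatch_cnt

# LVM: scrub a single raid LV and read its mismatch count - still no sector
# range, but at least the damage is localized to one LV.
lvchange --syncaction check vg0/vm_disk1
lvs -o lv_name,raid_sync_action,raid_mismatch_count vg0/vm_disk1
```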