kdave / btrfs-progs

Development of userspace BTRFS tools
GNU General Public License v2.0

`Unallocated` & `Data` extremely uneven across 3 drives (14TB, 14TB, 12TB) #686

Closed · psla closed this issue 9 months ago

psla commented 9 months ago

This may be working as intended, but based on various Reddit threads it doesn't sound like it.

  1. I started with 2x14TB drives in a RAID-1 array and loaded a lot of data (~10TB) onto them.
  2. I observed a nice, uniform allocation of ~10.4TiB on each drive.
  3. I ran duperemove. It deduplicated some data :-)
  4. I then executed `btrfs device add /dev/mapper/btrfs3 /mnt/nas` to add a 12TB drive to this array. The drive was added but remained completely unallocated. To the best of my understanding, that would imply that (without any additional rebalancing) the maximum usable free space would be approximately 4.6TiB (2.3 + 2.3 from btrfs1 & btrfs2, mirrored to btrfs3; see the back-of-the-envelope check after this list):
    
    ~# btrfs filesystem usage  /mnt/nas
    Overall:
    Device size:                  36.38TiB
    Device allocated:             20.86TiB
    Device unallocated:           15.52TiB
    Device missing:                  0.00B
    Device slack:                    0.00B
    Used:                         20.16TiB
    Free (estimated):              8.11TiB      (min: 8.11TiB)
    Free (statfs, df):             2.65TiB
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)
    Multiple profiles:                  no

Data,RAID1: Size:10.40TiB, Used:10.05TiB (96.66%)
   /dev/mapper/btrfs1     10.40TiB
   /dev/mapper/btrfs2     10.40TiB

Metadata,RAID1: Size:37.00GiB, Used:34.39GiB (92.95%)
   /dev/mapper/btrfs1     37.00GiB
   /dev/mapper/btrfs2     37.00GiB

System,RAID1: Size:8.00MiB, Used:1.44MiB (17.97%)
   /dev/mapper/btrfs1      8.00MiB
   /dev/mapper/btrfs2      8.00MiB

Unallocated:
   /dev/mapper/btrfs1      2.30TiB
   /dev/mapper/btrfs2      2.30TiB
   /dev/mapper/btrfs3     10.91TiB


5. So I ran a balance, `btrfs balance start -dusage=5 /mnt/nas`, but that didn't do much (there were not many nearly-empty block groups).
6. So I forced a full rebalance, including metadata: `btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/nas`
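
As a quick back-of-the-envelope check on the estimate in step 4 (this is just my own reasoning about RAID1 pairing, not an official btrfs calculation): every chunk needs two copies on two different devices, so usable space is capped either by half the total unallocated space or, when one device dominates, by the sum of the other devices' unallocated space.

    # unallocated GiB per device, taken from the output in step 4
    f1=2355; f2=2355; f3=11172
    total=$(( f1 + f2 + f3 ))
    largest=$f3
    others=$(( total - largest ))
    if (( largest > others )); then
        # the big device can only mirror as much as the others can hold
        echo "usable data ~ ${others} GiB"
    else
        echo "usable data ~ $(( total / 2 )) GiB"
    fi

That prints roughly 4710 GiB, i.e. the ~4.6TiB mentioned in step 4.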

The rebalance finished, but now the data is skewed to the smallest disk:

~# btrfs filesystem usage /mnt/nas
Overall:
    Device size:                  36.38TiB
    Device allocated:             20.33TiB
    Device unallocated:           16.05TiB
    Device missing:                  0.00B
    Device slack:                    0.00B
    Used:                         20.33TiB
    Free (estimated):              8.03TiB      (min: 8.03TiB)
    Free (statfs, df):             7.43TiB
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)
    Multiple profiles:                  no

Data,RAID1: Size:10.13TiB, Used:10.13TiB (100.00%)
   /dev/mapper/btrfs1      5.27TiB
   /dev/mapper/btrfs2      5.28TiB
   /dev/mapper/btrfs3      9.71TiB

Metadata,RAID1: Size:35.00GiB, Used:33.45GiB (95.56%)
   /dev/mapper/btrfs1     28.00GiB
   /dev/mapper/btrfs2     27.00GiB
   /dev/mapper/btrfs3     15.00GiB

System,RAID1: Size:32.00MiB, Used:1.42MiB (4.44%)
   /dev/mapper/btrfs1     32.00MiB
   /dev/mapper/btrfs2     32.00MiB

Unallocated:
   /dev/mapper/btrfs1      7.43TiB
   /dev/mapper/btrfs2      7.43TiB
   /dev/mapper/btrfs3      1.19TiB


I decided not to worry; some random redditor said that btrfs will allocate chunks starting with the devices with the most free space. Well, I was wrong: I just moved some data to this volume, and 0.18TiB was used from `btrfs3`:

~# btrfs filesystem usage /mnt/nas
Overall:
    Device size:                  36.38TiB
    Device allocated:             20.69TiB
    Device unallocated:           15.69TiB
    Device missing:                  0.00B
    Device slack:                    0.00B
    Used:                         20.64TiB
    Free (estimated):              7.87TiB      (min: 7.87TiB)
    Free (statfs, df):             7.36TiB
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)
    Multiple profiles:                  no

Data,RAID1: Size:10.31TiB, Used:10.29TiB (99.79%)
   /dev/mapper/btrfs1      5.36TiB
   /dev/mapper/btrfs2      5.37TiB
   /dev/mapper/btrfs3      9.89TiB

Metadata,RAID1: Size:36.00GiB, Used:33.79GiB (93.87%)
   /dev/mapper/btrfs1     29.00GiB
   /dev/mapper/btrfs2     28.00GiB
   /dev/mapper/btrfs3     15.00GiB

System,RAID1: Size:32.00MiB, Used:1.44MiB (4.49%)
   /dev/mapper/btrfs1     32.00MiB
   /dev/mapper/btrfs2     32.00MiB

Unallocated:
   /dev/mapper/btrfs1      7.34TiB
   /dev/mapper/btrfs2      7.34TiB
   /dev/mapper/btrfs3      1.01TiB



Having read some big warnings about running out of disk space (especially for metadata), I am worried that `btrfs3` is disproportionately used (it's the smallest disk in the array, yet it is used the most).

1. Is the current behavior working as expected? It certainly does not match the behavior that redditor described :) and it feels a little awkward. I could not find any official documentation on how the allocator works.

I am currently reporting this as a bug, but it's entirely possible that this is Working As Intended.

2. Should I worry? Should I run `balance` periodically, and at what point will `balance` realize that it's running out of disk space on btrfs3 (for the most part I expect the block groups to be almost always full, because I hardly ever delete data from these drives)? Is there a risk that btrfs will run out of metadata space on btrfs3, or will it realize that it can still store metadata on btrfs1 & btrfs2 and skip writing to btrfs3 entirely? The kind of periodic check I have in mind is sketched below.
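
Just to illustrate what I mean by a periodic check (these are standard btrfs-progs commands; the usage thresholds are only an example I picked, not a recommendation from the docs):

    # per-device allocated vs. unallocated space, to spot a device running low
    btrfs device usage /mnt/nas
    btrfs filesystem usage -T /mnt/nas

    # occasionally repack mostly-empty block groups so every device keeps
    # some unallocated space available for new data/metadata chunks
    btrfs balance start -dusage=25 -musage=25 /mnt/nas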

I apologize if this is not the right forum for reporting this behavior and asking questions; happy to move it somewhere else.

My mount options: `auto,noatime,compress=zstd,space_cache=v2`
Kernel: `6.1.0-12-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.52-1 (2023-09-07) x86_64 GNU/Linux`
psla commented 9 months ago

It seems that btrfs basically chooses to allocate space on btrfs3 and then distributes the second copy to btrfs1 and btrfs2, which is the direct opposite of what the Reddit posts describe :)

E.g. this: https://www.reddit.com/r/btrfs/comments/xvqa6a/comment/ir4mo49/?utm_source=share&utm_medium=web2x&context=3

Time2: 
Unallocated:
   /dev/mapper/btrfs1      7.24TiB
   /dev/mapper/btrfs2      7.24TiB
   /dev/mapper/btrfs3    833.98GiB

Time3:
Unallocated:
   /dev/mapper/btrfs1      7.02TiB
   /dev/mapper/btrfs2      7.02TiB
   /dev/mapper/btrfs3    384.98GiB

It's clearly visible that while ~440GB were consumed on btrfs3, approximately 220GB were consumed on each of btrfs1 and btrfs2.
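
To make the mismatch concrete, here is a toy simulation (plain bash, purely illustrative, not kernel code) of the rule that Reddit comment describes, i.e. each new 1GiB RAID1 chunk puts its two copies on the two devices with the most unallocated space, starting from the Time2 numbers:

    # unallocated GiB at Time2: 7.24TiB, 7.24TiB, 833.98GiB
    declare -A free=( [btrfs1]=7414 [btrfs2]=7414 [btrfs3]=834 )
    for (( chunk = 0; chunk < 440; chunk++ )); do
        # pick the two devices with the most unallocated space
        top2=$(for d in "${!free[@]}"; do echo "${free[$d]} $d"; done | sort -rn | head -n 2 | awk '{print $2}')
        for d in $top2; do free[$d]=$(( free[$d] - 1 )); done
    done
    for d in btrfs1 btrfs2 btrfs3; do echo "$d: ${free[$d]} GiB unallocated"; done

Under that rule btrfs3 would not receive a single new chunk until btrfs1 and btrfs2 dropped to its level, yet in reality its unallocated space fell from ~834GiB to ~385GiB between Time2 and Time3 while btrfs1 and btrfs2 only dropped by ~220GiB each.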

So the remaining question: what will happen once those 384GB on btrfs3 are used? Well... it turned out I had more data to move, so I risked it :)

# btrfs filesystem usage  /mnt/nas
Overall:
    Device size:                  36.38TiB
    Device allocated:             23.11TiB
    Device unallocated:           13.27TiB
    Device missing:                  0.00B
    Device slack:                    0.00B
    Used:                         23.08TiB
    Free (estimated):              6.65TiB      (min: 6.65TiB)
    Free (statfs, df):             6.65TiB
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 48.00KiB)
    Multiple profiles:                  no

Data,RAID1: Size:11.51TiB, Used:11.50TiB (99.88%)
   /dev/mapper/btrfs1      6.07TiB
   /dev/mapper/btrfs2      6.07TiB
   /dev/mapper/btrfs3     10.89TiB

Metadata,RAID1: Size:40.00GiB, Used:39.56GiB (98.91%)
   /dev/mapper/btrfs1     33.00GiB
   /dev/mapper/btrfs2     32.00GiB
   /dev/mapper/btrfs3     15.00GiB

System,RAID1: Size:32.00MiB, Used:1.61MiB (5.03%)
   /dev/mapper/btrfs1     32.00MiB
   /dev/mapper/btrfs2     32.00MiB

Unallocated:
   /dev/mapper/btrfs1      6.63TiB
   /dev/mapper/btrfs2      6.63TiB
   /dev/mapper/btrfs3      7.98GiB

So btrfs3 got full, and btrfs1 and btrfs2 are now being filled. Nothing seems to have crashed.

Closing this, as it probably works as intended.

Zygo commented 9 months ago

This behavior is definitely not correct, and it looks very similar to a regression I found in kernel 6.0. It can result in some of the space being unusable (although probably not more than a few dozen GiB or so).

I posted a patch to fix it.