kdave / btrfs-progs

Development of userspace BTRFS tools
GNU General Public License v2.0

`btrfs balance status` giving nonsensical information #654

Open rrueger opened 11 months ago

rrueger commented 11 months ago

Running Ubuntu 20.04.6 LTS with kernel 5.4.0-153-generic

# btrfs balance status
35 out of about 101 chunks balanced (466 considered),  65% left

What does it mean, to have considered 466 chunks out of 101?

Zygo commented 11 months ago

466 chunks were considered by the balance filter so far (out of the total number of chunks on the filesystem, which was not printed). 101 chunks matched the filter criteria. 35 of the matching chunks have been relocated. 65% of the 101 chunks are left to be relocated (100 - (35 * 100 / 101) ≈ 65).
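In shell arithmetic, the progress calculation is roughly this (my reconstruction of what the tool prints, not necessarily the exact expression btrfs-progs uses):

```shell
# Reconstruction of the status-line arithmetic; integer division throughout.
balanced=35    # chunks relocated so far
matched=101    # chunks matching the filter ("about 101")
left=$(( (matched - balanced) * 100 / matched ))
echo "${left}% left"   # prints "65% left"
```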

It's kind of strange to report "considered" at all, other than to maintain compatibility with the output of previous releases. At best it's not a particularly useful thing to know, and the vaddr filter gives it values that aren't useful at all ("number of chunks that are above the high end of the vaddr range").

rrueger commented 11 months ago

Thank you. I have a follow-up question

The command that I executed was

btrfs balance start --full-balance -fv --bg -mconvert=raid10 /btrfs

I believe that this balance then should only operate on metadata chunks?

My drive usage looks like this

Overall:
    Device size:          19.10TiB
    Device allocated:         13.96TiB
    Device unallocated:        5.14TiB
    Device missing:          0.00B
    Used:             13.69TiB
    Free (estimated):          2.70TiB  (min: 2.70TiB)
    Data ratio:               2.00
    Metadata ratio:           2.00
    Global reserve:      512.00MiB  (used: 6.80MiB)

             Data    Metadata  System               
Id Path      RAID10  RAID10    RAID10    Unallocated
-- --------- ------- --------- --------- -----------
 1 /dev/sdd1 3.44TiB  45.75GiB         -   158.24GiB
 2 /dev/sdi1 3.44TiB  45.25GiB  32.00MiB   158.71GiB
 3 /dev/sde1 3.44TiB  45.75GiB  32.00MiB   158.21GiB
 4 /dev/sdg1 3.44TiB  45.25GiB  32.00MiB   158.71GiB
 5 /dev/sdf1       -  25.00GiB  32.00MiB     4.52TiB
-- --------- ------- --------- --------- -----------
   Total     6.88TiB 103.50GiB  64.00MiB     5.14TiB
   Used      6.75TiB  96.52GiB 896.00KiB            

According to the wiki, a metadata chunk is 1GiB (for filesystems over 50GiB).

So, if I have understood correctly, there should be only ~103 metadata chunks in total? This appears to match the "about 101" in the status output.

The balance operation has been running for 43 hours, and now the status is

47 out of about 101 chunks balanced (639 considered),  53% left

If it really has operated on 639 chunks, this means it has taken ~4 minutes per chunk. This seems reasonable? If it has only relocated 47 chunks, it would mean almost an hour per chunk. This seems extremely slow?
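Spelling out that arithmetic with the numbers from the status line:

```shell
# 43 hours of balancing so far, against the two counters in the status line
minutes=$(( 43 * 60 ))
per_considered=$(( minutes / 639 ))   # minutes per "considered" chunk
per_relocated=$(( minutes / 47 ))     # minutes per relocated chunk
echo "$per_considered min per considered chunk"   # 4
echo "$per_relocated min per relocated chunk"     # 54
```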

The disks are usually fairly speedy (according to iozone). There is no other IO going on. glances and iotop report that R/W speeds are sitting at around 2MB/s (per disk), which again seems very slow.

To summarise: is there anything I can do about this?

After the metadata balance, I will be performing a data balance.

When I started (and then cancelled) a data balance, btrfs reported that there were ~3700 chunks (even though there are 3.44*2 = 6.88 TiB of data, which would mean 6880 chunks at 1GiB?).

If I were to extrapolate the times from the metadata balance, this means it would take ~5 months to balance the data?

Zygo commented 11 months ago

I believe that this balance then should only operate on metadata chunks?

Yes (including the system chunk, which is a special kind of metadata).

According to the wiki, a metadata chunk is 1GiB (for filesystems over 50GiB).

Most chunks will be that size if space is available; however, there can be different metadata chunk sizes (note the fractional GiB in your btrfs fi usage output above).

If it has only relocated 47 chunks, it would mean almost an hour per chunk. This seems extremely slow?

Metadata is very slow to balance. Each extent in metadata is 16KiB, so a full chunk is 65536 extents, which will take a significant portion of an hour on spinning drives. If you have a very slow CPU that can be a factor too.
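The extent count per full chunk follows directly from those sizes:

```shell
# A 1 GiB metadata chunk divided into 16 KiB tree blocks
chunk_bytes=$(( 1024 * 1024 * 1024 ))
extent_bytes=$(( 16 * 1024 ))
extents=$(( chunk_bytes / extent_bytes ))
echo "$extents extents per chunk"   # 65536
```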

The raid10 profile for metadata is likely not helping. raid0 striping helps with large reads that can be distributed over multiple drives in parallel; however, metadata reads are small, random, and not parallelizable. A walk down the tree could face a separate seek cost from every drive in the filesystem, as the striping would spread logically adjacent metadata pages across multiple drives, and the tree walk has to wait for each drive sequentially because it can't predict the locations of child pages until after it has read the parent page. In raid1 profile these reads would mostly be issued to the same device, and likely satisfied from the drive's buffers or with a relatively low seek cost.

60 minutes per chunk is 2-3x the numbers I get, which would be consistent with the difference between raid10 vs raid1 metadata.

Is there anything I can do about this?

The easiest fix for metadata balance performance is to never balance metadata. Metadata balances are only useful in specific cases, and have no beneficial effect (and some undesirable effects) otherwise. The general rule is "never balance metadata except when doing a raid profile conversion", but here you are doing a raid conversion (4-device raid10 to 5-device raid10) so it's possible for the metadata balance to have a beneficial effect.

You could balance the data only, leaving the existing metadata where it is (i.e. unbalanced). The data balance would ensure equal amounts of unallocated space on each device are available for future metadata allocations. The existing metadata would be unequally distributed, but still distributed over 4 of the 5 drives. Any change--better or worse--from redistributing the existing metadata over the 5th device would be small, so by not balancing the metadata you lose a small potential benefit and also avoid a small potential risk.

Balancing metadata first after adding a new device will place one stripe from every chunk of metadata on the new device, creating a hot spot for all metadata. I don't know of a way to fix that without first balancing all of the data, then running metadata and data balances interleaved together (i.e. balance one metadata chunk, then one data chunk to fill in the hole left by the metadata, repeat) until the metadata is equally distributed.

The best approach for this filesystem would have been to run a full balance (i.e. data and metadata together in original allocation order) but a full balance now would not produce the desired effect because all the metadata has been moved to the "front" of balance order.

If I were to extrapolate the times from the metadata balance

Data balance times vary widely. Extrapolating from metadata won't produce a valid result. Even extrapolating from a subset of data blocks won't necessarily produce a valid result. There can be 5 orders of magnitude between the fastest and slowest data chunk even within a single filesystem.

Balance time is proportional to the number of extent references (the data itself, plus reflinks created by copies, dedupe, and snapshots) in the chunk. Data extents range in size from 4KiB to 128MiB. If a data chunk has a few large extents with one reference each, it may be balanced in under 10 seconds. If a chunk has hundreds of thousands of small extents with thousands of references to each one, that chunk may require multiple days to balance.

If you have large and sequentially written files, and no snapshots, data chunks will be relatively fast (no more than a minute each). If you have many small files or your workload does many small random writes to larger files, and you have many snapshots of these files, data chunks will be relatively slow (an hour or more for each chunk).

this means it would take ~5 months to balance the data?

6 TiB of data on spinning drives should take about 2-3 days if you don't have massive data fragmentation and you are not balancing the metadata.

rrueger commented 11 months ago

Thank you for taking the time to explain this in such detail!

The best approach for this filesystem would have been to run a full balance...

The exact timeline of events was as follows

  1. Disks were running out of space (50GiB left per disk = 200GiB total) (4x4TB drives in RAID10, data and metadata)
  2. I added a 5th disk (5TB drive)
  3. I had to write some 250 GiB to the array quickly, could not wait days for balance
  4. Wrote my 250 GiB

The disks looked something like this (made up numbers)

 1 /dev/sdd1 3.44TiB  45.75GiB         -   158.24GiB
 2 /dev/sdi1 3.44TiB  45.25GiB  32.00MiB   158.71GiB
 3 /dev/sde1 3.44TiB  45.75GiB  32.00MiB   158.21GiB
 4 /dev/sdg1 3.44TiB  45.25GiB  32.00MiB   158.71GiB
 5 /dev/sdf1  120GiB   5.00GiB  32.00MiB     4.52TiB
  5. Then I began a regular --full-balance and it moved all the data from disk 5 onto the first four, leaving the data of the 5th disk "empty" as seen in the fi usage from further up in the thread.

  6. Since this seemed to be working in the "wrong direction" (extremely slowly), I cancelled this balance.

  7. Then I began the metadata balance from the beginning of this thread, to see if btrfs would even use the 5th disk.

Now we are at the beginning of the thread

...but a full balance now would not produce the desired effect because all the metadata has been moved to the "front" of balance order.

Do you mind explaining what this means in a little more detail? Can this still be "fixed"?

If a chunk has hundreds of thousands of small extents with thousands of references to each one, that data chunk may require multiple days to balance.

We have a winner

# btrfs subvolume list -r /btrfs | wc -l
14381

I naively (wrongly) convinced myself that btrfs balance somehow works directly on the stuff on disk, and doesn't care about snapshots etc --- essentially, literally redistributing the chunks without "looking inside them"

So my plan of action should now be:

  1. Cancel the metadata balance
  2. Delete 95% of the snapshots
  3. btrfs balance ???

What "type" of balance should I do in step 3?

Thank you again for taking so much time with this issue

It would be great if some of these things were described in the documentation in more detail. When I was reading online, there was a lot of conflicting information (I got the idea to run only the metadata balance first from some forum)

Especially the warning about many snapshots

rrueger commented 11 months ago

Update: The average time required per block considered in the metadata balance has dropped from 12m to 3m over the last 24 hours

Zygo commented 11 months ago

The average time required per block considered in the metadata balance has dropped from 12m to 3m over the last 24 hours

Considered doesn't really mean anything. Most of the considered number is chunks excluded from the balance filter, so it won't tell you much about each chunk that is actually balanced. Divide the total running time by the number of chunks completed to get the time per chunk.

I naively (wrongly) convinced myself that btrfs balance somehow works directly on the stuff on disk, and doesn't care about snapshots etc --- essentially, literally redistributing the chunks without "looking inside them"

That's the obvious optimization that somebody should get around to implementing one day. :-/

Do you mind explaining what this means in a little more detail? Can this still be "fixed"?

Something like:

btrfs balance start --full-balance /btrfs
# In dmesg, make note of the first "relocating block group NNN" message from this balance
# Put the NNN number in $vaddr below
# Cancel the balance once you have this number
for x in $(seq 0 150); do
     btrfs balance start -dlimit=68,vaddr=1..$vaddr /btrfs;
     btrfs balance start -mlimit=1,vaddr=1..$vaddr /btrfs;
done

The vaddr range prevents any repetition of work. Balances start with the highest existing vaddr and work downward, so running a balance for a single chunk gives the highest vaddr that currently exists in the filesystem. New chunks always have higher vaddr numbers, so the vaddr range filter is saying "only balance chunks that are older than the newest chunk that existed when I started this balance."

The alternating between d and m balances places one metadata chunk between every 68 data chunks, roughly replicating the original allocation layout.

The loop does 150 iterations which should be more than enough for all the data--as the end of the loop approaches, the balances will relocate 0 chunks, and you can stop looping when both balances relocate 0 chunks.
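A small helper can detect that stopping condition by parsing the balance summary. This is a sketch: relocated_count is a made-up name, and it assumes the usual "had to relocate N out of M chunks" wording that btrfs balance prints on completion:

```shell
# Extract the relocated-chunk count from a captured balance summary line.
# relocated_count is a hypothetical helper; the message format is assumed
# to be the usual "Done, had to relocate N out of M chunks" wording.
relocated_count() {
    sed -n 's/.*had to relocate \([0-9]*\) out of.*/\1/p'
}

# Example: stop looping once this reaches 0 for both balances.
n=$(echo "Done, had to relocate 0 out of 3 chunks" | relocated_count)
echo "$n"
```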

If you want to convert metadata to raid1 at the same time, change the -m line above to

btrfs balance start -mconvert=raid1,limit=1,vaddr=1..$vaddr /btrfs;
rrueger commented 11 months ago

Brilliant, thank you. That makes a lot of sense

The alternating between d and m balances places one metadata chunk between every 68 data chunks, roughly replicating the original allocation layout.

The logic presumably being 68 = 6.8 TiB Data / 100GiB Metadata?

The loop does 150 iterations which should be more than enough for all the data

For the same reason, and because 150 > 101 (expected number of metadata chunks)?
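Sanity-checking that ratio with the numbers from the fi usage output above:

```shell
# ~6880 GiB of data at ~1 GiB per data chunk, vs ~101 metadata chunks
ratio=$(( 6880 / 101 ))
echo "$ratio data chunks per metadata chunk"   # 68
```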

(From a previous message)

Balancing metadata first after adding a new device will place one stripe from every chunk of metadata on the new device, creating a hot spot for all metadata.

Am I understanding this correctly: Every chunk is cut into 5 stripes (RAID0 part of RAID10), and each stripe is written to 2 different disks (RAID1 part of RAID10). Usually, the disks are roughly equally full, so each pair of stripes is "randomly" written to any two (pairwise different) disks. However, in this case, disk5 is empty, so at least one half of each pair of stripes is written to disk 5. This would mean that disk5 experiences a lot more reads (and writes?) than the rest. Essentially pooling half the metadata chunks on disk5 instead of evenly distributing them.

One final question: at the end of this balance, I expect every disk to be filled proportionally to its size?

Zygo commented 11 months ago

Each chunk is composed of 2 stripes, each of which is written to 2 different drives. The 5th device is unused because raid10 can use only an even number of devices (each stripe must have 2 copies, and there cannot be a fractional number of stripes in a chunk). Each chunk is allocated on the 4 devices with the most unallocated space. As the devices fill up, the devices used for the Nth chunk have less unallocated space available for the N+1st chunk, so the allocator will rotate through the drives, leaving a different drive unused for each chunk. The unused drive in each chunk is chosen by the allocator to keep the amount of unallocated space equal on each drive (or move 1 GiB closer to equal with each allocated chunk). Since the 5th device is larger than the others in this case, it will fill up faster than the other 4 devices at first, and part of every chunk will be on the 5th device until equilibrium is reached.
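The device-selection rule amounts to sorting by unallocated space. A minimal illustrative sketch (made-up per-device numbers, not the real allocator code, which picks the top 4 devices per raid10 chunk):

```shell
# Pick the device with the most unallocated space first (illustration only).
# Columns: device name, unallocated GiB (made-up values for a 5-drive array).
top=$(printf '%s\n' "sdd 158" "sdi 158" "sde 158" "sdg 158" "sdf 4628" |
    sort -k2,2 -rn | head -1)
echo "$top"   # sdf 4628 -- the new, emptier drive is preferred
```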

If there is an even number of devices, raid10 allocates on every device at the same rate. As the smaller devices fill up, allocations stop on the filled devices and continue on the devices with remaining unallocated space, until all space is allocated.

If there is an odd number of device sizes (e.g. 5 drives, each a different size) then raid10 uses a mix of the above two algorithms until the maximum amount of space is used (which may not be all of the space in some cases).

The final layout should be something like https://carfax.org.uk/btrfs-usage/?c=2&slo=1&shi=100&p=0&dg=1&d=4520&d=3600&d=3600&d=3600&d=3600

rrueger commented 11 months ago

Apparently vaddr is not recognised by balance. Is this a new feature I need to wait for, or perhaps a typo? The documentation makes no reference to vaddr. vrange and drange exist and look kind of similar. I can also supply a range to limit, but I am not sure this will have the same effect.

Zygo commented 11 months ago

Sorry, vrange is the correct filter name, not vaddr.

rrueger commented 11 months ago

Perfect. Thank you!

rrueger commented 11 months ago

Since this balance operation is working in a loop, and will likely take a week or two, is it sensible/advisable to pause the balance between iterations, use the disk like normal, and then resume the balance afterwards? Will this mess with the balance?

i.e.

#!/bin/bash

vaddr=NNN

for x in $(seq 0 150); do
  c=0
  while [[ -r $HOME/balance-pause ]]
  do
    echo -en "\rDetected '$HOME/balance-pause'. Sleeping... $c"
    sleep 1
    ((c++))
  done

  sudo btrfs balance start -dlimit=68,vrange=1..$vaddr /btrfs
  sudo btrfs balance start -mconvert=raid1,limit=1,vrange=1..$vaddr /btrfs
done

Then when I want to write some data, I touch ~/balance-pause and wait for the balance script to start pausing. Now I write some data. Then I rm ~/balance-pause and allow the balance to continue.

Zygo commented 11 months ago

With the vrange filter, the balance will never repeat work (*), so you can stop and start the script at any time. The balances will become no-ops when the vrange is empty.

When adding or upgrading devices in a filesystem, I often arrange for the balance to run a few hours per night instead of all at once. It takes a few months to make all the new space available instead of a few days, but the system is much more usable in the meantime. We have data sets that grow at slow and predictable rates, so we can translate that requirement into a certain number of free GiB available each day, and run balance just long enough to reach that number overnight.

One gotcha: df will report free space inaccurately until the entire balance is completed, with an error value proportional to the amount of balance work remaining. You should monitor btrfs fi usage -T output, and make sure none of the disks are allowed to run out of unallocated space too early. If that happens, stop adding new data to the filesystem and let the balance run until unallocated space has been restored. Once the entire balance script is finished, df should be accurate again.

(*) (Note for anyone reading this out of context: the vrange limit is required for this to work, or a similar no-repeat filter combination like convert with soft. Balance with default options does not resume properly--it's a bug as old as the balance resume feature itself)

rrueger commented 11 months ago

Okay, the script has now completed and my usage looks like this

Overall:
    Device size:                       19561.46GiB
    Device allocated:                  14904.06GiB
    Device unallocated:                 4657.39GiB
    Device missing:                        0.00GiB
    Used:                              14843.47GiB
    Free (estimated):                   2328.99GiB      (min: 2328.99GiB)
    Data ratio:                               2.00
    Metadata ratio:                           2.00
    Global reserve:                        0.50GiB      (used: 0.00GiB)

             Data       Metadata Metadata System
Id Path      RAID10     RAID1    RAID10   RAID1   Unallocated
-- --------- ---------- -------- -------- ------- -----------
 1 /dev/sdg1 2768.00GiB 24.00GiB  2.00GiB       -   931.99GiB
 2 /dev/sdf1 2771.00GiB 22.00GiB  2.00GiB       -   930.99GiB
 3 /dev/sdh1 2771.00GiB 21.00GiB  2.00GiB 0.03GiB   931.96GiB
 4 /dev/sdd1 2773.00GiB 20.00GiB  1.50GiB       -   931.49GiB
 5 /dev/sde1 3677.00GiB 47.00GiB  2.50GiB 0.03GiB   930.96GiB
-- --------- ---------- -------- -------- ------- -----------
   Total     7380.00GiB 67.00GiB  5.00GiB 0.03GiB  4657.39GiB
   Used      7379.70GiB 41.54GiB  0.49GiB 0.00GiB

I have three questions:

  1. I presume I can convert the remaining RAID10 metadata to RAID1
btrfs balance start -mconvert=raid1,profile=raid10 /btrfs

I think the RAID10 metadata left over is from data written after the balancing was started and therefore not touched because of the vrange=1..$vaddr filter.

  2. How do I prevent future metadata being written in the RAID10 profile?

  3. It looks like I still have a metadata hotspot on disk 5. Is this a problem? I had to cancel the balance operation a few times to reboot the machine. This meant that some of the metadata balances were skipped, and a metadata chunk was only balanced after maybe 100 data chunks.

Zygo commented 10 months ago
  1. btrfs balance start -mconvert=raid1,profile=raid10 /btrfs

Better: btrfs balance start -mconvert=raid1,soft /btrfs

I think the RAID10 metadata left over is from data written after the balancing was started and therefore not touched because of the vrange=1..$vaddr filter.

That's a reasonable explanation.

  2. How do I prevent future metadata being written in the RAID10 profile?

Remove all RAID10 profile metadata by converting it to another profile. This can be done by simply repeating the above command until the RAID10 profile metadata goes away, which should happen on the first iteration.

btrfs uses a hardcoded list of preferred profiles when multiple profiles are present, and there is no converting balance in progress. This list will choose RAID10 over RAID1 with no compelling rationale.

  3. It looks like I still have a metadata hotspot on disk 5. Is this a problem?

There's always going to be more data on disk 5 because it's larger than the others and the profiles you're using will prefer to allocate on devices with more space. There's 87 GiB on other devices and 47 GiB on disk 5, so a little under half the metadata has both of its mirrors on disks 1-4, and the rest has one mirror on disk 5. This ratio will probably shift a little as the rest of the free space fills up.

You could run iostat in polling mode and see if the %utilization of any of the devices is significantly higher than the others (particularly disk 5) but the results can be difficult to interpret. Normally, hot spots will shift from device to device over time as different block groups are used for new metadata. This measurement can also be confounded by different drive models simply having different performance (i.e. it might tell you only that the new device is slower or faster than the old ones).