knorrie opened this issue 3 years ago
Oh actually, `free_(meta)data` is not the right one, since it also includes free space in existing chunk allocations, and the ENOSPC happens when the filesystem wants to force new chunk allocations. Instead, `estimated_allocatable_virtual_(meta)data` can be used, which tells us how much actual (meta)data can be stored in completely new chunk allocations that can still be done. (edited above)
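For illustration, a quick look at both numbers with python-btrfs (a minimal sketch; the `/mnt` path is a placeholder, and `pretty_size` is the helper from `btrfs.utils`):

```python
#!/usr/bin/env python3
# Minimal sketch: show why free_data is misleading for predicting chunk
# allocation ENOSPC, compared to estimated_allocatable_virtual_data.
import btrfs

fs = btrfs.FileSystem('/mnt')  # placeholder mount point
usage = btrfs.fs_usage.FsUsage(fs)

# free_data also counts free space inside already allocated data chunks,
# so it can still look healthy right before chunk allocation fails.
print("free_data:", btrfs.utils.pretty_size(usage.free_data))
# This one only counts what completely new chunk allocations can still
# store, which is the number that matters for the ENOSPC discussed here.
print("estimated_allocatable_virtual_data:",
      btrfs.utils.pretty_size(usage.estimated_allocatable_virtual_data))
```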
Things I learned from the "Century Balance" bug miner:
Some block groups are locked or otherwise immobilized in btrfs. There are several possible reasons for this; I haven't identified them all. It could be a kernel bug (there have been some of those), or there's a swap file, or there happens to be a scrub running that has locked those BGs. Any automated strategy has to be able to realize that those BGs won't move no matter how many times balance is requested, and skip over them. The CB machine would get stuck in loops where it kept trying to balance the least-used BG and never made any progress, because that BG wouldn't go away.
Sometimes balancing a random block group is better than one chosen by any deterministic algorithm. Mostly these cases arise because of the BG locking issue, but sometimes we just need a block group that has a specific mix of extent sizes to fit in currently existing free space holes, and neither the highest-vaddr BG (chosen by kernel balance) nor the least-used BG (chosen by bblu) is suitable. We could look at the extent tree and measure extent sizes and free space holes and try to find a match, but usually picking vaddrs at random works within a few attempts or not at all.
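Combining those two points, a rough sketch of the random strategy with a skip list for immobilized block groups (shelling out to the `btrfs` CLI's vrange filter; the mount point and attempt count are arbitrary placeholders):

```python
#!/usr/bin/env python3
# Rough sketch: balance randomly chosen data block groups, remembering
# vaddrs that refuse to go away so we don't loop on immobilized ones.
import random
import subprocess
import btrfs

MOUNTPOINT = '/mnt'  # placeholder

def data_vaddrs(fs):
    return {chunk.vaddr for chunk in fs.chunks()
            if chunk.type & btrfs.BLOCK_GROUP_DATA}

fs = btrfs.FileSystem(MOUNTPOINT)
stuck = set()  # block groups that survived a balance attempt

for _ in range(10):  # a few random attempts, then give up
    candidates = list(data_vaddrs(fs) - stuck)
    if not candidates:
        break
    vaddr = random.choice(candidates)
    # The vrange filter relocates exactly one block group; on success the
    # block group disappears and its data ends up at a new vaddr.
    subprocess.run(['btrfs', 'balance', 'start',
                    '-dvrange={}..{}'.format(vaddr, vaddr + 1), MOUNTPOINT])
    if vaddr in data_vaddrs(fs):
        stuck.add(vaddr)  # didn't move: scrub lock, swap file, kernel bug, ...
```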
Balance can only delete existing block groups. btrfs always puts new data on the disk with the most unallocated space (once all existing block groups are full, unless you have metadata preference patches). Either it is trying to fill the most-unallocated disk first, or it's trying to fill every disk. Frame the question as "what block groups do we need to delete to make this filesystem work?" as that is the hammer we have in our toolbox.
In most cases we want to balance block groups on the devices with the least unallocated space when those block groups do not have a chunk on the device with the most unallocated space. If we try to relocate a BG that is already present on the most unallocated disk, btrfs will just put the data somewhere else on that disk, so it's not as effective (but necessary in some cases, like striping profile BGs that are not using all available disks). This handles the common case of adding a disk to an array, or replacing a smaller disk with a larger one.
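A sketch of that selection with python-btrfs, assuming (as the DevItem fields suggest) that a device item's `bytes_used` is the space already allocated to chunks on that device:

```python
#!/usr/bin/env python3
# Sketch: list block groups that have a chunk stripe on the device with
# the least unallocated space but none on the device with the most, i.e.
# the ones balance can actually push onto the new/bigger disk.
import btrfs

fs = btrfs.FileSystem('/mnt')  # placeholder mount point

# Assumption: total_bytes - bytes_used is a device's unallocated space.
unallocated = {dev.devid: dev.total_bytes - dev.bytes_used
               for dev in fs.devices()}
most_free = max(unallocated, key=unallocated.get)
least_free = min(unallocated, key=unallocated.get)

for chunk in fs.chunks():
    devids = {stripe.devid for stripe in chunk.stripes}
    if least_free in devids and most_free not in devids:
        print("candidate for balance: block group at vaddr", chunk.vaddr)
```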
For maintenance, we run a loop:
How much metadata space do you need?
e.g. a 2-disk raid1 filesystem with 4 GB of metadata BGs and 3.5 GB of metadata used must have enough room for 8 GB of total metadata (3.5 GB used + 0.5 GB reserve = 4 GB; 4 GB * 1.25 + 2 GB for disks + 1 GB for balance = 8 GB).
So far, all metadata ENOSPC failures I've seen have occurred when used metadata space * 1.12 > (allocated + available) space. 1.25 is slightly larger; it is basically a fudge factor to estimate how many snapshot metadata pages are going to get CoWed within their lifetime.
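In code, the example above works out as follows (a hedged helper; the constants are just the numbers from the example, not tuned values):

```python
# Worked version of the arithmetic above, sizes in GiB. The 0.5 GiB global
# reserve, 1 GiB per disk and 1 GiB balance headroom are the numbers from
# the example; 1.25 is the CoW fudge factor mentioned above.
def metadata_target(meta_used, num_disks, reserve=0.5, per_disk=1.0,
                    balance_headroom=1.0, cow_factor=1.25):
    return (meta_used + reserve) * cow_factor \
        + num_disks * per_disk + balance_headroom

print(metadata_target(3.5, 2))  # -> 8.0, matching the example
```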
I like the idea of having a daemon doing regular balance across all the devices in the filesystem to even things out. Set and forget seems perfect. Possibly with email reports :)
..oO(Yes, or use the kernel trace point in the chunk allocator as a trigger to wake up and quickly look around to see if something needs to be done.)
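A sketch of that trigger (assumptions: the `btrfs_chunk_alloc` tracepoint exists on this kernel, and tracefs is mounted at the classic debugfs path; both may differ per system):

```python
#!/usr/bin/env python3
# Sketch: block on the tracing pipe and wake up whenever the chunk
# allocator fires, then go do a quick usage check.
TRACING = '/sys/kernel/debug/tracing'  # assumed tracefs location

with open(TRACING + '/events/btrfs/btrfs_chunk_alloc/enable', 'w') as f:
    f.write('1')

with open(TRACING + '/trace_pipe') as pipe:
    for line in pipe:  # blocks until the next chunk allocation happens
        print('chunk allocated, looking around:', line.strip())
        # ... run the FsUsage checks here and decide if anything needs doing
```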
So, there's `btrfs-balance-least-used`, or bblu as we might call it. The reason this example program was created was to defragment free space as efficiently and quickly as possible. I needed it to fight, or recover from, situations in which the old `-o ssd` allocator was being used.

So what's the tool still good for now? Well, users still regularly ask for something they can run periodically to prevent getting into unexpected ENOSPC situations for whatever other reason. bblu could of course be used for this, by telling it to compact stuff until all block groups are at least at some % of usage. But that would likely mean it's often doing a lot of unnecessary work.
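For reference, the least-used strategy condensed to a sketch; this is not bblu's actual code, and the `BalanceArgs`/`balance_v2` usage follows the python-btrfs ioctl documentation:

```python
# Condensed sketch of "balance the least-used data block group until
# everything left is at least min_used_pct full". Not bblu's real code.
import btrfs

def compact(path, min_used_pct=70):
    fs = btrfs.FileSystem(path)
    while True:
        usage = []
        for chunk in fs.chunks():
            if not (chunk.type & btrfs.BLOCK_GROUP_DATA):
                continue
            bg = fs.block_group(chunk.vaddr, chunk.length)
            usage.append((100 * bg.used // bg.length, chunk.vaddr))
        if not usage:
            return
        pct, vaddr = min(usage)
        if pct >= min_used_pct:
            return
        # vstart/vend select exactly this one block group. NB: a real tool
        # should remember block groups that refuse to move (see the earlier
        # comment about immobilized BGs) instead of looping forever.
        args = btrfs.ioctl.BalanceArgs(vstart=vaddr, vend=vaddr + 1)
        btrfs.ioctl.balance_v2(fs.fd, data_args=args)
```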
It would be interesting to make it a bit smarter, so that it executes the minimal amount of work necessary, with the goal of making sure there's actually usable unallocated raw disk space present. How hard can it be? Well, for example, if we have 100G of unallocated disk space, but it's on 1 disk of 2 and the target profile is RAID1... fail.
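That RAID1 check is easy to express, since every new chunk needs room on two different devices (a small helper, nothing btrfs-specific):

```python
# With RAID1, every new chunk needs room on two different devices, so one
# big empty disk alone is worthless.
def raid1_allocatable(unallocated_per_disk):
    total = sum(unallocated_per_disk)
    largest = max(unallocated_per_disk)
    # The largest disk can at most mirror what all the other disks provide.
    return min(total - largest, total // 2)

print(raid1_allocatable([100, 0]))    # -> 0: the failing example above
print(raid1_allocatable([100, 100]))  # -> 100
```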
What I'm thinking about is some fire-and-forget mode to run it in, in the title jokingly called `--kthxbye`, but maybe something like `--auto`. It should use a clear set of rules that we think need to be met.
Now, the python-btrfs library already has the fsusage module, which provides a large amount of interesting information that can be used: https://python-btrfs.readthedocs.io/en/stable/btrfs.html#btrfs.fs_usage.FsUsage The `btrfs-usage-report` tool simply displays almost everything it can tell you.

`estimated_allocatable_virtual_data` and `estimated_allocatable_virtual_metadata` tell you (even taking the current data/metadata usage ratio into account) how much more of both you can store on the fs inside new chunk allocations that can still be done. This seems like an easy one to look at and try to keep at or above some limit.

`unallocatable_reclaimable` tells us how much unallocatable space can be recovered for use because of unbalanced allocations. If needed, we can go down that path to get more unallocated space available. It would of course be more interesting to figure out what exactly needs to be done for that in a smart way (feeding stuff to balance from the disk that has the lowest unallocatable number?).

Next: but how do we figure out which block groups exactly need to be fed to balance to fix the unbalanced situation?
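Putting those numbers together, a first stab at what such a rule check could look like (the 1 GiB limits are made-up placeholders, and `/mnt` is a placeholder path):

```python
#!/usr/bin/env python3
# First stab at a rule check for an --auto mode, built on FsUsage.
import btrfs

GiB = 1024 ** 3
DATA_LIMIT = 1 * GiB       # placeholder threshold
METADATA_LIMIT = 1 * GiB   # placeholder threshold

fs = btrfs.FileSystem('/mnt')
usage = btrfs.fs_usage.FsUsage(fs)

if (usage.estimated_allocatable_virtual_data >= DATA_LIMIT
        and usage.estimated_allocatable_virtual_metadata >= METADATA_LIMIT):
    print('nothing to do, kthxbye')
elif usage.unallocatable_reclaimable > 0:
    # Unbalanced allocations: feeding the right block groups to balance
    # should reclaim this space; which ones exactly is the open question.
    print('reclaimable by balancing:',
          btrfs.utils.pretty_size(usage.unallocatable_reclaimable))
else:
    print('genuinely low on space; balance will not help here')
```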