knorrie opened this issue 3 years ago
Oh actually, `free_(meta)data` is not the right one, since it also includes free space in existing chunk allocations, and the ENOSPC happens when the filesystem wants to force new chunk allocations. Instead, `estimated_allocatable_virtual_(meta)data` can be used, which tells us how much actual (meta)data can be stored in completely new chunk allocations that can still be done. (edited above)
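For illustration, a quick look at both numbers with python-btrfs (a minimal sketch; the `/mnt` path is a placeholder, and `pretty_size` is the helper from `btrfs.utils`):

```python
#!/usr/bin/env python3
# Minimal sketch: show why free_data is misleading for predicting chunk
# allocation ENOSPC, compared to estimated_allocatable_virtual_data.
import btrfs

fs = btrfs.FileSystem('/mnt')  # placeholder mount point
usage = btrfs.fs_usage.FsUsage(fs)

# free_data also counts free space inside already allocated data chunks,
# so it can still look healthy right before chunk allocation fails.
print("free_data:", btrfs.utils.pretty_size(usage.free_data))
# This one only counts what completely new chunk allocations can still
# store, which is the number that matters for the ENOSPC discussed here.
print("estimated_allocatable_virtual_data:",
      btrfs.utils.pretty_size(usage.estimated_allocatable_virtual_data))
```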
Things I learned from the "Century Balance" bug miner:
Some block groups are locked or otherwise immobilized in btrfs. There are several possible reasons for this; I haven't identified them all. It could be a kernel bug (there have been some of those), or there's a swap file, or there happens to be a scrub running that has locked those BGs. Any automated strategy has to be able to realize that those BGs won't move no matter how many times balance is requested, and skip over them. The CB machine would get stuck in loops where it kept trying to balance the least-used BG and never made any progress, because that BG wouldn't go away.
Sometimes balancing a random block group is better than one chosen by any deterministic algorithm. Mostly these cases arise because of the BG locking issue, but sometimes we just need a block group that has a specific mix of extent sizes to fit in currently existing free space holes, and neither the highest-vaddr BG (chosen by kernel balance) nor the least-used BG (chosen by bblu) is suitable. We could look at the extent tree and measure extent sizes and free space holes and try to find a match, but usually picking vaddrs at random works within a few attempts or not at all.
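Combining those two points, a rough sketch of the random strategy with a skip list for immobilized block groups (shelling out to the `btrfs` CLI's vrange filter; the mount point and attempt count are arbitrary placeholders):

```python
#!/usr/bin/env python3
# Rough sketch: balance randomly chosen data block groups, remembering
# vaddrs that refuse to go away so we don't loop on immobilized ones.
import random
import subprocess
import btrfs

MOUNTPOINT = '/mnt'  # placeholder

def data_vaddrs(fs):
    return {chunk.vaddr for chunk in fs.chunks()
            if chunk.type & btrfs.BLOCK_GROUP_DATA}

fs = btrfs.FileSystem(MOUNTPOINT)
stuck = set()  # block groups that survived a balance attempt

for _ in range(10):  # a few random attempts, then give up
    candidates = list(data_vaddrs(fs) - stuck)
    if not candidates:
        break
    vaddr = random.choice(candidates)
    # The vrange filter relocates exactly one block group; on success the
    # block group disappears and its data ends up at a new vaddr.
    subprocess.run(['btrfs', 'balance', 'start',
                    '-dvrange={}..{}'.format(vaddr, vaddr + 1), MOUNTPOINT])
    if vaddr in data_vaddrs(fs):
        stuck.add(vaddr)  # didn't move: scrub lock, swap file, kernel bug, ...
```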
Balance can only delete existing block groups. btrfs always puts new data on the disk with the most unallocated space (once all existing block groups are full, unless you have metadata preference patches). Either it is trying to fill the most-unallocated disk first, or it's trying to fill every disk. Frame the question as "what block groups do we need to delete to make this filesystem work?" as that is the hammer we have in our toolbox.
In most cases we want to balance block groups on the devices with the least unallocated space when those block groups do not have a chunk on the device with the most unallocated space. If we try to relocate a BG that is already present on the most unallocated disk, btrfs will just put the data somewhere else on that disk, so it's not as effective (but necessary in some cases, like striping profile BGs that are not using all available disks). This handles the common case of adding a disk to an array, or replacing a smaller disk with a larger one.
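A sketch of that selection with python-btrfs, assuming (as the DevItem fields suggest) that a device item's `bytes_used` is the space already allocated to chunks on that device:

```python
#!/usr/bin/env python3
# Sketch: list block groups that have a chunk stripe on the device with
# the least unallocated space but none on the device with the most, i.e.
# the ones balance can actually push onto the new/bigger disk.
import btrfs

fs = btrfs.FileSystem('/mnt')  # placeholder mount point

# Assumption: total_bytes - bytes_used is a device's unallocated space.
unallocated = {dev.devid: dev.total_bytes - dev.bytes_used
               for dev in fs.devices()}
most_free = max(unallocated, key=unallocated.get)
least_free = min(unallocated, key=unallocated.get)

for chunk in fs.chunks():
    devids = {stripe.devid for stripe in chunk.stripes}
    if least_free in devids and most_free not in devids:
        print("candidate for balance: block group at vaddr", chunk.vaddr)
```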
For maintenance, we run a loop:
How much metadata space do you need?
e.g. a 2-disk raid1 filesystem with 4 GB of metadata BGs and 3.5 GB of metadata used must have enough room for 8 GB of total metadata (3.5 GB used + 0.5 GB reserve = 4 GB; 4 GB * 1.25 + 2 GB for disks + 1 GB for balance = 8 GB).
So far, all metadata ENOSPC failures I've seen have occurred when used metadata space * 1.12 > (allocated + available) space. 1.25 is slightly larger; it is basically a fudge factor to estimate how many snapshot metadata pages are going to get CoWed within their lifetime.
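In code, the example above works out as follows (a hedged helper; the constants are just the numbers from the example, not tuned values):

```python
# Worked version of the arithmetic above, sizes in GiB. The 0.5 GiB global
# reserve, 1 GiB per disk and 1 GiB balance headroom are the numbers from
# the example; 1.25 is the CoW fudge factor mentioned above.
def metadata_target(meta_used, num_disks, reserve=0.5, per_disk=1.0,
                    balance_headroom=1.0, cow_factor=1.25):
    return (meta_used + reserve) * cow_factor \
        + num_disks * per_disk + balance_headroom

print(metadata_target(3.5, 2))  # -> 8.0, matching the example
```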
I like the idea of having a daemon doing regular balance across all the devices in the filesystem to even things out. Set and forget seems perfect. Possibly with email reports :)
..oO(Yes, or use the kernel trace point in the chunk allocator as a trigger to wake up and quickly look around to see if something needs to be done.)
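A sketch of that trigger (assumptions: the `btrfs_chunk_alloc` tracepoint exists on this kernel, and tracefs is mounted at the classic debugfs path; both may differ per system):

```python
#!/usr/bin/env python3
# Sketch: block on the tracing pipe and wake up whenever the chunk
# allocator fires, then go do a quick usage check.
TRACING = '/sys/kernel/debug/tracing'  # assumed tracefs location

with open(TRACING + '/events/btrfs/btrfs_chunk_alloc/enable', 'w') as f:
    f.write('1')

with open(TRACING + '/trace_pipe') as pipe:
    for line in pipe:  # blocks until the next chunk allocation happens
        print('chunk allocated, looking around:', line.strip())
        # ... run the FsUsage checks here and decide if anything needs doing
```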
So, there's `btrfs-balance-least-used`, or bblu as we might call it. The reason this example program was created was to defragment free space as efficiently and quickly as possible. I needed it to fight, or recover from, situations in which the old `-o ssd` allocator was being used.

So what's the tool still good for now? Well, users still regularly ask for something they can run periodically to prevent getting into unexpected ENOSPC situations for whatever other reason. bblu could of course be used for this, by telling it to compact stuff until all block groups are at least at some % of usage. But that would likely mean it's often doing a lot of unnecessary work.
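For reference, the least-used strategy condensed to a sketch; this is not bblu's actual code, and the `BalanceArgs`/`balance_v2` usage follows the python-btrfs ioctl documentation:

```python
# Condensed sketch of "balance the least-used data block group until
# everything left is at least min_used_pct full". Not bblu's real code.
import btrfs

def compact(path, min_used_pct=70):
    fs = btrfs.FileSystem(path)
    while True:
        usage = []
        for chunk in fs.chunks():
            if not (chunk.type & btrfs.BLOCK_GROUP_DATA):
                continue
            bg = fs.block_group(chunk.vaddr, chunk.length)
            usage.append((100 * bg.used // bg.length, chunk.vaddr))
        if not usage:
            return
        pct, vaddr = min(usage)
        if pct >= min_used_pct:
            return
        # vstart/vend select exactly this one block group. NB: a real tool
        # should remember block groups that refuse to move (see the earlier
        # comment about immobilized BGs) instead of looping forever.
        args = btrfs.ioctl.BalanceArgs(vstart=vaddr, vend=vaddr + 1)
        btrfs.ioctl.balance_v2(fs.fd, data_args=args)
```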
It would be interesting to make it a bit smarter, so that it executes the minimal amount of work necessary, with the goal of making sure there's actually usable unallocated raw disk space present. How hard can it be? Well, for example, if we have 100G of unallocated disk space, but it's on 1 disk of 2 and the target profile is RAID1... fail.
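That RAID1 check is easy to express, since every new chunk needs room on two different devices (a small helper, nothing btrfs-specific):

```python
# With RAID1, every new chunk needs room on two different devices, so one
# big empty disk alone is worthless.
def raid1_allocatable(unallocated_per_disk):
    total = sum(unallocated_per_disk)
    largest = max(unallocated_per_disk)
    # The largest disk can at most mirror what all the other disks provide.
    return min(total - largest, total // 2)

print(raid1_allocatable([100, 0]))    # -> 0: the failing example above
print(raid1_allocatable([100, 100]))  # -> 100
```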
What I'm thinking about is some fire-and-forget mode to run it in, in the title jokingly called `--kthxbye`, but maybe something like `--auto`. It should use a clear set of rules that we think need to be met.
Now, the python-btrfs library already has the fsusage module, which provides a large amount of interesting information that can be used: https://python-btrfs.readthedocs.io/en/stable/btrfs.html#btrfs.fs_usage.FsUsage The `btrfs-usage-report` tool simply displays almost everything it can tell you.

`estimated_allocatable_virtual_data` and `estimated_allocatable_virtual_metadata` tell you (even taking the current data/metadata usage ratio into account) how much more of both you can store on the fs inside new chunk allocations that can still be done. This seems like an easy one to look at and try to keep at or above some limit.

`unallocatable_reclaimable` tells us how much unallocatable space can be recovered for use because of unbalanced allocations. If needed, we can go down that path to get more unallocated space available. It would of course be more interesting to figure out what exactly needs to be done for that in a smart way (feeding stuff to balance from the disk that has the lowest unallocatable number?).

Next: but how do we figure out which block groups exactly need to be fed to balance to fix the unbalanced situation?
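Putting those numbers together, a first stab at what such a rule check could look like (the 1 GiB limits are made-up placeholders, and `/mnt` is a placeholder path):

```python
#!/usr/bin/env python3
# First stab at a rule check for an --auto mode, built on FsUsage.
import btrfs

GiB = 1024 ** 3
DATA_LIMIT = 1 * GiB       # placeholder threshold
METADATA_LIMIT = 1 * GiB   # placeholder threshold

fs = btrfs.FileSystem('/mnt')
usage = btrfs.fs_usage.FsUsage(fs)

if (usage.estimated_allocatable_virtual_data >= DATA_LIMIT
        and usage.estimated_allocatable_virtual_metadata >= METADATA_LIMIT):
    print('nothing to do, kthxbye')
elif usage.unallocatable_reclaimable > 0:
    # Unbalanced allocations: feeding the right block groups to balance
    # should reclaim this space; which ones exactly is the open question.
    print('reclaimable by balancing:',
          btrfs.utils.pretty_size(usage.unallocatable_reclaimable))
else:
    print('genuinely low on space; balance will not help here')
```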