kdave / btrfs-progs

Development of userspace BTRFS tools
GNU General Public License v2.0

Ability for Dynamic Storage Tiering - NVMe (superfast) + SSD (mid-tier) + HDD (slow) - manipulate 'btrfs balance' profiles #610

Open TheLinuxGuy opened 1 year ago

TheLinuxGuy commented 1 year ago

Could btrfs implement a feature to support multiple devices of different speeds/types with a profiling algorithm for data balancing? In other words, dynamic storage tiering.

Assume a user with a combined btrfs filesystem with:

- NVMe (superfast) - tier 1
- SSD (mid-tier) - tier 2
- HDD (slow) - tier 3

To keep things simple, assume no redundancy in each tier. The goal is maximum performance, with data placement optimized as far as possible within some customizable settings (e.g. how much NVMe space should be left "free" for writeback caching of new I/O).

As far as I understand, btrfs balance already does some filesystem optimization by spreading disk space utilization evenly across each disk. This feature request asks for more options to change how btrfs balance works and how new I/O writes are handled, so that 'tier 1' is always the priority.

The least used data blocks would be "downgraded", i.e. moved down to a lower tier, if the user hasn't accessed them recently and filesystem usage grows to the point where some purging/rebalancing is needed.
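To make the idea concrete, here is a minimal userspace sketch of such a demotion policy. Everything in it is hypothetical (the struct, tier numbering and thresholds are not btrfs structures): when the fast tier drops below a configurable free-space target, the least recently used block groups are relocated to the next tier down.

```c
/* Hypothetical sketch only: none of these types or thresholds exist in btrfs. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct bg_info {            /* hypothetical per-block-group record */
    uint64_t start;         /* logical address */
    int      tier;          /* 1 = NVMe, 2 = SSD, 3 = HDD */
    time_t   last_access;   /* last read/write seen for this block group */
};

static int by_oldest_access(const void *a, const void *b)
{
    const struct bg_info *x = a, *y = b;
    return (x->last_access > y->last_access) - (x->last_access < y->last_access);
}

/* Queue tier-1 block groups for demotion until projected free space
 * reaches the user's target (e.g. "keep 20% of NVMe free for new writes"). */
static void plan_demotion(struct bg_info *bgs, size_t n, uint64_t bg_size,
                          uint64_t tier1_free, uint64_t tier1_target_free)
{
    qsort(bgs, n, sizeof(*bgs), by_oldest_access);
    for (size_t i = 0; i < n && tier1_free < tier1_target_free; i++) {
        if (bgs[i].tier != 1)
            continue;
        printf("demote block group at %llu to tier 2\n",
               (unsigned long long)bgs[i].start);
        bgs[i].tier = 2;
        tier1_free += bg_size;
    }
}

int main(void)
{
    struct bg_info bgs[] = {
        { .start = 1ULL << 30, .tier = 1, .last_access = 1700000000 },
        { .start = 2ULL << 30, .tier = 1, .last_access = 1600000000 },
        { .start = 3ULL << 30, .tier = 2, .last_access = 1500000000 },
    };
    /* 1 GiB block groups, 10 GiB free on tier 1, want 12 GiB free. */
    plan_demotion(bgs, 3, 1ULL << 30, 10ULL << 30, 12ULL << 30);
    return 0;
}
```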

TheLinuxGuy commented 1 year ago

Also, from my research it seems that Netgear may have forked btrfs already to achieve this; they implemented their own algorithm for storage tiering in their now defunct ReadyNAS OS.

See page 10 of https://www.downloads.netgear.com/files/GDC/READYNAS-100/ReadyNAS_FlexRAID_Optimization_Guide.pdf and https://unix.stackexchange.com/questions/623460/tiered-storage-with-btrfs-how-is-it-done?answertab=modifieddesc#tab-top

kdave commented 1 year ago

I was not aware of that, thanks for the links. It seems that ReadyNAS is not maintained, and I can't find any git repositories, assuming it's built on top of Linux. Their page also does not mention 'btrfs' anywhere. Storage tiers are a feature people ask for, so it's no surprise that somebody implemented them outside of Linux, but merging that back would be desirable. I haven't seen the code, so it's hard to tell in what way it was implemented and whether it would be acceptable; vendors often don't have to deal with backward compatibility or long-term support, so it's "cheaper" for them to do their own private extensions instead.

Forza-tng commented 1 year ago

There is a patch set for metadata-on-SSD somewhere. This, I think, would be a good middle ground if it were accepted into the mainline kernel. https://patchwork.kernel.org/project/linux-btrfs/patch/20200405082636.18016-2-kreijack@libero.it/

Duncaen commented 1 year ago

https://www.downloads.netgear.com/files/GPL/ReadyNASOS_V6.10.8_WW_src.zip

The paths I looked at are:

btrfs-tools-4.16/debian/patches/0010-Add-btrfs-balance-sweep-subcommand-for-dat-tiering.patch
linux-4.4.218-x86_64/fs/btrfs

I haven't looked at the full diff since the kernel is pretty old and much has changed, but basically it looks like it adds another sort function for the devices in __btrfs_alloc_chunk2 (now btrfs_create_chunk), sorting them by a class attribute, and then an ioctl for a "sweep" filter for balance.
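For illustration, here is a rough standalone sketch of that idea (the field and function names are made up, not the ones from the Netgear patch or the kernel): give each device a "class" attribute and sort candidate devices by class before free space, so chunk allocation fills the fast tier first.

```c
/* Sketch only: illustrates sorting allocation candidates by a tier class. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct dev_candidate {
    const char *name;
    int        class;       /* hypothetical tier attribute: 0=NVMe, 1=SSD, 2=HDD */
    uint64_t   free_bytes;  /* unallocated space on this device */
};

/* Order by class first (fast tier wins), then by free space, largest first,
 * which roughly mirrors how btrfs normally prefers emptier devices. */
static int cmp_class_then_free(const void *a, const void *b)
{
    const struct dev_candidate *x = a, *y = b;
    if (x->class != y->class)
        return x->class - y->class;
    if (x->free_bytes != y->free_bytes)
        return x->free_bytes < y->free_bytes ? 1 : -1;
    return 0;
}

int main(void)
{
    struct dev_candidate devs[] = {
        { "hdd1",  2, 3000ULL << 30 },
        { "nvme0", 0,  200ULL << 30 },
        { "ssd0",  1,  500ULL << 30 },
    };
    qsort(devs, 3, sizeof(devs[0]), cmp_class_then_free);
    for (int i = 0; i < 3; i++)
        printf("allocate from %s first\n", devs[i].name);
    return 0;
}
```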

studyfranco commented 11 months ago

This would be a fantastic addition to Btrfs. I'd like to emphasize the importance of being able to specify sub-volume affinity. Imagine having sub-volumes for /, /var/log, and /home. Here's the concept:

In this system, data from / is given the highest priority for storage space on tier 1, with a lower priority for /var/log and /home on the same tier. Similarly, data from /var/log is given the highest priority for storage space on tier 3, with a lower priority for / and /home on the same tier.

I imagine two parameters to implement this:

This level of control over data placement within sub-volumes would be a game-changer. It allows for finely tuned optimization of storage resources based on specific usage scenarios. It would further solidify Btrfs as a powerful and flexible file system for data management.
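Purely as an illustration of the affinity idea above (nothing like this exists in btrfs, and all names below are invented): each subvolume carries an ordered list of preferred tiers, and new allocations go to the first preferred tier that still has room.

```c
/* Hypothetical sketch of per-subvolume tier affinity; not a btrfs interface. */
#include <stdint.h>
#include <stdio.h>

#define NR_TIERS 3

struct subvol_affinity {
    const char *subvol;
    int preferred[NR_TIERS];   /* tier numbers, most preferred first */
};

static const struct subvol_affinity affinity_table[] = {
    { "/",        { 1, 2, 3 } },   /* root prefers NVMe, spills to SSD, then HDD */
    { "/home",    { 2, 1, 3 } },
    { "/var/log", { 3, 2, 1 } },   /* logs prefer the slow tier */
};

/* Pick the tier for a new allocation: first preference with free space wins. */
static int pick_tier(const struct subvol_affinity *a,
                     const uint64_t free_per_tier[NR_TIERS])
{
    for (int i = 0; i < NR_TIERS; i++) {
        int tier = a->preferred[i];
        if (free_per_tier[tier - 1] > 0)
            return tier;
    }
    return -1; /* all tiers full */
}

int main(void)
{
    /* NVMe full, SSD and HDD still have room. */
    uint64_t free_per_tier[NR_TIERS] = { 0, 100ULL << 30, 4000ULL << 30 };
    for (size_t i = 0; i < sizeof(affinity_table) / sizeof(affinity_table[0]); i++)
        printf("%s -> tier %d\n", affinity_table[i].subvol,
               pick_tier(&affinity_table[i], free_per_tier));
    return 0;
}
```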

Forza-tng commented 11 months ago

@TheLinuxGuy, @studyfranco It might be worth having a look at the Btrfs preferred-metadata patches: https://github.com/kakra/linux/pull/26

They do not explicitly deal in tiers, but they do introduce metadata-only, metadata-preferred, data-only and data-preferred priorities.
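As a rough sketch of how those four classes could steer allocation (the enum values and helpers below are my own simplification, not the identifiers used in the actual patch set): for a metadata chunk, data-only devices are excluded and metadata-leaning devices are tried first, and the mirrored logic applies to data chunks.

```c
/* Illustrative names only; not the constants from the preferred-metadata patches. */
#include <stdbool.h>
#include <stdio.h>

enum alloc_pref {
    PREF_METADATA_ONLY,
    PREF_METADATA,      /* metadata-preferred */
    PREF_DATA,          /* data-preferred */
    PREF_DATA_ONLY,
};

/* A device is excluded when it is "*-only" for the opposite chunk type. */
static bool device_allowed(enum alloc_pref pref, bool for_metadata)
{
    if (for_metadata)
        return pref != PREF_DATA_ONLY;
    return pref != PREF_METADATA_ONLY;
}

/* Lower rank = tried first. Metadata chunks favor metadata-leaning devices,
 * data chunks favor data-leaning devices. */
static int device_rank(enum alloc_pref pref, bool for_metadata)
{
    if (for_metadata)
        return pref;
    return PREF_DATA_ONLY - pref;
}

int main(void)
{
    enum alloc_pref nvme = PREF_METADATA, hdd = PREF_DATA;
    printf("metadata chunk: nvme allowed=%d rank=%d, hdd allowed=%d rank=%d\n",
           device_allowed(nvme, true), device_rank(nvme, true),
           device_allowed(hdd, true), device_rank(hdd, true));
    return 0;
}
```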

kakra commented 11 months ago

> @TheLinuxGuy, @studyfranco It might be worth having a look at the Btrfs preferred-metadata patches: kakra/linux#26
>
> They do not explicitly deal in tiers, but they do introduce metadata-only, metadata-preferred, data-only and data-preferred priorities.

Rebased to 6.6 LTS: https://github.com/kakra/linux/pull/31

studyfranco commented 10 months ago

> @TheLinuxGuy, @studyfranco It might be worth having a look at the Btrfs preferred-metadata patches: kakra/linux#26 They do not explicitly deal in tiers, but they do introduce metadata-only, metadata-preferred, data-only and data-preferred priorities.
>
> Rebased to 6.6 LTS: kakra/linux#31

This is a very good start. But my use case (and my proposal) is more complex. I have a hybrid system, and Btrfs with this feature would be the best file system for home usage: no wasted space, no compromises, and the most adaptable when we want to play games.

kakra commented 10 months ago

> This is a very good start. But my use case (and my proposal) is more complex. I have a hybrid system, and Btrfs with this feature would be the best file system for home usage: no wasted space, no compromises, and the most adaptable when we want to play games.

Currently I'm solving it this way:

I have two NVMe drives; each drive has a 64 GB metadata-preferred partition for btrfs. The remaining space is an md-raid1 that holds the bcache cache partition. All HDDs (4x 4 TB) are data-preferred partitions formatted on top of bcache backing devices in writeback mode and attached to the md-raid1 cache.

This way, metadata is on native NVMe, because bcache doesn't handle CoW metadata very efficiently, and I still get the benefits of having hot data on NVMe. I'm using these patches to exclude some IO traffic from being cached (e.g. backup or maintenance jobs with idle IO priority): https://github.com/kakra/linux/pull/32

I achieve a cache hit rate of 96% and a bypass-hit rate of 95% (IO requests that should have bypassed caching but were already in the cache) for an 800 GB cache and 4.2 TB of used btrfs storage.

Actually, combining bcache with preferred metadata worked magic: cache hit rates went up and response times went down a lot. Transfer rates peak around 2 GB/s, which is slower than native NVMe but still very good. Average transfer rates are around 300-500 MB/s, with data coming partially from cache and partially from HDD. Migrating this setup from a single SSD to dual NVMe improved perceived responsiveness a lot. Still, due to CoW and btrfs data-raid1, bcache cannot work optimally and wastes some space and performance. A better integration of the two would be useful, where bcache would know about btrfs raid1 and store data just once, or CoW would inform bcache about unused blocks.

bugsquasher1991 commented 1 month ago

I would like to add to this feature request that it would also be a great idea to have tiered storage at a directory or file level, i.e. making the "tiering" a property of the directory or file itself.

This could be done for both data and metadata.

As proposed, we could even have different "tier levels" defined for use cases like NVMe <-> SATA SSD <-> HDD.

By making tiering a property of a file or directory, people could mark certain files that they always want to be accessible quickly (e.g. without spin-up time) so that the filesystem stores them on the fast cache SSDs of a pool. This would be a nice way to decide which files are stored on the cache, as opposed to only going by the last-accessed data and keeping that in the cache.

A use case could be, for example, a home server where personal files, pictures, etc. should always be available without delay, while large media files can be stored on slower rotational drives that take time to spin up.
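As a sketch of what such a per-file hint could look like (btrfs has no such attribute today; the name "btrfs.tier" is invented here for illustration, by loose analogy with the existing per-file "btrfs.compression" property): the tier hint is expressed as an extended attribute on the file.

```c
/* Sketch only: "btrfs.tier" does not exist, so setxattr is expected to fail
 * on a real kernel. It just shows how the hint could be set per file. */
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <file> <tier 1-3>\n", argv[0]);
        return 1;
    }
    /* e.g. pin family photos to the fast tier, let the filesystem demote the rest */
    if (setxattr(argv[1], "btrfs.tier", argv[2], strlen(argv[2]), 0) != 0) {
        perror("setxattr");
        return 1;
    }
    return 0;
}
```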