koverstreet / bcachefs

Homogenous storage and foreground volume pinning #684

Open tucnak opened 1 month ago

tucnak commented 1 month ago

Like many others before me, I was reading Principles of Operation when it suddenly clicked.

I'm led to believe that bcachefs may prove to be the ultimate solution and, indeed, provide the illusion of local homogenous storage. This is something I have been trying to accomplish for quite some time, resorting to a number of crutches. I wonder whether bcachefs could address most of those when I upgrade my server later this year. For reference: the server in question exposes 4x6 Gbps SATA ports, 8x6 Gbps SAS ports, and a Samsung PM1735 NVMe. This particular NVMe is the most capable disk I have ever worked with; it's able to saturate the PCIe 4.0 x8 slot almost completely, with ~96 Gbit/s total I/O bandwidth in parallel workloads, making it a perfect fit for a foreground/promote target, I think.

Bcachefs is supposed to create the illusion of, say, a 40 TB filesystem that wouldn't be write-limited by any of the hard disks, and could instead spread peak writes across the background targets. Please do correct me if I'm wrong, but this is finally a big improvement over what the btrfs/zfs people are doing. The usual benefits of btrfs still apply, such as being able to create snapshots and/or stream them from SSDs to HDD archive partitions. However, I also wish to leverage the NVMe's superior read performance for a Postgres cluster, and because a realtime database has strict latency requirements, I have to ensure that the Postgres volume cannot be "evicted" from the foreground. I may be misunderstanding how foreground functions, but on the off-chance that I'm not, it would really help to have a "pinning" feature that tells bcachefs that a particular volume must remain in the foreground at all times. In my case, the Postgres tablespace is only a fraction of the NVMe's total capacity, so the NVMe would still be able to fulfil its caching role. I think there's also the pesky little issue of leveraging erasure coding, which is completely orthogonal to the md software RAID6 that I'm now accustomed to.

How long before this could become a reality?

P.S. For anybody who's wondering, the SR-IOV bug in PM1735 has been fixed in the latest firmware.

raldone01 commented 1 month ago

With bcachefs setattr you can actually set the background target, replicas, and nocow for specific folders. You could set the background target to ssd for your postgres and unset the promote and foreground targets. If you do this, you may need a second ssd if you want replicas, though.
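
As a rough illustration of the per-folder settings described above, here is a minimal Python sketch that only builds the `bcachefs setattr` command line and does not run it. The device label `ssd`, the path `/srv/postgres`, and the helper name `setattr_command` are hypothetical, and the `--background_target=` and `--data_replicas=` spellings are assumptions that should be checked against `bcachefs setattr --help` for your bcachefs-tools version. Unsetting the promote and foreground targets is left out because the exact syntax is not given in this thread.

```python
import shlex


def setattr_command(path, background_target, data_replicas=None):
    """Build (but do not run) a `bcachefs setattr` argv for one directory.

    Option names are assumed to match the bcachefs inode options;
    verify them with `bcachefs setattr --help` before use.
    """
    argv = ["bcachefs", "setattr", f"--background_target={background_target}"]
    if data_replicas is not None:
        # More than one replica on the same target needs more than one device.
        argv.append(f"--data_replicas={data_replicas}")
    argv.append(path)
    return argv


# Hypothetical label and path, for illustration only.
cmd = setattr_command("/srv/postgres", "ssd")
print(shlex.join(cmd))
```

The printed line can be reviewed and then run by hand as root, which keeps the sketch free of side effects.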

Be careful with the 6.9 kernel; I am experiencing lots of kernel hangs.