The two interesting parts about this are the suggestion that Thin LVM is less reliable than Btrfs (this might be accurate), and the point about providing authentication (which might not be accurate).
I could make a point about perceived efficiency and speed for Thin LVM vs Btrfs, the main one being that no one ever seems to actually compare them with benchmarks, not even @michaellarabel. My experience says that Btrfs would lag behind Thin LVM in overall use, but that is just my impression. I also saw a tendency for Btrfs to "blow up" where metadata use would suddenly skyrocket when reflinking large image files in combination with snapshotting the parent (sub)volume; this was with the late 3.x kernels so ymmv.
It's worth noting, WRT the future of Linux storage, that Red Hat appears to actively dislike both Thin LVM and Btrfs, and is reported to be building a flexible successor storage system called Stratis.
Since interest in backups on Qubes (at least incremental backups) is not high, a change to using Btrfs as the Qubes default would not impact Wyng greatly. But also, adding Btrfs support to Wyng should not be a huge undertaking if people want it.
A quick note about Stratis...
It appears to be a configuration management system for "storage pools", where a pool is an XFS filesystem spanning one or more block devices. XFS is used in reflink mode to manage disk image files and "snapshots" containing online shrink-capable filesystems. Red Hat claims to be doing this because the Btrfs code tree was supposedly not maintainable for enterprise environments. The only tangible benefit I'd expect is a performance advantage over Btrfs (it would be interesting to compare XFS and Btrfs for hosting large reflinked disk image files).
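For anyone unfamiliar with reflink mode, the basic operation looks like this (a minimal sketch; the device and image names are hypothetical, and the same cp command works on Btrfs):
mkfs.xfs -m reflink=1 /dev/sdX1                  # XFS needs reflink enabled at mkfs time (default in newer xfsprogs)
cp --reflink=always disk.img disk-snap.img       # clone shares extents with the original; only later writes diverge (CoW)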
@tasket @tlaurion Would you be willing to comment on QubesOS/qubes-issues#6476? That is a mere proposal, not a final decision, and commentary (including by those who are not QubesOS users!) would be greatly appreciated. I am no expert whatsoever on the Linux storage stack.
I am still going to wait for detailed benchmark comparisons before supporting this. As it stands now, the general wisdom and experience is that Btrfs can be slow, and large disk image files with snapshots are exactly its worst performance case.
Even ZFS created a special mode (ZVOLs) to handle disk images efficiently.
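For what it's worth, a first-pass comparison could be as simple as running the same random-write load against a disk image file on each filesystem, before and after taking a snapshot (a hedged sketch, not a rigorous benchmark; paths and sizes are made up):
fio --name=imgtest --filename=/mnt/testfs/disk.img --size=8G \
    --rw=randwrite --bs=4k --ioengine=libaio --direct=1 --iodepth=32 \
    --runtime=60 --time_based
# then take a snapshot (btrfs subvolume snapshot / lvcreate -s / cp --reflink) and re-run to see the post-snapshot CoW cost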
I would wager that the best way to wring performance from Btrfs with disk image snapshots is to flag them nodatacow and add them to separate subvolumes, instead of using reflinks. If that's the case, it would mean a) Qubes getting a refactored Btrfs driver, b) quite different coding details when adding Btrfs to Wyng.
Snapshots automatically turn CoW back on, so nodatacow will not help.
IIRC nodatacow can be set for individual disk image files that are sitting in a subvolume. So the files only experience a data CoW-like event after a subvol snapshot, not on a second-by-second basis whenever any data is written.
In Qubes OS, all persistent volumes have at least one snapshot, by default. So the only difference would be second and further writes to the same extent after qube startup.
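For reference, per-file nodatacow is normally set with chattr +C, and it only takes effect on new or empty files (a minimal sketch; the path is hypothetical):
touch /var/lib/qubes/appvms/example/private.img        # file must be empty when the attribute is set
chattr +C /var/lib/qubes/appvms/example/private.img
lsattr /var/lib/qubes/appvms/example/private.img       # the 'C' flag indicates No_COW
# as noted above, a subvolume snapshot still forces one CoW event on the next write to each extent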
One addendum about Stratis: it uses device-mapper thin volumes (without LVM) to store its XFS filesystems.
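For the curious, the Stratis CLI exposes that layering fairly directly (a sketch assuming stratis-cli; the device, pool and filesystem names are hypothetical):
stratis pool create mypool /dev/sdb                    # dm-thin pool under the hood
stratis filesystem create mypool vmstore               # an XFS filesystem on a thin volume
stratis filesystem snapshot mypool vmstore vmstore-snap1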
Yes, so with Qubes' default snapshots always in place, the difference in performance should be somewhere between the cases shown in these benchmarks. We still need benchmarks that are performed in a Qubes environment.
In relation to Wyng, Stratis mapping should be very similar since the current thin-pool method is to ask LVM what the dm-thin device ID is, then use the dm-thin tools on that device.
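For comparison, the current thin-pool lookup amounts to something like the following (a hedged sketch; the VG, pool and LV names are hypothetical, and exact thin-provisioning-tools options may differ by version):
lvs --noheadings -o lv_name,thin_id vg00/vm-untrusted-private         # LVM reports the dm-thin device ID
dmsetup message /dev/mapper/vg00-pool00-tpool 0 reserve_metadata_snap
thin_delta --metadata-snap --thin1 4 --thin2 7 /dev/mapper/vg00-pool00_tmeta   # changed chunks between two thin devices
dmsetup message /dev/mapper/vg00-pool00-tpool 0 release_metadata_snap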
Work has begun on Btrfs reflink volume support. The algorithms needed to obtain metadata and find differences between two snapshots have been added; however, the code to recognize and snapshot reflink volumes still needs to be written before this is usable.
A side-effect of the approach I took (using simple FIEMAP tables obtained via filefrag) is that other filesystems that report this data, such as XFS, will also be supported.
To continue a line of thought from code comments:
It's worth noting that file extent maps have 4KB blocks, which is an order of magnitude more detail than the most detailed thin LVM map with 64KB chunks. So 'do it in Python' is a big maybe here, as even Python libs tend to fall down on either speed or memory requirements. Using Linux commands to pre-process the maps gives me delta lists (to use in Python) that are much smaller than the input maps, and they're fast and work on data streams instead of in memory. Python's difflib does look interesting, though; I would love to see an alternate implementation using it or something similar to see how it performs.
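To make that concrete, the shape of the pipeline is roughly this (a simplified sketch of the idea, not the actual Wyng code; file names are hypothetical):
# dump FIEMAP extent tables for two reflinked snapshots, strip header/summary lines and the extent index,
# then keep only the extent rows that differ
filefrag -v -b4096 snap1/private.img | sed -e '1,3d' -e '$d' | cut -d: -f2- | sort > map1.txt
filefrag -v -b4096 snap2/private.img | sed -e '1,3d' -e '$d' | cut -d: -f2- | sort > map2.txt
comm -13 map1.txt map2.txt > delta.txt     # extent rows present only in the newer snapshot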
Right now the Wyng alpha work in progress is balancing different qualities like low dependency count, CPU portability (as in: use cp and it's ported!), efficiency and overall speed. Some of the choices I'm making (for now, at least) to move forward and retain those qualities mean code that is less aesthetically pleasing or, in the case of sed, just plain harder to read. (I do respond to requests to add comments to segments of code.)
I'd also like to note that our systems are based on the same Linux commands that I'm invoking from Wyng, and I'm being pretty conservative in my choices. I would consider custom re-implementation of those commands' functions, or replacement with 3rd-party libs, to be as much or more of a security risk.
Major problem:
The Linux FIEMAP ioctl output doesn't carry block device numbers, which are needed when a Btrfs volume spans more than one device. With a multi-device fs, the returned data looks OK but won't be correct. This does not affect XFS because that fs doesn't have multi-device maps.
Edit: On further inspection, Btrfs may be synthesizing its own singular address space to account for multiple devices. So we are seeing the numbers from Btrfs' internal raid. If this is true, then the resulting FIEMAP data may be good enough to reliably show where reflinked files have the same blocks.
Edit 2: The issue/solution is explained in a Linux bugzilla record.
I've added close checking of the column layout to the sed script; any significant change should raise an error.
Also checked the filefrag source code. The basic format hasn't changed for well over a decade, and the last change (~11 years ago) was minor, adding dots and colons after numbers.
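The check is conceptually along these lines (a simplified illustration, not the actual sed script; the image path is hypothetical):
hdr=$(filefrag -v -b4096 /var/lib/qubes/appvms/untrusted/private.img | sed -n '3p')
case "$hdr" in
  *ext:*logical_offset:*physical_offset:*length:*) : ;;              # column layout as expected
  *) echo "filefrag column layout changed; aborting" >&2; exit 1 ;;
esac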
The next hurdle will be getting Wyng to recognize & access regular files as logical volumes. At that point, this feature will be ready to test.
OK, so over in filefrag land, a prominent Linux dev doesn't want me to use filefrag with Btrfs because "the FIEMAP ioctl wasn't intended for this use".
Egads. FIEMAP describes the data composition of the file. But he is implying the ioctl strips something important from the FIEMAP data (it doesn't, because Btrfs virtual addresses encompass multiple devices).
Plus meaningless hand-waving about Btrfs subvolumes (as if this were the debate about Btrfs inodes) and a total lack of concern about filefrag used on other raid-like storage; I get the impression Btrfs is not exactly TT's area. IOW, this looks like get-off-my-lawn bs. Unless a Btrfs dev says an extent address is not unique within a Btrfs filesystem, I consider the question settled.
Update: Since I've been lured into combing Btrfs dev notes and source code to address spurious claims about the supposed deep, dark, messy pit that is Btrfs internals, I keep seeing details that are actually reassuring. Btrfs does indeed use logical extent addresses (claiming it doesn't is weird), they are a crucial part of the disk format itself, and – the really good part – they are one of the higher-level abstractions in the format. What the Btrfs design is telling me so far is that they wanted to insulate extent-address organization and mundane file I/O from the vicissitudes of low-level RAID maintenance. (Edit: addresses can change due to internal maintenance functions, but not without incrementing the fs or subvol generation id.) The chart at the bottom of this page gives a general overview.

I think a more abstract extent concept makes reading them from a source like FIEMAP even less worrisome than usual, if all you want are extent addresses and sizes. We should just accept that what comes out of the "physical" fields in that ioctl is virtual in most cases, regardless of the filesystem used. TL;DR: all we care about is that two files pointing to the same extent are pointing to the same data; whether it's mdraid/LVM etc. or Btrfs providing the ultimate translation and access to physical data blocks is of no concern.
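One small confirmation of that view: btrfs itself treats logical extent addresses as queryable objects (a hedged example; the address and mountpoint are made up):
sudo btrfs inspect-internal logical-resolve 12845056 /mnt/btrpool   # lists the file(s) referencing that logical extent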
All this is making me eager to start testing Wyng on multi-device Btrfs setups. And if big issues do arise, there is still XFS as a way to do reflink snapshots.
Local storage abstraction classes including ReflinkVolume have been added. Most required functions are now there, including the ability to make read-only Btrfs subvolume snapshots and monitor fs maintenance incursions via the snapshot's transaction generation property.
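The generation check can be illustrated with the stock tools (a sketch with a hypothetical snapshot path; this is the same value Wyng watches, obtained its own way):
btrfs subvolume show /var/lib/qubes/wyng-snap1 | grep -i generation   # changes if the fs has modified the subvolume
btrfs subvolume find-new /var/lib/qubes/wyng-snap1 123456             # files changed since a given generation number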
This changes Wyng's model of local storage from collections of Lvm_VolGroups containing tables of Lvm_Volumes and pools to a single LocalStorage class pointed at the archive's local storage location. The resulting 'storage' object's lvols dict is populated with objects based on relevant volume and snapshot names (which may or may not exist).
The next steps will be:
- tar paths
- wyng monitor and wyng send

Also to do:
- converting the --local dir to a subvol

@tasket: what advantages will Wyng have over e.g. btrfs send?
@DemiMarie The alternative to btrfs receive is to stack up the send streams like cordwood, which leaves you with a very inefficient/tedious restore process and no archive maintenance functions.

Edit: One could tongue-in-cheek say that the reasons for using Wyng are the reasons why qvm-backup doesn't use btrfs send. :)
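For reference, the stream-stacking pattern being described looks roughly like this (paths hypothetical; each incremental stream depends on replaying all of its predecessors at restore time):
btrfs subvolume snapshot -r /var/lib/qubes /var/lib/qubes-snap2
btrfs send -p /var/lib/qubes-snap1 /var/lib/qubes-snap2 > /mnt/backup/qubes-snap2.btrfs-stream
# the read-only parent snapshot must also stay on the source until the next -p send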
Edit: Wyng's monitor function lowers disk-space consumption for snapshots because snapshots (both reflinked img files and subvol snaps) are deleted after a delta map is made from them. So Wyng enables continuous rotation of snapshots, even when backups aren't being sent. btrfs send requires that local snapshots stay in place, where the disk space they consume keeps growing until the next backup.

@tlaurion @DemiMarie Wyng now has basically a full implementation of reflink support and is ready to try out on Btrfs for anyone curious enough at this stage (note: it still has not yet returned to alpha).
The prerequisite for using Wyng with Btrfs is to make the --local directory a subvolume, such as sudo btrfs subvolume create /var/lib/qubes, or use whichever dir_path your Qubes Btrfs pool uses:
$ qvm-pool info btrpool
name btrpool
dir_path /mnt/btrpool/libqubes
driver file-reflink
ephemeral_volatile False
revisions_to_keep 1
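Tying the two together, that dir_path is what needs to be (or become) a subvolume (a hedged example; only applicable if it isn't one already):
sudo btrfs subvolume create /mnt/btrpool/libqubes      # fails if a plain directory already exists at that path
sudo btrfs subvolume show /mnt/btrpool/libqubes        # errors unless the path really is a subvolume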
Since we are now accessing local filesystem objects, you must be mindful of directory structure. In fact, the current implementation treats subdirectories as part of the Archive volume's name. To demonstrate, send-ing a Qubes VM's disk image file to the archive looks like this:
sudo wyng --local=/mnt/btrpool/libqubes send appvms/untrusted/private.img
You don't have to specify --local if the archive already has that local setting (from arch-init). But showing it this way demonstrates how the path relative to the local directory becomes the Archive volume's name.

It also raises the question of whether users might want to set aside a special dir where they create symlinks to the image files they want to back up, and then point Wyng at that special dir. This would be interesting to try.
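If someone wants to experiment with the symlink idea, the setup might look roughly like this (all paths are hypothetical; whether Wyng follows the symlinks as intended is exactly what would need testing):
sudo btrfs subvolume create /mnt/btrpool/wyng-picks
ln -s /mnt/btrpool/libqubes/appvms/untrusted/private.img /mnt/btrpool/wyng-picks/untrusted-private.img
sudo wyng --local=/mnt/btrpool/wyng-picks send untrusted-private.img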
Btrfs reflink and LVM have now been tested and are working.
Thoughts? https://github.com/QubesOS/qubes-issues/issues/6476
https://btrfs.wiki.kernel.org/index.php/Incremental_Backup