The two interesting parts about this are the suggestion that Thin LVM is less reliable than Btrfs (this might be accurate), and the point about providing authentication (which might not be accurate).
I could make a point about perceived efficiency and speed for Thin LVM vs Btrfs, the main one being that no one ever seems to actually compare them with benchmarks, not even @michaellarabel. My experience says that Btrfs would lag behind Thin LVM in overall use, but that is just my impression. I also saw a tendency for Btrfs to "blow up" where metadata use would suddenly skyrocket when reflinking large image files in combination with snapshotting the parent (sub)volume; this was with the late 3.x kernels so ymmv.
It's worth noting, WRT the future of Linux storage, that Red Hat appears to actively dislike both Thin LVM and Btrfs, and is reported to be building a flexible successor storage system called Stratis.
Since interest in backups on Qubes (at least incremental backups) is not high, a change to using Btrfs as the Qubes default would not impact Wyng greatly. But also, adding Btrfs support to Wyng should not be a huge undertaking if people want it.
A quick note about Stratis...
It appears to be a configuration management system for "storage pools", where a pool is an XFS filesystem spanning one or more block devices. XFS is used in reflink mode to manage disk image files and "snapshots" containing online shrink-capable filesystems. Red Hat claims to be doing this because the Btrfs code tree was supposedly not maintainable for enterprise environments. The only tangible benefit I'd expect is a performance advantage over Btrfs (it would be interesting to compare XFS and Btrfs for hosting large reflinked disk image files).
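For anyone unfamiliar with reflink mode, the basic operation looks like this (a minimal sketch; the device and image names are hypothetical, and the same cp command works on Btrfs):
mkfs.xfs -m reflink=1 /dev/sdX1                  # XFS needs reflink enabled at mkfs time (default in newer xfsprogs)
cp --reflink=always disk.img disk-snap.img       # clone shares extents with the original; only later writes diverge (CoW)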
@tasket @tlaurion Would you be willing to comment on QubesOS/qubes-issues#6476? That is a mere proposal, not a final decision, and commentary (including by those who are not QubesOS users!) would be greatly appreciated. I am no expert whatsoever on the Linux storage stack.
I am still going to wait for detailed benchmark comparisons before supporting this. As it stands now, the general wisdom and experience is that Btrfs can be slow, and large disk image files with snapshots are exactly its worst performance case.
Even ZFS created a special mode (ZVOLs) to handle disk images efficiently.
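For what it's worth, a first-pass comparison could be as simple as running the same random-write load against a disk image file on each filesystem, before and after taking a snapshot (a hedged sketch, not a rigorous benchmark; paths and sizes are made up):
fio --name=imgtest --filename=/mnt/testfs/disk.img --size=8G \
    --rw=randwrite --bs=4k --ioengine=libaio --direct=1 --iodepth=32 \
    --runtime=60 --time_based
# then take a snapshot (btrfs subvolume snapshot / lvcreate -s / cp --reflink) and re-run to see the post-snapshot CoW cost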
I would wager that the best way to wring performance from Btrfs with disk image snapshots is to flag them nodatacow and add them to separate subvolumes, instead of using reflinks. If that's the case, it would mean a) Qubes getting a refactored Btrfs driver, b) quite different coding details when adding Btrfs to Wyng.
Snapshots automatically turn CoW back on, so nodatacow will not help.
IIRC nodatacow can be set for individual disk image files that are sitting in a subvolume. So the files only experience a data CoW-like event after a subvol snapshot, not on a second-by-second basis whenever any data is written.
In Qubes OS, all persistent volumes have at least one snapshot, by default. So the only difference would be second and further writes to the same extent after qube startup.
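For reference, per-file nodatacow is normally set with chattr +C, and it only takes effect on new or empty files (a minimal sketch; the path is hypothetical):
touch /var/lib/qubes/appvms/example/private.img        # file must be empty when the attribute is set
chattr +C /var/lib/qubes/appvms/example/private.img
lsattr /var/lib/qubes/appvms/example/private.img       # the 'C' flag indicates No_COW
# as noted above, a subvolume snapshot still forces one CoW event on the next write to each extent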
One addendum about Stratis: it uses device-mapper thin volumes (without LVM) to store its XFS filesystems.
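For the curious, the Stratis CLI exposes that layering fairly directly (a sketch assuming stratis-cli; the device, pool and filesystem names are hypothetical):
stratis pool create mypool /dev/sdb                    # dm-thin pool under the hood
stratis filesystem create mypool vmstore               # an XFS filesystem on a thin volume
stratis filesystem snapshot mypool vmstore vmstore-snap1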
Yes, so with Qubes' default snapshots always in place, the difference in performance should be somewhere between the cases shown in these benchmarks. We still need benchmarks that are performed in a Qubes environment.
In relation to Wyng, Stratis mapping should be very similar since the current thin-pool method is to ask LVM what the dm-thin device ID is, then use the dm-thin tools on that device.
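For comparison, the current thin-pool lookup amounts to something like the following (a hedged sketch; the VG, pool and LV names are hypothetical, and exact thin-provisioning-tools options may differ by version):
lvs --noheadings -o lv_name,thin_id vg00/vm-untrusted-private         # LVM reports the dm-thin device ID
dmsetup message /dev/mapper/vg00-pool00-tpool 0 reserve_metadata_snap
thin_delta --metadata-snap --thin1 4 --thin2 7 /dev/mapper/vg00-pool00_tmeta   # changed chunks between two thin devices
dmsetup message /dev/mapper/vg00-pool00-tpool 0 release_metadata_snap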
Work has begun on Btrfs reflink volume support. The algorithms needed to obtain metadata and find differences between two snapshots have been added; however, the code to recognize and snapshot reflink volumes still needs to be written before this is usable.
A side-effect of the approach I took (using simple FIEMAP tables obtained via filefrag) is that other filesystems that report this data, such as XFS, will also be supported.
To continue a line of thought from code comments:
It's worth noting that file extent maps have 4KB blocks, which is an order of magnitude more detail than the most detailed thin LVM map with 64KB chunks. So 'do it in Python' is a big maybe here, as even Python libs tend to fall down on either speed or memory requirements. Using Linux commands to pre-process the maps gives me delta lists (to use in Python) that are much smaller than the input maps, and they're fast and work on data streams instead of in memory. Python's difflib does look interesting, though; I would love to see an alternate implementation using it or something similar to see how it performs.
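To make that concrete, the shape of the pipeline is roughly this (a simplified sketch of the idea, not the actual Wyng code; file names are hypothetical):
# dump FIEMAP extent tables for two reflinked snapshots, strip header/summary lines and the extent index,
# then keep only the extent rows that differ
filefrag -v -b4096 snap1/private.img | sed -e '1,3d' -e '$d' | cut -d: -f2- | sort > map1.txt
filefrag -v -b4096 snap2/private.img | sed -e '1,3d' -e '$d' | cut -d: -f2- | sort > map2.txt
comm -13 map1.txt map2.txt > delta.txt     # extent rows present only in the newer snapshot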
Right now the Wyng alpha work in progress is balancing different qualities like low dependency count, CPU portability (as in: use cp and it's ported!), efficiency and overall speed. Some of the choices I'm making (for now, at least) to move forward and retain those qualities mean code that is less aesthetically pleasing or, in the case of sed, just plain harder to read. (I do respond to requests to add comments to segments of code.)
I'd also like to note that our systems are based on the same Linux commands that I'm invoking from Wyng, and I'm being pretty conservative in my choices. I would consider custom re-implementation of those commands' functions, or replacement with 3rd-party libs, to be as much or more of a security risk.
Major problem:
The Linux FIEMAP ioctl output doesn't carry block device numbers, which are needed when a Btrfs volume spans more than one device. With a multi-device fs, the returned data looks OK but won't be correct. This does not affect XFS because that fs doesn't have multi-device maps.
Edit: On further inspection, Btrfs may be synthesizing its own singular address space to account for multiple devices. So we are seeing the numbers from Btrfs' internal raid. If this is true, then the resulting FIEMAP data may be good enough to reliably show where reflinked files have the same blocks.
Edit 2: The issue/solution is explained in a Linux bugzilla record.
I've added close checking of the column layout to the sed script; any significant change should raise an error.
Also checked the filefrag source code. The basic format hasn't changed for well over a decade, and the last change (~11 years ago) was minor, adding dots and colons after numbers.
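The check is conceptually along these lines (a simplified illustration, not the actual sed script; the image path is hypothetical):
hdr=$(filefrag -v -b4096 /var/lib/qubes/appvms/untrusted/private.img | sed -n '3p')
case "$hdr" in
  *ext:*logical_offset:*physical_offset:*length:*) : ;;              # column layout as expected
  *) echo "filefrag column layout changed; aborting" >&2; exit 1 ;;
esac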
The next hurdle will be getting Wyng to recognize & access regular files as logical volumes. At that point, this feature will be ready to test.
OK, so over in filefrag land, a prominent Linux dev doesn't want me to use filefrag with Btrfs because "the FIEMAP ioctl wasn't intended for this use".
Egads. FIEMAP describes the data composition of the file. But he is implying the ioctl strips something important from the FIEMAP data (it doesn't, because Btrfs virtual addresses encompass multiple devices).
Plus meaningless hand-waving about Btrfs subvolumes (as if this were the debate about Btrfs inodes) and a total lack of concern about filefrag used on other raid-like storage; I get the impression Btrfs is not exactly TT's area. IOW, this looks like get-off-my-lawn bs. Unless a Btrfs dev says an extent address is not unique within a Btrfs filesystem, I consider the question settled.
Update: Since I've been lured into combing Btrfs dev notes and source code to address spurious claims about the supposed deep, dark, messy pit that is Btrfs internals, I keep seeing details that are actually reassuring. Btrfs does indeed use logical extent addresses (claiming it doesn't is weird), they are a crucial part of the disk format itself, and – the really good part – they are one of the higher-level abstractions in the format. What the Btrfs design is telling me so far is that they wanted to insulate extent-address organization and mundane file I/O from the vicissitudes of low-level RAID maintenance. (Edit: addresses can change due to internal maintenance functions, but not without incrementing the fs or subvol generation id.) The chart at the bottom of this page gives a general overview.

I think a more abstract extent concept makes reading them from a source like FIEMAP even less worrisome than usual, if all you want are extent addresses and sizes. We should just accept that what comes out of the "physical" fields in that ioctl is virtual in most cases, regardless of the filesystem used. TL;DR: all we care about is that two files pointing to the same extent are pointing to the same data; whether it's mdraid/LVM etc. or Btrfs providing the ultimate translation and access to physical data blocks is of no concern.
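One small confirmation of that view: btrfs itself treats logical extent addresses as queryable objects (a hedged example; the address and mountpoint are made up):
sudo btrfs inspect-internal logical-resolve 12845056 /mnt/btrpool   # lists the file(s) referencing that logical extent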
All this is making me eager to start testing Wyng on multi-device Btrfs setups. And if big issues do arise, there is still XFS as a way to do reflink snapshots.
Local storage abstraction classes including ReflinkVolume have been added. Most required functions are now there, including the ability to make read-only Btrfs subvolume snapshots and monitor fs maintenance incursions via the snapshot's transaction generation property.
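The generation check can be illustrated with the stock tools (a sketch with a hypothetical snapshot path; this is the same value Wyng watches, obtained its own way):
btrfs subvolume show /var/lib/qubes/wyng-snap1 | grep -i generation   # changes if the fs has modified the subvolume
btrfs subvolume find-new /var/lib/qubes/wyng-snap1 123456             # files changed since a given generation number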
This changes Wyng's model of local storage from collections of Lvm_VolGroups containing tables of Lvm_Volumes and pools to a single LocalStorage class pointed at the archive's local storage location. The resulting 'storage' object's lvols dict is populated with objects based on relevant volume and snapshot names (which may or may not exist).
The next steps will be:
- tar paths
- wyng monitor and wyng send

Also to do:
- converting the --local dir to a subvol

@tasket: what advantages will Wyng have over e.g. btrfs send?
@DemiMarie The alternative to btrfs receive is to stack up the send streams like cordwood, which leaves you with a very inefficient/tedious restore process and no archive maintenance functions.

Edit: One could tongue-in-cheek say that the reasons for using Wyng are the reasons why qvm-backup doesn't use btrfs send. :)
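For reference, the stream-stacking pattern being described looks roughly like this (paths hypothetical; each incremental stream depends on replaying all of its predecessors at restore time):
btrfs subvolume snapshot -r /var/lib/qubes /var/lib/qubes-snap2
btrfs send -p /var/lib/qubes-snap1 /var/lib/qubes-snap2 > /mnt/backup/qubes-snap2.btrfs-stream
# the read-only parent snapshot must also stay on the source until the next -p send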
Edit: Wyng's monitor function lowers disk-space consumption for snapshots because snapshots (both reflinked img files and subvol snaps) are deleted after a delta map is made from them. So Wyng enables continuous rotation of snapshots, even when backups aren't being sent. btrfs send requires that local snapshots stay in place, where the disk space they consume keeps growing until the next backup.

@tlaurion @DemiMarie Wyng now has basically a full implementation of reflink support and is ready to try out on Btrfs for anyone curious enough at this stage (note: it still has not yet returned to alpha).
The prerequisite for using Wyng with Btrfs is to make the --local directory a subvolume, such as sudo btrfs subvolume create /var/lib/qubes, or use whichever dir_path your Qubes Btrfs pool uses:
$ qvm-pool info btrpool
name btrpool
dir_path /mnt/btrpool/libqubes
driver file-reflink
ephemeral_volatile False
revisions_to_keep 1
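Tying the two together, that dir_path is what needs to be (or become) a subvolume (a hedged example; only applicable if it isn't one already):
sudo btrfs subvolume create /mnt/btrpool/libqubes      # fails if a plain directory already exists at that path
sudo btrfs subvolume show /mnt/btrpool/libqubes        # errors unless the path really is a subvolume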
Since we are now accessing local filesystem objects, you must be mindful of directory structure. In fact, the current implementation treats subdirectories as part of the Archive volume's name. To demonstrate, send-ing a Qubes VM's disk image file to the archive looks like this:
sudo wyng --local=/mnt/btrpool/libqubes send appvms/untrusted/private.img
You don't have to specify --local if the archive already has that local setting (from arch-init). But showing it this way demonstrates how the path relative to the local directory becomes the Archive volume's name.

It also raises the question of whether users might want to set aside a special dir where they create symlinks to the image files they want to back up, and then point Wyng at that special dir. This would be interesting to try.
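If someone wants to experiment with the symlink idea, the setup might look roughly like this (all paths are hypothetical; whether Wyng follows the symlinks as intended is exactly what would need testing):
sudo btrfs subvolume create /mnt/btrpool/wyng-picks
ln -s /mnt/btrpool/libqubes/appvms/untrusted/private.img /mnt/btrpool/wyng-picks/untrusted-private.img
sudo wyng --local=/mnt/btrpool/wyng-picks send untrusted-private.img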
Btrfs reflink and LVM have now been tested and are working.
Thoughts? https://github.com/QubesOS/qubes-issues/issues/6476
https://btrfs.wiki.kernel.org/index.php/Incremental_Backup