trynd / wren

Linux boot platform that provides a portable, multi-system, run-in-memory Linux environment.
GNU General Public License v3.0

Investigate ZFS and Btrfs #10

Open · codewithmichael opened this issue 9 years ago

codewithmichael commented 9 years ago

This issue should store investigation notes on the potential advantages and disadvantages of using ZFS or Btrfs for writable disk images, most notably the active save data image.

As noted in issue #2, LVM snapshot functionality is currently broken (for Wren's purposes) in Ubuntu 14.04. Now is as good a time as any to look into the proposed alternatives (ZFS / Btrfs).

Version 0.1.1, copy/save.sh, lines 216-218:

###
### DISABLED: Snapshots with udev/overlayfs are currently broken in Ubuntu 14.04
###

Initial Notes:

codewithmichael commented 9 years ago

Another (major) +1 for Btrfs, from http://zfsonlinux.org/faq.html#WhyShouldIUseA64BitSystem:

You are strongly encouraged to use a 64-bit kernel. At the moment zfs will build in a 32-bit environment but will not run stably.

Wren is currently (primarily) run on Ubuntu Desktop 14.04.1 32-bit, so that pretty much makes ZFS a no-go. However, there's an additional note:

Proper support for 32-bit systems is contingent upon the zfs code being weaned off its dependence on virtual memory. This will take some time to do correctly but it is planned for the Linux port.

So it may still be on the table long-term.

codewithmichael commented 9 years ago

The article Btrfs File-System For Old Computers? verifies that Btrfs runs on 32-bit systems, and also provides benchmarks on the Btrfs mount options.

The article is dated (2011), but some of the information provided likely still applies. A couple of important excerpts follow below.

Option descriptions (a sample mount command follows the list):

  • nodatasum: Do not checksum data. Means bit flips and bit rot might go undetected, but allows for slightly faster operation since data checksum does not have to be calculated. On most modern CPUs this option does not result in any reasonable performance improvement.
  • nodatacow: Do not copy-on-write data. datacow is used to ensure the user either has access to the old version of a file, or to the newer version of the file. datacow makes sure we never have partially updated files written to disk. nodatacow gives a slight performance boost by directly overwriting data (like ext[234]), at the expense of potentially getting partially updated files on system failures. Performance gain is usually < 5% unless the workload is random writes to large database files, where the difference can become very large.
  • nobarrier: Do not use device barriers. NOTE: Using this option greatly increases the chances of you experiencing data corruption during a power failure situation. This means full file-system corruption, and not just losing or corrupting data that was being written during a power cut or kernel panic
  • compress: Enable compression. In the kernels >2.6.38 you can choose the algorithm for compression:
    • compress=zlib: Better compression ratio. It's the default and is safe for older kernels.
    • compress=lzo: Fastest compression. btrfs-progs 0.19 or older will fail with this option. The default in the kernel 2.6.39 and newer.
  • space_cache: Btrfs stores the free space data on disk to make the caching of a block group much quicker (Kernel 2.6.37+). It's a persistent change and is safe to boot into old kernels.
  • autodefrag: will detect random writes into existing files and kick off background defragging. It is well suited to bdb or sqlite databases, but not virtualization images or big databases (yet). Once the developers make sure it doesn't defrag files over and over again, they'll move this toward the default. (Kernel 3.0+)
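
For reference, here's a minimal sketch of applying these options at mount time; the image path and mount point are illustrative, not Wren's actual layout:

# mount a Btrfs image with compression, the free-space cache, and autodefrag
mount -o loop,compress=lzo,space_cache,autodefrag /path/to/save.img /mnt/save
# options can also be adjusted on an already-mounted file system
mount -o remount,compress=lzo /mnt/save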

Btrfs vs EXT4:

The only Btrfs mount option that allowed the Btrfs file-system to perform close to the speed of the EXT4 file-system was when disabling barriers for Btrfs. However, not using device barriers is a hazard as you can end up with a corrupted disk if your system loses power during writes. While disabling barriers can increase the performance, the risk is not generally worth it unless you are confident in your battery system regarding power loss or other crashes.

...

Btrfs is at least faster than EXT4 when it comes to the PostMark mail server benchmark. Btrfs with its default mount options was faster than EXT4 by over 10%. The mount options that were of the most benefit to [Btrfs against] EXT4 were compress=lzo (LZO compression) and autodefrag. Other options like nodatasum (no checksum data) and nodatacow (no copy-on-write data) were also of modest benefits.

...

With the first FS-Mark, which was pushing 1,000 files of 1MB size, the Btrfs file-system performance matched EXT4 when both were at their defaults. When enabling Btrfs compression (Zlib was beneficial, but LZO compression was of the most benefit) the scores went up. The scores also went up when not enforcing the disk barriers.

...

When running the same test but not enforcing sync/fsync calls, the stock Btrfs file-system did even better than EXT4. Btrfs was up by approximately 30%. The Btrfs disk compression was also of tremendous benefit here. The other mount options had little impact on the results.

codewithmichael commented 9 years ago

After playing with Btrfs for a little while this afternoon, I think I can build a workable save/snapshot system to replace the old LVM one. It would require adding the btrfs-tools package as a build dependency, but we could then drop lvm2 and watershed, which have a larger footprint.

Creating a Btrfs device image and subvolume to hold the save root seems pretty straightforward. The processing order is slightly different from the LVM method's, but it should look the same from the user's side. I'll experiment with it a bit on a local Wren instance to mirror the existing functionality, and then I'll start working on a new save script.

Creating a Btrfs save script that uses the current rsync-to-save method should be easy enough, but I'd also like to experiment with using a Btrfs save image as well. Btrfs has a send/receive pattern for backing up subvolumes (including snapshots) that seems like it might be more performant and complete than the existing rsync method. Additionally, that would allow for creating subvolumes for different sections of the active save data — notably the user's home directory — and backing them up separately if that functionality becomes desirable down the road. It would, of course, require converting existing save images, but we could provide a save image converter script for anyone making an upgrade.
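
For illustration, the basic send/receive pattern looks something like this (paths and snapshot names are hypothetical):

# send requires a read-only snapshot as its source
btrfs subvolume snapshot -r /mnt/active/root /mnt/active/root@backup
# replicate the snapshot into the save image's top level
btrfs send /mnt/active/root@backup | btrfs receive /mnt/save/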

Anyway, if I come up with something workable I'll drop a pull request when I have something substantial.

codewithmichael commented 9 years ago

After some research and fiddling, it doesn't look like there's a very neat way to use the Btrfs send/receive feature. btrfs receive immediately sets the received subvolume to read-only upon completion. Even if the subvolume were marked writable, btrfs receive won't write into an existing subvolume.

It may be possible to create a "difference snapshot", send that, and then merge and rename them on the other end, but there are too many places along the way where something could go wrong.
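
For the record, that idea would look roughly like this, using an incremental send against a parent snapshot both sides already hold (all names are hypothetical):

# take a new read-only snapshot and send only the difference from the parent
btrfs subvolume snapshot -r /mnt/active/root /mnt/active/root@new
btrfs send -p /mnt/active/root@old /mnt/active/root@new | btrfs receive /mnt/save/
# "merging" then means rotating subvolumes on the receiving end
btrfs subvolume delete /mnt/save/root@current
mv /mnt/save/root@new /mnt/save/root@current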

I'll just stick with the existing rsync method. It's worked well so far.

codewithmichael commented 9 years ago

Misc Btrfs bonus note: I failed to mention (though it was implied via assumed feature parity) that we can still increase the active save size using Btrfs as we did with LVM. But we can also safely shrink it without taking the file system offline (something that was problematic before). This is useful in case of an accidental or over-zealous increase.
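
Both directions are a single online command with Btrfs; a minimal sketch, assuming a loop-backed image at hypothetical paths:

# grow: enlarge the backing image and loop device first, then the file system
truncate -s +512M /path/to/active.img
losetup -c /dev/loop0
btrfs filesystem resize +512M /mnt/active
# shrink: also works online, no unmount required
btrfs filesystem resize -512M /mnt/active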

codewithmichael commented 9 years ago

Today, as a proof of concept, I got a Btrfs variant of the current save script working (with snapshots).

See commit (branch: replace-lvm-with-btrfs)

Since Btrfs isn't part of the boot process yet, testing the save script requires manually creating a Btrfs volume and root subvolume for the active data, manually populating it (or rsyncing in the save data), and bootstrapping the run environment with the additional platform variables by copying in the new platform.conf.

Hopefully in the next day or two I'll get it booting on the Btrfs root subvolume instead of the LVM save volume. If everything works out as planned, we should be able to strip out all the LVM code/dependencies soon.

Applicable save-btrfs.sh test steps (run all commands as root):
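
A rough sketch of the procedure described above (file names, sizes, and mount points here are assumptions, not the original listing):

# create and format a sparse Btrfs volume image
truncate -s 4G /tmp/active.img
mkfs.btrfs /tmp/active.img
# mount it and create the root subvolume for the active data
mount -o loop /tmp/active.img /mnt/active
btrfs subvolume create /mnt/active/root
# populate it (or rsync in the save data)
rsync -a /path/to/save-data/ /mnt/active/root/
# bootstrap the run environment with the additional platform variables
cp platform.conf /path/to/run-environment/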

codewithmichael commented 9 years ago

Pull request #12 (Replace LVM with Btrfs) fully addresses a Btrfs replacement. It should undergo further testing, but considering how much it simplifies the code base and the importance of re-enabling snapshot functionality, I think we should move forward with the change as version 0.2.0 when it's ready.

codewithmichael commented 9 years ago

I was reading through some of the more advanced Btrfs features, and after I gained an understanding of how seeding works, it occurred to me that we may be able to use it instead of OverlayFS for handling root and platform layering. Basically we would create a compressed root volume and turn it into a seed, then create a compressed platform volume added to the root seed.

So far as I can tell, though, we would still need OverlayFS to handle the save layer. While we could turn the platform layer into a seed and then add the save layer onto that, the save layer would then be tied to that platform instance. That doesn't really suit our purposes because we need to be able to rebuild the platform layer without replacing the save layer.

In any case, I'm just brainstorming on the concept a bit. It would require some heavy restructuring since the root and platform would need to exist as Btrfs volume images instead of SquashFS images. But it would allow us to free up an OverlayFS layer, which could potentially have other uses down the road.
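
For reference, the basic seeding mechanics look like this (device names are illustrative):

# mark the root volume as a seed; it becomes permanently read-only
btrfstune -S 1 /dev/loop0
mount /dev/loop0 /mnt/root
# add a writable device on top of the seed, then remount read-write
btrfs device add /dev/loop1 /mnt/root
mount -o remount,rw /mnt/root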

codewithmichael commented 9 years ago

On another note, if we start using Btrfs save images for on-disk storage, we could allow multiple save instances in a single save file. That might be more appropriate if we wanted to focus more on user-based save files with multiple instances per user. Of course, then we would be looking at nested Grub entries and that gets messy.

A better use for on-disk Btrfs save files would be to take advantage of the Btrfs snapshotting feature when the user disables in-memory save use in favor of working with their save file directly (wren-save-to-ram=0). Essentially we could provide the same only-save-on-demand functionality Wren is known for while working entirely on disk (particularly useful on low-RAM or no-swap machines).

Additionally, multiple subvolumes could be used to save trees separately — e.g. save just the home directory contents to disk instead of the entire root tree. Then again, we could handle the same functionality by changing the rsync source and destination directories.

There is also the obvious advantage of using compression for save data.
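
As a sketch, a per-tree layout might look like this (subvolume names and mount points are hypothetical):

# one subvolume per tree to be saved
btrfs subvolume create /mnt/save/root
btrfs subvolume create /mnt/save/home
# each tree can then be snapshotted and saved independently
btrfs subvolume snapshot -r /mnt/save/home /mnt/save/home@backup
# compression applies per mount, covering every subvolume
mount -o remount,compress=lzo /mnt/save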

codewithmichael commented 9 years ago

Release v0.2.0 introduced Btrfs as the backing file system for active data storage.

codewithmichael commented 9 years ago

Since the 0.2.0 release I've been experimenting with getting Wren running on an Apple MacBook Pro. After a couple of days of experimenting with GPT partitioning schemes and alternate Grub packages I got it working, but it VERY quickly died each time I fired it up.

Unfortunately there's no easy way to grab a copy of the dmesg/syslog since it all fails pretty quickly — but to summarize, Btrfs is failing and remounting the active filesystem read-only. The failure seems to have something to do with the multi-core processor.

I haven't yet duplicated the issue on a virtual machine, so I'm not sure what other environments might exhibit the problem, but I found a few references on Launchpad and elsewhere to people encountering similar issues. Apparently fixes have been made in the 3.14.x and 3.18.x kernels, but unfortunately the newest kernel presently available in Ubuntu 14.04.x is 3.13.0-46.

I hate to say it, but we may have jumped the gun in making Btrfs the default active data file system. Perhaps we should change that for 0.3.0, but in the meantime, I think it would be best to provide an option to disable Btrfs and instead fall back to plain old ext3/4 (ext4 might perform better in RAM, but shouldn't be used on flash devices). Sure, this would break snapshotting, but snapshotting wasn't working under LVM pre-0.2.0 anyway, and a plain ext file system has much smaller overhead. I believe we can still perform online resizing (growing, at least) with an ext4 disk image.
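
For reference, online growing of an ext4 disk image would look something like this (paths and sizes are illustrative); shrinking, by contrast, requires the file system to be unmounted first:

# enlarge the backing image and refresh the loop device's capacity
truncate -s +512M /path/to/save.img
losetup -c /dev/loop0
# resize2fs grows a mounted ext4 file system to fill its device
resize2fs /dev/loop0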