lvmteam / lvm2

Mirror of upstream LVM2 repository
https://gitlab.com/lvmteam/lvm2
GNU General Public License v2.0

Caching not supported with Integrity RAID #92

Open timwhite1 opened 2 years ago

timwhite1 commented 2 years ago

I've been getting errors when attempting to add a Cache volume to an Integrity RAID volume.

"Command on LV volgroup/mirror_lv is invalid on LV with properties: lv_is_raid_with_integrity . Command not permitted on LV volgroup/mirror_lv."

Parsing the command-line tool definitions, this appears to be a built-in rule that always rejects caching configurations on existing integrity LVs -- currently lines 477-536: https://github.com/lvmteam/lvm2/blob/master/tools/command-lines.in
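
The writecache conversion, for example, is defined in that file with a rule along these lines (excerpt; exact line numbers drift between versions):

lvconvert --type writecache --cachevol LV LV_linear_striped_raid_thinpool
OO: OO_LVCONVERT, --cachesettings String
ID: lvconvert_to_writecache
DESC: Attach a writecache to an LV, converts the LV to type writecache.
RULE: all and lv_is_visible
RULE: all not lv_is_raid_with_integrity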

Is there a history of data corruption or other undesirable behavior in this setup?

I would really like to gain the advantages of dm-integrity in combination with dm-cache in the universal LVM framework. This will help make a solid alternative to ZFS in a universally accessible toolset.

Thank you for your help!

teigland commented 2 years ago

Hi, the limitation was an effort to introduce the raid+integrity feature incrementally and wait to see how it worked and how widely it would be used in its basic form. We have had little to no feedback about how much or how well raid+integrity is used in the real world (not that we have reliable ways of measuring this or getting feedback, unfortunately). If you have any experience using it, good or bad, we'd be interested to hear about it.

timwhite1 commented 2 years ago

That makes sense. Slow change is normally safest. I'm using the Integrity mirror feature now on a simple 2 disk RAID 1. But it's very fresh so I can't give quality feedback yet.

As a workaround for the LVM caching restriction, I added a Stratis pool on top of this integrity-mirror logical volume, which allowed me to apply a read cache. Unfortunately Stratis doesn't allow write-back cache yet (only write-through), so there's no real performance benefit to offset the impact of dm-integrity during writes.

The drives are Seagate Exos 7k 10TB and can normally sustain 250-300 MB/s reads or writes with a standard (non-integrity) mirror. But with integrity applied and the initial sync complete, the volume caps out at 45-56 MB/s on writes. I have an NVMe volume providing the Stratis caching, which is helping the pool hit sustained benchmarks of 2-15 GB/s in the disk manager with sub-ms latency. But it's unclear when write-back caching will become configurable, leading me back to pure LVM.

Is there a reasonable way to bypass the current LVM restriction that blocks cache on integrity RAID volumes? I'm happy to test this setup as well; I just need to be able to hit higher write performance to keep the integrity setup.

Thanks for your help!

teigland commented 2 years ago

Here are some completely unsupported and untested steps to manually create the metadata for the LV layers you want to try. The result appears to activate properly for me. I don't know why it wouldn't work, but it could still destroy all the data, so use for testing purposes only. Let us know how it works, thanks.

This example is using three PVs: two "slow" disks to use in
a raid1 config, and one "fast" disk to use for caching.

$ vgcreate test /dev/sdb /dev/sdc /dev/sdg
  Physical volume "/dev/sdb" successfully created.
  Physical volume "/dev/sdc" successfully created.
  Physical volume "/dev/sdg" successfully created.

1. Create a raid+integrity LV.  You'll later add a writecache layer
   onto this LV.

$ lvcreate --type raid1 --raidintegrity y -L1G -n rr test /dev/sdb /dev/sdc
  Creating integrity metadata LV rr_rimage_0_imeta with size 20.00 MiB.
  Logical volume "rr_rimage_0_imeta" created.
  Creating integrity metadata LV rr_rimage_1_imeta with size 20.00 MiB.
  Logical volume "rr_rimage_1_imeta" created.
  Logical volume "rr" created.

2. Create a plain raid1 LV of the same size.  This is temporary, and is
just used to create some metadata structures that will be used when editing
below.  This space will be deleted by the edits, so any disks can be used.

$ lvcreate --type raid1 -L1G -n ss test /dev/sdb /dev/sdc
  Logical volume "ss" created.

3. Create a fast LV that will be used as the cache, e.g. on the ssd,
   sdg here.

$ lvcreate -n fast -L512M test /dev/sdg
  Logical volume "fast" created.

4. Combine the fast LV with the temporary raid1 LV.

$ vgchange -an test
  0 logical volume(s) in volume group "test" now active

$ lvconvert --type writecache --cachevol fast test/ss
Erase all existing data on test/fast? [y/n]: y
  Using writecache block size 4096 for unknown file system block size, logical block size 512, physical block size 512.
  WARNING: unable to detect a file system block size on test/ss
  WARNING: using a writecache block size larger than the file system block size may corrupt the file system.
Use writecache block size 4096? [y/n]: y
  Logical volume test/ss now has writecache.

5. The result is that ss is writecache+raid1, and rr is raid1+integrity.
The next steps will manually edit the metadata to move the ss writecache
layer onto rr.  The raid1 portion of ss will be deleted.

$ lvs -a test
  LV                   VG   Attr       LSize   Pool        Origin              
  [fast_cvol]          test Cwi---C--- 512.00m                                                                        
  rr                   test rwi---r---   1.00g                                                                        
  [rr_rimage_0]        test gwi---r---   1.00g             [rr_rimage_0_iorig]                                        
  [rr_rimage_0_imeta]  test ewi-------  20.00m                                                                        
  [rr_rimage_0_iorig]  test -wi-------   1.00g                                                                        
  [rr_rimage_1]        test gwi---r---   1.00g             [rr_rimage_1_iorig]                                        
  [rr_rimage_1_imeta]  test ewi-------  20.00m                                                                        
  [rr_rimage_1_iorig]  test -wi-------   1.00g                                                                        
  [rr_rmeta_0]         test ewi---r---   4.00m                                                                        
  [rr_rmeta_1]         test ewi---r---   4.00m                                                                        
  ss                   test Cwi---C---   1.00g [fast_cvol] [ss_wcorig]                                                
  [ss_wcorig]          test rwi---C---   1.00g                                                                        
  [ss_wcorig_rimage_0] test Iwi---r---   1.00g                                                                        
  [ss_wcorig_rimage_1] test Iwi---r---   1.00g                                                                        
  [ss_wcorig_rmeta_0]  test ewi---r---   4.00m                                                                        
  [ss_wcorig_rmeta_1]  test ewi---r---   4.00m            

6. Manual metadata editing to work around the lvm command restrictions.

$ vgcfgbackup test
  Volume group "test" successfully backed up.

$ cp /etc/lvm/backup/test test-new

$ vi test-new
. delete the sections: ss_wcorig_rmeta_0, ss_wcorig_rmeta_1,
  ss_wcorig_rimage_0, ss_wcorig_rimage_1
. delete the contents of the ss_wcorig{} section.
. move the contents of the rr{} section into the ss_wcorig{} section
. delete the empty rr{} section
. replace all instances of the string rr_ with ss_wcorig_
. remove "VISIBLE" from the ss_wcorig status line (see the excerpt below)

7. Write the new metadata to disk.
   (No LVs in the VG should be active when doing this.)

$ vgcfgrestore -f test-new test

8. The result should look like this:
$ lvs -a test
  LV                         VG   Attr       LSize   Pool        Origin                     
  [fast_cvol]                test Cwi---C--- 512.00m                                                                               
  ss                         test Cwi---C---   1.00g [fast_cvol] [ss_wcorig]                                                       
  [ss_wcorig]                test rwi---C---   1.00g                                                                               
  [ss_wcorig_rimage_0]       test gwi---r---   1.00g             [ss_wcorig_rimage_0_iorig]                                        
  [ss_wcorig_rimage_0_imeta] test ewi-------  20.00m                                                                               
  [ss_wcorig_rimage_0_iorig] test -wi-------   1.00g                                                                               
  [ss_wcorig_rimage_1]       test gwi---r---   1.00g             [ss_wcorig_rimage_1_iorig]                                        
  [ss_wcorig_rimage_1_imeta] test ewi-------  20.00m                                                                               
  [ss_wcorig_rimage_1_iorig] test -wi-------   1.00g                                                                               
  [ss_wcorig_rmeta_0]        test ewi---r---   4.00m                                                                               
  [ss_wcorig_rmeta_1]        test ewi---r---   4.00m           

9. Activate test/ss which should now be writecache+raid1+integrity.
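
   For example, one way to do that (an illustrative command; this assumes the
   rest of the VG stays inactive, as in step 7):

$ lvchange -ay test/ss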

dmsetup table should show that
test-ss is writecache
test-ss_wcorig is raid
test-ss_wcorig_rimage_0 is integrity
test-ss_wcorig_rimage_1 is integrity

Note that if you want to try dm-cache instead of dm-writecache, then
you'd use lvconvert --type cache in step 4, and the manual editing
would use the string "corig" instead of "wcorig".
timwhite1 commented 2 years ago

I'll keep you posted on testing. Thank you for your help and the detail you put into this workaround! I really appreciate it!!

timwhite1 commented 1 year ago

@teigland I have been struggling a bit to implement and test this workaround.

I'm trying to implement it via an NVMe integrity RAID 1 (for a resilient writecache) attached to an HDD integrity RAID 1 as the backing disk.

I can understand the custom metadata setup for a single non-integrity drive attached to integrity-mirror backing disks, but I'm struggling with how to assemble it when the writecache volume is also an integrity mirror.

Could you help with that example? I'd love to help test enough for this to eventually become a supported LVM setup if it works well.

Thank you again for your help!

teigland commented 1 year ago

lvm requires the cachevol to be a linear LV (currently an internal lvm limitation, for the sake of keeping things manageable). I'm skeptical of using raid+integrity for the cachevol (or anything more complex than linear, really). I doubt that dm-writecache would accept it (I haven't checked), and even if it did, it would probably be too slow, since the point is to be fast. Handling failures would probably be too complex to be reliable, and in general the complexity of such a stack seems too high.

timwhite1 commented 1 year ago

Ah I see. So can a cache vol even be a RAID at all (apart from integrity)? Somehow I thought it could. I'm hesitant to use Writecache mode on a single cache vol.

I have extremely high performing NVMe drives that I'm using so the performance hit is still very reasonable. I'm just trying to stay LVM but get integrity and caching benefits without reverting to ZFS or BTRFS and Bcache.

What I'm finding is that the complexity is high due to required workarounds right now. But my hope is to eventually see things get more simplified if the testing goes well for myself and others and the rule sets gets opened up. The current LVM rules are setting a very high bar just to test these kinds of setups.

As an alternative is there a kernel or module I could use that has less restrictive caching / integrity rules? Right now I'm on Alma Linux 9.1 and would love to stay RHEL based or UEK. But I can move to Ubuntu if needed.

teigland commented 1 year ago

Ah I see. So can a cache vol even be a RAID at all (apart from integrity)? Somehow I thought it could. I'm hesitant to use Writecache mode on a single cache vol.

No, lvm limits the cachevol to be a linear device. dm-writecache will let us use a raid device, so you could apply this trivial patch to use a raid device (or even raid+integrity) as the cachevol. All I can say is that I'm able to create and activate an LV like this, but no idea how it works otherwise. (I need to think about how we might enable experimental features in lvm without committing to full support for them, so we could have people try out untested things like this.)

diff --git a/tools/lvconvert.c b/tools/lvconvert.c
index 8888cac28c..0650866536 100644
--- a/tools/lvconvert.c
+++ b/tools/lvconvert.c
@@ -6170,11 +6170,6 @@ int lvconvert_writecache_attach_single(struct cmd_context *cmd,
                        goto bad;
                }

-               if (!seg_is_linear(first_seg(lv_fast))) {
-                       log_error("LV %s must be linear to use as a writecache.", display_lvname(lv_fast));
-                       goto bad;
-               }
-
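
With that check removed and lvm rebuilt, the attach command itself stays the same; only the cachevol LV would now be a raid (or raid+integrity) LV. The LV names below are hypothetical, and as noted above this is untested:

$ lvconvert --type writecache --cachevol fast_raid test/main
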
timwhite1 commented 1 year ago

Thank you @teigland for the explanation and for your work to provide this patch. I really appreciate it.

I definitely understand not wanting to make unsupported setups an open arena for users to start reporting bugs on. There's enough work in a day just making supported setups work as expected.

I'm learning a good amount as I experiment. And I really like the idea of being able to stay in LVM even if it requires a kernel patch to test and implement.

** Edit -- please disregard the questions below. I found the lines that the diff removes in lvconvert.c, around line 6173, and lv_fast is just a variable referenced throughout the existing code. https://github.com/lvmteam/lvm2/blob/master/tools/lvconvert.c

Is this something I should be able to compile on the box, referencing source files typically already available, or is this something I need to pull down first? (the a/tools/lvconvert.c and b/... references)

Also, is the "lv_fast" reference used internally already in lvconvert.c, or is that a hard-coded LV name I need to reference when calling lvconvert after applying the patch?

Thank you again for your time. Once I see this work for a bit I will share my build setup with more experimenters and hopefully open some new options in the future for high performance integrity caching using native LVM tools in any distro.

teigland commented 1 year ago

It sounds like your experiment is to use only lvm, but if you're flexible, it's more practical and most common to use lvm with mdraid. I've not really seen mdraid used with integrity, but if you skip integrity for the writecache cachevol, and put the cachevol on md raid1, then you could avoid patching and rebuilding lvm.

e.g.

  # create /dev/md0 as raid1 on /dev/nvme1 and /dev/nvme2
  pvcreate /dev/md0
  lvcreate -n main --type raid1 --raidintegrity y -L <size> vg /dev/sda /dev/sdb
  lvcreate -n fast -L <size> vg /dev/md0
  lvconvert --type writecache --cachevol fast vg/main
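
For the md step, a typical invocation would be something along these lines (device names are illustrative; actual NVMe namespaces usually look like /dev/nvme0n1):

  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1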

timwhite1 commented 1 year ago

I think your example will work well for me. Thank you for your help!

I'm definitely fine with making mdraid part of the stack. My main goal is to have a high performing integrity RAID using native tools, in a setup that can work on basically any kernel and, after significant testing, be very stable.

Right now LVM and mdraid are pretty universal even if not glamorous. Neither are likely to break even on the latest kernel updates. Neither require dkms or kmod packages. And both will still mount automatically even with a failed disk, which seems like a reasonable expectation for a raid set.

With LVM writecache or dm-writecache, the same data doesn't live long-term on the cache volume anyway. It's continually overwritten after flushing to the backing disks and never referenced for reads. I'm not sure it's even checked for repeat writes. So while I was really pushing for an integrity raid even on the cache vol, there's very little chance an integrity writecache could ever provide much value, if any at all. For a read cache or write-back it could be more important, given the repeat usage and longer potential lifetime, but I'm not planning to use those here.

I'll try this and keep you posted.

teigland commented 1 year ago

Sorry for not paying close enough attention here: the last steps I suggested will still run into the limitation that you originally pointed out (and the point of the metadata hacking I outlined). I don't really think that limitation is necessary, and we should relax it upstream. You can recompile lvm with this one line removed to lift the limitation, under the command definition for lvconvert --type writecache --cachevol LV LV_linear_striped_raid_thinpool:

diff --git a/tools/command-lines.in b/tools/command-lines.in
index 4bbafd05dd..a9c9c22b24 100644
--- a/tools/command-lines.in
+++ b/tools/command-lines.in
@@ -501,7 +501,6 @@
 OO: OO_LVCONVERT, --cachesettings String
 ID: lvconvert_to_writecache
 DESC: Attach a writecache to an LV, converts the LV to type writecache.
 RULE: all and lv_is_visible
-RULE: all not lv_is_raid_with_integrity

timwhite1 commented 1 year ago

Got it - thank you for catching that detail for me!

When I recompile lvconvert.c do you recommend just keeping it as a separate executable, for example lvconvert-mod?

Or is there a way (and do you recommend) actually replacing the built in system binary?

teigland commented 1 year ago

Got it - thank you for catching that detail for me!

When I recompile lvconvert.c do you recommend just keeping it as a separate executable, for example lvconvert-mod?

Or is there a way (and do you recommend) actually replacing the built in system binary?

lvm commands all run from the single lvm binary, which you can usually run from the source tree without installing it on the system.

git clone git://sourceware.org/git/lvm2.git lvm-test.git
cd lvm-test.git

edit tools/command-lines.in, removing the RULE line above
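
One way to do that non-interactively is a sed one-liner like the following (a hypothetical invocation; note it removes every occurrence of that rule in the file, not only the one under lvconvert_to_writecache):

sed -i '/^RULE: all not lv_is_raid_with_integrity/d' tools/command-lines.in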

here's a set of configure options you can try based on the fedora lvm2 package, with lvmlockd and vdo removed to simplify build requirements:

./configure --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --runstatedir=/run --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-default-dm-run-dir=/run --with-default-run-dir=/run/lvm --with-default-pid-dir=/run --with-default-locking-dir=/run/lock/lvm --with-usrlibdir=/usr/lib64 --enable-fsadm --enable-write_install --with-user= --with-group= --with-device-uid=0 --with-device-gid=6 --with-device-mode=0660 --enable-pkgconfig --enable-cmdlib --enable-dmeventd --enable-blkid_wiping --with-udevdir=/usr/lib/udev/rules.d --enable-udev_sync --with-thin=internal --with-cache=internal --enable-lvmpolld --enable-dbus-service --enable-notify-dbus --enable-dmfilemapd --with-writecache=internal --with-integrity=internal --with-default-use-devices-file=1 --disable-silent-rules --enable-app-machineid --enable-editline --disable-readline

make
cd tools
./lvm lvcreate ...

timwhite1 commented 1 year ago

This is very helpful, thank you!

So to run the modified ruleset binary in the example above I would just enter the new directory and run any caching setup / change commands from there?

cd lvm-test.git/tools

lvcreate lvconvert lvchange etc...

And outside of that directory any of the same commands would run from the built in unmodified binary?

Thank you again for your help, this is a new realm for me.

teigland commented 1 year ago

Yes, you'd need to run the updated commands through the lvm binary you compile, e.g.

cd tools
./lvm lvcreate ...
./lvm lvs
./lvm lvchange ...

timwhite1 commented 1 year ago

Thank you. I will test with this method and keep you all posted.

t0fik commented 1 year ago

@teigland I'm getting about half the write performance on raid10 w/ integrity compared to raid10 w/o integrity, on four Seagate Exos 7E8 8TB (ST8000NM000A) disks. HW spec:

IMHO, disabled snapshots for raids w/ integrity might be a blocker for adoption; getting a consistent backup is more convenient if you have snapshot capability.

timwhite1 commented 1 year ago

@t0fik If you're getting even 50% write performance on an LVM integrity RAID without LVM write cache, that is actually pretty good. There is a major penalty for the integrity checks. This thread is about lowering the barrier for cache enablement on an integrity RAID volume. LVM write cache in particular makes this not just usable, but an ideal setup: no kernel module or dkms requirements, simple OS updates, works on basically every distro, and in the event of a disk failure the array still mounts automatically without user intervention.

In my experience I was often getting much less than 50% write performance which is why I opened the discussion. I'm still in the middle of dependency challenges trying to re-compile the workaround LVM2 binaries. But I hope eventually to see Integrity RAID caching become natively supported without these special workarounds.

t0fik commented 1 year ago

@timwhite1 I'm aware of the topic of this thread. @teigland wrote about the lack of feedback, so I gave some. I also wrote my opinion on why adoption of raids w/ integrity might be low, which might be the cause of the little-to-no feedback. I'm considering disabling data integrity because snapshots are not supported.

@t0fik If you're getting even 50% write performance on an LVM integrity RAID without LVM write cache, that is actually pretty good. There is a major penalty for the integrity checks. This thread is about lowering the barrier for cache enablement on an integrity RAID volume. LVM write cache in particular makes this not just usable, but an ideal setup: no kernel module or dkms requirements, simple OS updates, works on basically every distro, and in the event of a disk failure the array still mounts automatically without user intervention.

Thank you, it was a simple test of sequential writes of large files (50 GiB). You are 100% right.

In my experience I was often getting much less than 50% write performance which is why I opened the discussion. I'm still in the middle of dependency challenges trying to re-compile the workaround LVM2 binaries. But I hope eventually to see Integrity RAID caching become natively supported without these special workarounds.

I've just migrated to RAID w/ integrity, so my results from real-life scenarios might be similar.

I should have some time this week, so maybe I'll produce an rpm on COPR with patches disabling the rules for cachepool and thinpool. I'll post a link to the repo in this thread when it's done.

timwhite1 commented 1 year ago

Thank you @t0fik! Having a patched LVM2 rpm available for integrity cache config testing will be a big help to other testers as well. I'm primarily testing on Rocky Linux 9 now and can see this becoming a great new option for a variety of storage solutions.

Thank you for the notes on snapshots as well.

One workaround, if you don't mind additional new-product testing, might be to import your LVM integrity (or integrity + cache) volume into Stratis Storage. Stratis has snapshot functionality and amazing read-cache performance if you pair it with one or more SSD / NVMe volumes. (Stratis does not yet have a solution for write cache or write-back cache.) I know this adds to the complexity of the stack, and it would be more ideal to have native LVM snapshots.

Thank you for testing and helping others to test as well!

teigland commented 1 year ago

Thanks for the feedback. I recently finished the testing that was needed to enable caching on raid+integrity, and I've just pushed the change to the main branch: https://sourceware.org/git/?p=lvm2.git;a=commit;h=390ff5be2fd9dda47f2bfd2db96de64acc925002
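
With that change, the conversion that started this issue should be accepted directly, e.g. something like the following (reusing the names from the earlier example, for illustration only):

$ lvconvert --type writecache --cachevol fast test/rr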

Enabling snapshots with this is what I'm currently working on (developing the tests.)

The performance issues are not entirely surprising. One likely improvement is to use "--raidintegritymode bitmap" when adding integrity, which should be faster than the default journal mode. I need to look into making bitmap the default mode.

Another possibility is using "--raidintegrityblocksize 4096". (An existing fs on the LV may need to be recreated to use a 4k fs block size.)

It would be interesting to hear if either setting has a noticeable impact.

Also use "lvs -a -o+devices" to check that each physical device is only used by one raid image.
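
For example, combining those settings in one hypothetical invocation (an existing LV would instead drop integrity with lvconvert --raidintegrity n and re-add it with --raidintegrity y plus the new settings):

$ lvcreate --type raid1 --raidintegrity y --raidintegritymode bitmap --raidintegrityblocksize 4096 -L1G -n rr test /dev/sdb /dev/sdc
$ lvs -a -o+devices test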

timwhite1 commented 1 year ago

Thank you for the bitmap and block size tips @teigland. I will compare both in testing as well.

And by the way this is awesome (the upstream commit)!

t0fik commented 1 year ago

I was able to build packages for Fedora: https://copr.fedorainfracloud.org/coprs/tofik/lvm2/ Sadly the RHEL9 chroot on COPR does not work due to problems with registration. The spec file with patches is here: https://gitlab.com/tofik-rpms/lvm2 Helper scripts to build the rpms are here: https://gitlab.com/tofik-rpms/helpers

timwhite1 commented 1 year ago

@t0fik Thank you for your work on this! This will be very helpful. So it looks like for now testing on Fedora is all we have easily available.

@teigland If we build rpms on Fedora 34 does it seem like those might be able to run on RHEL 9?

t0fik commented 1 year ago

@timwhite1 The RPM builds for RHEL/Rocky/Alma 9, but not in the vanilla Fedora COPR chroot. I was able to build packages locally inside a Rocky Linux 9 container with the same spec file.

With some shenanigans (attaching devel repositories from Rocky) to provide build dependencies to the RHEL-9 chroot, I was able to build packages for el9 systems. They are available in the same COPR.

timwhite1 commented 1 year ago

That's great @t0fik thank you for your work on this!