Feature-request: Configurable size of journal for integrity

hans-helmut commented 1 month ago

Hello!

While the option --raidintegritymode promises to "improve performance for scattered writes", the journal it creates, at most 64 MB, is too small.

So only random writes of up to 64 MB in total, which is small by today's standards, are improved.

Steps to reproduce:

~# lvcreate --type raid1 --raidintegrity y -L 10G -n int vg1
  Creating integrity metadata LV int_rimage_0_imeta with size 148,00 MiB.
  Logical volume "int_rimage_0_imeta" created.
  Creating integrity metadata LV int_rimage_1_imeta with size 148,00 MiB.
  Logical volume "int_rimage_1_imeta" created.
  Logical volume "int" created.
~# mkfs.ext4 /dev/vg1/int 
mke2fs 1.47.0 (5-Feb-2023)
Creating filesystem with 2621440 4k blocks and 655360 inodes
Filesystem UUID: e36e1fd4-1d71-4c45-a4b8-e50005b00281
Superblock backups stored on blocks: 
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done 

~# mkdir /int
~# mount /dev/vg1/int /int
~# cd /int
/int# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_write.fio --bs=4k --iodepth=1 --size=32m --readwrite=randwrite
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=16507: Tue Jul 30 22:47:21 2024
  write: IOPS=15.5k, BW=60.5MiB/s (63.4MB/s)(32.0MiB/529msec); 0 zone resets
   bw (  KiB/s): min=44320, max=44320, per=71.55%, avg=44320.00, stdev= 0.00, samples=1
   iops        : min=11080, max=11080, avg=11080.00, stdev= 0.00, samples=1
[...]

/int# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_write.fio --bs=4k --iodepth=1 --size=48m --readwrite=randwrite
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [w(1)][71.4%][eta 00m:02s]                          
test: (groupid=0, jobs=1): err= 0: pid=16516: Tue Jul 30 22:47:33 2024
  write: IOPS=3108, BW=12.1MiB/s (12.7MB/s)(48.0MiB/3953msec); 0 zone resets
   bw (  KiB/s): min=    8, max=57048, per=92.08%, avg=11449.60, stdev=25490.29, samples=5
   iops        : min=    2, max=14262, avg=2862.40, stdev=6372.57, samples=5
[...]

/int# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_write.fio --bs=4k --iodepth=1 --size=64m --readwrite=randwrite
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [w(1)][60.5%][eta 00m:15s]                          
test: (groupid=0, jobs=1): err= 0: pid=16523: Tue Jul 30 22:48:04 2024
  write: IOPS=727, BW=2909KiB/s (2979kB/s)(64.0MiB/22531msec); 0 zone resets
   bw (  KiB/s): min=    8, max=62248, per=100.00%, avg=6536.84, stdev=17356.11, samples=19
   iops        : min=    2, max=15562, avg=1634.21, stdev=4339.03, samples=19
[...]
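
To put numbers on the drop, here is a small sketch that computes the slowdown of the 48m and 64m runs relative to the 32m run, using the IOPS figures fio reported above (15.5k, 3108, 727). The helper name slowdown is made up for illustration.

```shell
# IOPS values copied from the three fio runs above (15.5k, 3108, 727).
slowdown() {
    # $1 = baseline IOPS, $2 = measured IOPS; prints the ratio with one decimal
    LC_ALL=C awk -v base="$1" -v cur="$2" 'BEGIN { printf "%.1f\n", base / cur }'
}

echo "48m run: $(slowdown 15500 3108)x slower than the 32m run"
echo "64m run: $(slowdown 15500 727)x slower than the 32m run"
```

So going from a 32 MB to a 64 MB working set costs roughly a factor of 21 in IOPS on this setup.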

Please offer an option to configure and resize the size of the journal for integrity.

teigland commented 1 month ago

Thanks for pointing this out, I didn't realize that this max journal size could impose a limit on the performance. We can easily add an option to specify the size, and I'll also look into whether the default sizes still make sense.

teigland commented 1 month ago

Mikulas Patocka (dm-integrity author) has pointed out that the kernel currently has a hard coded max 64MB journal, which I then encoded in lvm. So, there's more involved to enlarge that. Have you tried bitmap mode? I've been thinking about changing the lvm default from journal to bitmap.
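
For anyone who wants to try bitmap mode, a minimal sketch, assuming the same VG and LV names as in the report above (vg1, int). The helper integrity_lvcreate_cmd is hypothetical and only prints the lvcreate command instead of running it, so it is safe to try without root.

```shell
# Print (do not run) an lvcreate command for a raid1 LV with integrity
# in the requested mode. VG/LV names and size are taken from the report above.
integrity_lvcreate_cmd() {
    mode=$1   # "journal" or "bitmap"
    case "$mode" in
        journal|bitmap) ;;
        *) echo "unknown integrity mode: $mode" >&2; return 1 ;;
    esac
    printf '%s\n' "lvcreate --type raid1 --raidintegrity y --raidintegritymode $mode -L 10G -n int vg1"
}

integrity_lvcreate_cmd bitmap
```

Run the printed command as root to create the LV, then repeat the fio tests to compare against journal mode.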

hans-helmut commented 1 month ago

Thanks for the information about the 64MB limit. I was not aware of it, as integritysetup allows a higher limit. For testing I ran it on top of LVM:

~# integritysetup status /dev/mapper/dmint
/dev/mapper/dmint is inactive.
~# lvcreate --type raid1 -L 10G -n lv_for_int vg1
  Logical volume "lv_for_int" created.
~# integritysetup format --journal-size 1073741824 /dev/vg1/lv_for_int 

WARNING!
========
This will overwrite data on /dev/vg1/lv_for_int irrevocably.

Are you sure? (Type 'yes' in capital letters): YES
Formatted with tag size 4, internal integrity crc32c.
Wiping device to initialize integrity checksum.
You can interrupt this by pressing CTRL+c (rest of not wiped device will contain invalid checksum).
Finished, time 04m25s,    8 GiB written, speed  34.5 MiB/s
root@pluto:~# 

~# integritysetup open /dev/vg1/lv_for_int dmint
~# mkfs.ext4  /dev/mapper/dmint
mke2fs 1.47.0 (5-Feb-2023)
Creating filesystem with 2341005 4k blocks and 586368 inodes
Filesystem UUID: 280686eb-ad68-42ff-a022-7ff562513722
Superblock backups stored on blocks: 
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done 

~# mount  /dev/mapper/dmint /dmint 
~# cd /dmint

The performance is very high in the first test, but it drops when the same test is repeated, probably because the journal then has to be written back:

/dmint# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_write.fio --bs=4k --iodepth=1 --size=512m --readwrite=randwrite
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=55672: Thu Aug  1 22:25:37 2024
  write: IOPS=171k, BW=668MiB/s (701MB/s)(512MiB/766msec); 0 zone resets
   bw (  KiB/s): min=688888, max=688888, per=100.00%, avg=688888.00, stdev= 0.00, samples=1
   iops        : min=172222, max=172222, avg=172222.00, stdev= 0.00, samples=1

[...]
/dmint# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_write.fio --bs=4k --iodepth=1 --size=512m --readwrite=randwrite
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [w(1)][91.0%][eta 00m:44s]
test: (groupid=0, jobs=1): err= 0: pid=55680: Thu Aug  1 22:33:05 2024
  write: IOPS=293, BW=1174KiB/s (1202kB/s)(512MiB/446484msec); 0 zone resets
   bw (  KiB/s): min=294352, max=658712, per=100.00%, avg=476532.00, stdev=257641.43, samples=2
   iops        : min=73588, max=164678, avg=119133.00, stdev=64410.36, samples=2
[...]

So it "feels" like more than 64MB are possible.

I would prefer a journal over a bitmap, because bitmap mode cannot detect errors caused by a power failure. In practice, though, the performance is better with bitmaps.

By the way, the distribution is Debian 12 with kernel 6.1.0. The raid1 consists of an HDD and an SSD, tuned with writemostly.

teigland commented 1 month ago

Yes, I misunderstood the max size issue, we just need to add an lvm option to set it.

jbaublitz commented 1 month ago

Hi @hans-helmut, can you give a bit more detail about your situation? What is the journal size that seems to be ideal in your case for performance with scattered writes? Do you know, or do you have to do more testing to figure out the right value once it's enabled in LVM?

hans-helmut commented 1 month ago

Hello @jbaublitz,

I just want to store data, e.g. my photos and contracts, for a long time. Because raid1 reads only one side, it cannot detect flipped bits (on a drive or on a data bus), and I am afraid that bad data may be copied into the backup. So I tried to add dm-integrity to the stack. I understood that journaling is slower, because it writes everything twice with a delay of 10 seconds, but unlike the bitmap it is the only way to detect write errors after a power failure.

As long as the drives are faster than the network, the slow writes are acceptable, but HDDs get very slow under random access. During testing I was puzzled by the high variation in write speed with random writes. I then found out that I had (mis-)used the journal as a linear cache for random writes. So I was thinking of increasing the journal size to absorb short bursts of high load. As the VMs running on these partitions have only a few GB of RAM, I would test with a few GB.

By the way: after a power failure, bitmap mode recalculates the checksums. Suppose one drive in a raid1 holds a block with correct data and checksum, while the other holds wrong data and no checksum. After the recalculation the wrong data has a valid checksum, and if that block is later read, modified and written back, both drives end up with bad data. So some better integration would be desirable, but it increases the complexity.

teigland commented 1 month ago

Hi, are you able to compile and test a devel version of lvm that includes the new option --raidintegrityjournalsize here? https://gitlab.com/lvmteam/lvm2/-/tree/dev-dct-integrityjournalsize

It's initially restricting the journal size to between 4 and 1024 MiB, let me know if that seems reasonable.
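
A minimal sketch of that range check, assuming the size is given as a whole number of MiB. The helper journal_size_ok is hypothetical and not taken from the lvm source.

```shell
# Hypothetical helper: accept a journal size only if it lies in the
# 4..1024 MiB range mentioned above.
journal_size_ok() {
    size_mib=$1
    case "$size_mib" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$size_mib" -ge 4 ] && [ "$size_mib" -le 1024 ]
}

for s in 2 64 1024 2048; do
    if journal_size_ok "$s"; then
        echo "$s MiB: accepted"
    else
        echo "$s MiB: rejected"
    fi
done
```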

hans-helmut commented 1 month ago

Hello @teigland, thank you very much for the patch. I tested with a 1024 MB sized journal.

It works much better now with 64 GB. I can run the test a few times without any big delay, but then it gets very slow, dropping from 10000 to 100 IOPS. It is difficult to reproduce.

My assumption is that the delayed copy from the journal to the "final" block uses up some of the I/O operations. It seems that it gets slow when a kernel thread with a name like kworker/3:0+dm-integrity-writer starts. So it is not easy to get reasonable values.

teigland commented 1 month ago

Hi, thanks for testing that. I wonder if some of the other dm-integrity tunable options might be useful, like journal_watermark and commit_time: https://docs.kernel.org/admin-guide/device-mapper/dm-integrity.html

I'm nearly finished with a patch to let lvm configure those as well, and I'll post a link to it. I decided to change the command line interface to use --integritysettings key=val , so the --raidintegrityjournalsize will become --integritysettings journal_size= etc.
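
A rough sketch of what the key=val form could look like on the command line. It only prints the option and does not call lvm. The helper integrity_setting_arg is hypothetical, the accepted keys are just the three named in this thread (journal_size, journal_watermark, commit_time), and the key names and units in a released lvm may differ, so check lvmraid(7) before relying on them.

```shell
# Hypothetical helper: turn one key=val pair into an --integritysettings
# option. Only the keys named in this thread are accepted here.
integrity_setting_arg() {
    kv=$1
    key=${kv%%=*}
    case "$kv" in
        *=?*) ;;   # needs a non-empty value after the "="
        *) echo "not a key=val pair: $kv" >&2; return 1 ;;
    esac
    case "$key" in
        journal_size|journal_watermark|commit_time) ;;
        *) echo "unknown key: $key" >&2; return 1 ;;
    esac
    printf '%s %s\n' '--integritysettings' "$kv"
}

# The kernel documentation linked above gives commit_time in milliseconds
# and journal_watermark as a percentage.
integrity_setting_arg commit_time=5000
integrity_setting_arg journal_watermark=25
```

The printed option would be appended to an lvcreate command that also passes --raidintegrity y.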

teigland commented 1 month ago

Here's a devel patch adding --integritysettings that can be used to set several kernel tunables for integrity. https://gitlab.com/lvmteam/lvm2/-/tree/dev-dct-integritysettings

teigland commented 1 month ago

pushed to main branch https://gitlab.com/lvmteam/lvm2/-/commit/78d14a805c3133c9a633a61c7751a81ebfae4d99

lvmteam / lvm2

Feature-request: Configurable size of journal for integrity #151