Thanks for pointing this out, I didn't realize that this max journal size could impose a limit on the performance. We can easily add an option to specify the size, and I'll also look into whether the default sizes still make sense.
Mikulas Patocka (dm-integrity author) has pointed out that the kernel currently has a hard coded max 64MB journal, which I then encoded in lvm. So, there's more involved to enlarge that. Have you tried bitmap mode? I've been thinking about changing the lvm default from journal to bitmap.
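For reference, bitmap mode is just another value of the existing mode option; a minimal sketch, untested here, with vg1/lv_bitmap as example names (see lvmraid(7) for the supported procedures):
~# lvcreate --type raid1 -L 10G -n lv_bitmap vg1 --raidintegrity y --raidintegritymode bitmap
An existing integrity layer can presumably be switched by removing and re-adding it:
~# lvconvert --raidintegrity n vg1/lv_bitmap
~# lvconvert --raidintegrity y --raidintegritymode bitmap vg1/lv_bitmap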
Thanks for the information about the 64MB limit. I was not aware of it, since integritysetup allows a higher limit. For testing, I ran it on top of LVM:
~# integritysetup status /dev/mapper/dmint
/dev/mapper/dmint is inactive.
~# lvcreate --type raid1 -L 10G -n lv_for_int vg1
Logical volume "lv_for_int" created.
~# integritysetup format --journal-size 1073741824 /dev/vg1/lv_for_int
WARNING!
========
This will overwrite data on /dev/vg1/lv_for_int irrevocably.
Are you sure? (Type 'yes' in capital letters): YES
Formatted with tag size 4 and internal integrity crc32c.
Wiping device to initialize integrity checksum.
You can interrupt this by pressing CTRL+c (the not yet wiped part of the device will then contain invalid checksums).
Finished, time 04m25s, 8 GiB written, speed 34.5 MiB/s
root@pluto:~#
~# integritysetup open /dev/vg1/lv_for_int dmint
~# mkfs.ext4 /dev/mapper/dmint
mke2fs 1.47.0 (5-Feb-2023)
Creating filesystem with 2341005 4k blocks and 586368 inodes
Filesystem UUID: 280686eb-ad68-42ff-a022-7ff562513722
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632
Allocating group tables: done
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done
~# mount /dev/mapper/dmint /dmint
~# cd /dmint
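The journal size that dm-integrity actually ended up with can presumably be checked like this (whether the journal parameters show up, and under which field names, depends on the cryptsetup and kernel versions):
~# integritysetup dump /dev/vg1/lv_for_int
~# dmsetup table dmint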
The performance on the first run is very high, but it decreases when the same test is repeated, probably because the journal needs to be written back:
/dmint# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_write.fio --bs=4k --iodepth=1 --size=512m --readwrite=randwrite
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
test: (groupid=0, jobs=1): err= 0: pid=55672: Thu Aug 1 22:25:37 2024
write: IOPS=171k, BW=668MiB/s (701MB/s)(512MiB/766msec); 0 zone resets
bw ( KiB/s): min=688888, max=688888, per=100.00%, avg=688888.00, stdev= 0.00, samples=1
iops : min=172222, max=172222, avg=172222.00, stdev= 0.00, samples=1
[...]
/dmint# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_write.fio --bs=4k --iodepth=1 --size=512m --readwrite=randwrite
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [w(1)][91.0%][eta 00m:44s]
test: (groupid=0, jobs=1): err= 0: pid=55680: Thu Aug 1 22:33:05 2024
write: IOPS=293, BW=1174KiB/s (1202kB/s)(512MiB/446484msec); 0 zone resets
bw ( KiB/s): min=294352, max=658712, per=100.00%, avg=476532.00, stdev=257641.43, samples=2
iops : min=73588, max=164678, avg=119133.00, stdev=64410.36, samples=2
[...]
So it "feels" like more than 64MB are possible.
I would prefer a journal over a bitmap, because a bitmap cannot detect errors that happen during a power failure. But the performance is indeed better with bitmaps.
The distribution is Debian 12 with a kernel around 6.1.0. The raid1 consists of an HDD and an SSD, tuned with writemostly.
Yes, I misunderstood the max size issue, we just need to add an lvm option to set it.
Hi @hans-helmut, can you give a bit more detail about your situation? What is the journal size that seems to be ideal in your case for performance with scattered writes? Do you know, or do you have to do more testing to figure out the right value once it's enabled in LVM?
Hello @jbaublitz,
I just want to store data, e.g. my photos and contracts, for a longer time. Since raid1 cannot detect flipped bits (either on a drive or on a data bus), because only one side is read, I am afraid that bad data may be copied into the backup. So I tried to add dm-integrity to the stack. I understand that journaling is slower, because it writes everything twice with a delay of 10 seconds, but compared to the bitmap it is the only way to detect write errors in case of a power failure.
As long as the drives are faster than the network, the slow writes are acceptable, but with random access HDDs get very slow. During testing I was wondering about the high variation of the write speed with random writes, and I found out that I had (mis-)used the journal as a linear cache for random writes. So I was thinking of increasing the journal to absorb short bursts of high load. As the VMs running on these partitions have only a few GB of RAM, I would test with a journal of a few GB.
By the way: bitmap mode recalculates the checksums after a power failure. If one drive in a raid1 then has correct data and a correct checksum for a block, while the other has wrong data (now with a freshly recalculated, matching checksum), and that wrong block is later read, changed and written back, both drives end up with bad data. So some better integration between raid1 and dm-integrity would be desirable, but it increases the complexity.
Hi, are you able to compile and test a devel version of lvm that includes the new option --raidintegrityjournalsize here? https://gitlab.com/lvmteam/lvm2/-/tree/dev-dct-integrityjournalsize
It's initially restricting the journal size to between 4 and 1024 MiB, let me know if that seems reasonable.
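For anyone else testing, a rough sketch of how the option from that branch would be used, assuming it takes a normal LVM size argument (names are examples):
~# lvcreate --type raid1 -L 10G -n lv_int vg1 --raidintegrity y --raidintegrityjournalsize 1024M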
Hello @teigland, thank you very much for the patch. I tested with a 1024 MB sized journal.
It works much better now with 64 GB. I can run the test a few times without any big delay, but then it gets very slow, dropping from 10000 to 100 IOPS. It is difficult to reproduce. My assumption is that the delayed copy from the journal to the "final" blocks consumes part of the I/O operations. It seems that once a kernel thread named like kworker/3:0+dm-integrity-writer starts, it gets slow. So it is not easy to get reasonable values.
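To check whether the slowdown coincides with journal writeback, watching the underlying devices while the test runs should help (iostat is from the sysstat package; the kworker thread can be spotted in ps):
~# iostat -x 2
~# ps ax | grep dm-integrity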
Hi, thanks for testing that. I wonder if some of the other dm-integrity tunable options might be useful, like journal_watermark and commit_time: https://docs.kernel.org/admin-guide/device-mapper/dm-integrity.html
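Since the setup above layers integritysetup on the LV anyway, those tunables can probably already be tried there before an lvm option exists, provided the installed cryptsetup version supports the corresponding flags (see integritysetup(8); the values below are arbitrary examples):
~# integritysetup close dmint
~# integritysetup open /dev/vg1/lv_for_int dmint --journal-watermark 25 --journal-commit-time 5000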
I'm nearly finished with a patch to let lvm configure those as well; I'll post a link to it. I decided to change the command line interface to use --integritysettings key=val, so the --raidintegrityjournalsize option from the devel branch will be replaced.
Here's a devel patch adding --integritysettings, which can be used to set several kernel tunables for integrity: https://gitlab.com/lvmteam/lvm2/-/tree/dev-dct-integritysettings
pushed to main branch https://gitlab.com/lvmteam/lvm2/-/commit/78d14a805c3133c9a633a61c7751a81ebfae4d99
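A rough usage sketch based on the description above (names and values are examples; whether several key=value pairs go into one quoted string or into repeated options is whatever the man page in that build says, and the option is assumed here to work with lvchange on an existing LV as well):
~# lvcreate --type raid1 -L 10G -n lv_int vg1 --raidintegrity y --integritysettings "journal_watermark=60"
~# lvchange --integritysettings "commit_time=5000" vg1/lv_int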
Hello!
While the option --raidintegritymode promises to "improve performance for scattered writes", the journal that gets created, at most only 64 MB, is too small. So only small (from today's point of view) bursts of random writes, up to at most 64 MB, see the improvement.
Steps to reproduce:
Please offer an option to configure and resize the journal used for integrity.