openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

zfs_abd_scatter_enabled: slow reads and very high CPU load #12813

Open shodanshok opened 2 years ago

shodanshok commented 2 years ago

System information

Type Version/Name
Distribution Name CentOS
Distribution Version 7.9
Kernel Version 3.10.0-1160.36.2.el7.x86_64
Architecture x86_64
OpenZFS Version 2.0.5-1

Describe the problem you're observing

On a Linux/KVM host, sequentially reading some disk image files resulted in subpar performance and very high CPU load. At the same time, setting zfs_abd_scatter_enabled=0 or zfs_prefetch_disable=1 (i.e. disabling scatter ABD or prefetch, respectively) fixed both the performance and the CPU load issues. As an example:

# default settings
# see the very high cpu load counted against dd
echo 0 > /sys/module/zfs/parameters/zfs_prefetch_disable   # default
 9945 root      20   0  124496  16808    308 R 100.0  0.1   3:19.10 dd if=fileserver_data.img of=/dev/null bs=16M status=progress

# prefetch disabled
# no more high cpu load
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable
 9945 root      20   0  124496  16848    348 D  27.9  0.1   4:32.89 dd if=fileserver_data.img of=/dev/null bs=16M status=progress

I produced some flame graphs showing a lot of activity in ZFS memory allocation/reclaim, so I tried disabling the ABD scatter list, with even better results than with only the prefetcher disabled.

Flame graphs attached: default, noprefetch, noscatter.

The described issue seems most significant on machines with fragmented memory and high uptime - this KVM host is quite busy and has KSM enabled.

Describe how to reproduce the problem

On a machine with highly fragmented memory, try reading a fragmented file with default settings (ABD scatter list and prefetch enabled). Note how the reading process is slowed down by the very high CPU load.
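
For reference, a minimal sketch of how the tunables can be toggled at runtime for the comparison above (standard ZFS module parameter paths, run as root; changing zfs_abd_scatter_enabled only affects buffers allocated after the change):

# baseline: scatter ABD and prefetch enabled (defaults)
echo 1 > /sys/module/zfs/parameters/zfs_abd_scatter_enabled
echo 0 > /sys/module/zfs/parameters/zfs_prefetch_disable
dd if=fileserver_data.img of=/dev/null bs=16M status=progress

# variant 1: prefetch disabled
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable
dd if=fileserver_data.img of=/dev/null bs=16M status=progress

# variant 2: scatter ABD disabled (linear buffers only)
echo 0 > /sys/module/zfs/parameters/zfs_abd_scatter_enabled
dd if=fileserver_data.img of=/dev/null bs=16M status=progress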

Include any warning/errors/backtraces from the system logs

Ukko-Ylijumala commented 2 years ago

KSM is probably a big factor in this equation, as it by nature fragments memory and causes lots of CoW events. If possible, please disable KSM, unshare all pages with echo 2 > /sys/kernel/mm/ksm/run, and rerun the tests.
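
(For reference, a quick sketch of the standard KSM sysfs controls, run as root:)

# stop the KSM scanner but keep already-merged pages shared
echo 0 > /sys/kernel/mm/ksm/run
# (re)start KSM scanning
echo 1 > /sys/kernel/mm/ksm/run
# stop KSM and unmerge all currently shared pages (may take a while and
# temporarily raises memory usage as shared pages are duplicated again)
echo 2 > /sys/kernel/mm/ksm/run

# sanity check: pages_sharing should drop to 0 after the unmerge
grep . /sys/kernel/mm/ksm/pages_shared /sys/kernel/mm/ksm/pages_sharing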

shodanshok commented 2 years ago

@Ukko-Ylijumala disabling KSM on that specific host is a tall order. That aside, at the read speed afforded by 2x 2-way mirrored HDDs (less than 200 MB/s), I would be quite surprised if this were expected behavior from ABD.

For now, I simply disabled scatter ABD via zfs_abd_scatter_enabled=0. What issues can I expect from not using scatter lists? Thanks.

Ukko-Ylijumala commented 2 years ago

There are so many possible interactions between all the different MM layers that you pretty much have to take KSM out of the picture if you expect to test anything relevant. Performance issues caused by KSM do exist, and I wouldn't be surprised if this ends up being one more on that list.

At its core, KSM is memory deduplication. Most of the issues that plague the ZFS dedup implementation are similarly present with KSM, but since it works at the MM layer, it affects many more areas of the system.

behlendorf commented 2 years ago

I'd just add that the provided flame graphs (thanks for those!) indicate the additional CPU time is being spent in the kernel MM layers. Quite possibly this is due to KSM being enabled. It may just be that switching to zfs_abd_scatter_enabled=0 has the effect of preventing KSM from doing any meaningful amount of memory deduplication.

shodanshok commented 2 years ago

@Ukko-Ylijumala @behlendorf well, I tried disabling KSM during off-hours and I can confirm that the issue was solved, even without un-sharing the already-shared pages. What bugs me is that, according to the KSM documentation:

KSM only merges anonymous (private) pages, never pagecache (file) pages ... KSM only operates on those areas of address space which an application has advised to be likely candidates for merging, by using the madvise(2) system call

I understand that ARC is not integrated into the kernel pagecache, but can it be excluded from KSM scanning? After all, I suppose the ARC does not madvise its buffers as mergeable. Can anything be done to let the ARC play better with KSM (which is an important feature for a virtualization host)?

Thanks!

shodanshok commented 2 years ago

@behlendorf what are the implications of running with zfs_abd_scatter_enabled=0? Will the system be less stable/reliable/fast/etc., or is it just a memory usage optimization? Thanks.

behlendorf commented 2 years ago

@shodanshok when running with zfs_abd_scatter_enabled=0 you'll generally see a higher level of memory fragmentation. This can result in the ARC using memory less efficiently for caching, depending on the workload. This can have an impact on performance but, as illustrated by the KSM discussion above, there are a lot of factors which can affect things. Changing this shouldn't affect stability or reliability.

Can anything be done to let ARC better play with KSM

That's a good question. Getting ARC memory treated by KSM more like page cache memory is something we should look into.

stale[bot] commented 1 year ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

jittygitty commented 1 year ago

@ryao In my question to you by email (and https://github.com/openzfs/zfs/discussions/14630) about what advantages we might gain from being able to use the kernel's page cache, besides general curiosity about performance or efficiency gains, I was also wondering about KSM.

I've heard that some users see tremendous RAM savings when they have a lot of similar VMs running at the same time, so they are not keen on disabling it.

And I should add, the main compelling reason I found for ZFS versus BTRFS is that, at least so far, BTRFS was useless to me since I'd need to disable all its CoW features to use VMs or databases, while ZFS says: sure, no problem, just do it!

So I concluded that BTRFS is good for Synology/NAS backup stores, almost cold storage, but no good for hosting and snapshotting many VMs/databases.

All that to say, I hope ZFS will make sure we don't lose this huge advantage, and that it will remain performant when hosting VMs, databases, etc.

snajpa commented 7 months ago

We're observing that, on a system that has been running for a few weeks, /proc/buddyinfo calms down considerably after setting zfs_abd_scatter_enabled to 0. Otherwise there is a huge turnover in the higher orders and a lot of memory requests end up in direct reclaim and compaction. The attached graphs show zfs_abd_scatter_enabled being turned off on the most impacted server, on Wednesday at midnight.
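
A quick way to watch this behaviour (a sketch; exact /proc/vmstat counter names vary a bit between kernel versions):

# per-zone free pages by order (columns are orders 0..MAX_ORDER); scarce
# high-order entries indicate fragmentation
watch -n 5 cat /proc/buddyinfo

# direct reclaim and compaction activity; counters rising quickly under load
# mean allocations are stalling on higher-order requests
grep -E 'allocstall|compact_stall|compact_fail|compact_success' /proc/vmstat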

shodanshok commented 7 months ago

@behlendorf Going back through the commit history, it seems that adding scatter ABD was mainly motivated by (lack of) kernel virtual address space. Is that still true for modern 64-bit kernels, or does it only affect 32-bit builds?

I ask because performance seems much higher when using linear buffers, without scatter. As a quick example, done on a small 1-core, 4 GB RAM Debian 12 x86-64 virtual machine, reading a 1 GB random-filled file:

# scatter enabled (zfs_abd_scatter_enabled=1, default)
# zpool export tank; zpool import tank; for i in `seq 1 5`; do dd if=/tank/test/test.img of=/dev/null bs=128K; done
8192+0 records in
8192+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.797141 s, 1.3 GB/s
8192+0 records in
8192+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.280788 s, 3.8 GB/s
8192+0 records in
8192+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.272253 s, 3.9 GB/s
8192+0 records in
8192+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.276013 s, 3.9 GB/s
8192+0 records in
8192+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.274512 s, 3.9 GB/s

# scatter disabled (zfs_abd_scatter_enabled=0)
# zpool export tank; zpool import tank; for i in `seq 1 5`; do dd if=/tank/test/test.img of=/dev/null bs=128K; done
8192+0 records in
8192+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.782439 s, 1.4 GB/s
8192+0 records in
8192+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.182341 s, 5.9 GB/s
8192+0 records in
8192+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.178743 s, 6.0 GB/s
8192+0 records in
8192+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.172294 s, 6.2 GB/s
8192+0 records in
8192+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.169356 s, 6.3 GB/s

Notice how the cached read speed is about 50% higher when scatter buffers are not used. Thanks.
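
(For completeness, the live mix of scatter vs. linear buffers can be checked from the ABD kstats, a quick sketch assuming the usual /proc/spl path and field names:)

# with zfs_abd_scatter_enabled=0 new allocations should show up as linear only
grep -E 'scatter_cnt|scatter_data_size|linear_cnt|linear_data_size' /proc/spl/kstat/zfs/abdstats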

shodanshok commented 7 months ago

Possibly related to https://github.com/openzfs/zfs/issues/15385

amotin commented 7 months ago

@shodanshok When you enable linear allocation, ZFS starts sharing buffers between the ARC and the dbuf cache. This avoids one memory copy on both reads and writes, which is why you see a performance improvement, especially when your active data set is bigger than the dbuf cache but smaller than the ARC. But after that ZFS may suffer from KVA fragmentation, which, to my knowledge, hasn't gone anywhere. You have to run the system for some time under stress, using different block sizes, to experience it. I.e. it works great until it does not.
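
(A rough sketch of how to see where the working set sits relative to the two caches, assuming the default kstat locations; field names may differ slightly between releases:)

# dbuf cache occupancy and target
grep -E '^cache_size_bytes|^cache_target_bytes' /proc/spl/kstat/zfs/dbufstats
cat /sys/module/zfs/parameters/dbuf_cache_max_bytes

# overall ARC size and limit for comparison
awk '$1 == "size" || $1 == "c_max"' /proc/spl/kstat/zfs/arcstats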

shodanshok commented 7 months ago

@amotin What is not very clear to me is what happens when the system suffers from KVA fragmentation. Does it "simply" mean that the various linear buffers are allocated in fine-grained, physically discontiguous chunks, hurting performance due to missing spatial locality? Or can the stability of the system as a whole be impaired?

From the old commit history comments, it seems that large vmalloc allocations on 32-bit kernels were discouraged by the kernel community. But 64-bit kernels have so much larger a kernel virtual address space that I fail to see how this can be a real issue now - except for excessive physical fragmentation of the otherwise linearly allocated buffer.

The difference being so significant (fewer memory copies matter a lot for NVMe pools), and given the complex code required for scatter, I am tempted to run some pools with scatter disabled - but I feel I am missing something.

Thanks.

amotin commented 7 months ago

@shodanshok I am not sure exactly how much KVA Linux allocates. FreeBSD defaults to the physical memory size; Illumos/Solaris, IIRC, to double the physical memory size. On FreeBSD it is tunable, but you cannot set it arbitrarily huge, since page tables have to be allocated for that range, which costs actual memory. The last time I saw the overflow on FreeBSD it ended in a kernel panic; how exactly it ends up on Linux I am not sure. But the problem with address space fragmentation is that no matter how much KVA you allocate, there is a chance of overflowing it. For example, if you need a 16MB virtually contiguous block, it may happen that your system has some 4KB allocated after every 15.9MB, so there is simply nothing contiguous for 16MB. Scatter ABD allows the system to allocate the 16MB in multiple chunks, down to as little as 4KB granularity. The cost of that is, unfortunately, the additional memory copies.

shodanshok commented 7 months ago

@amotin I can be wrong, but from here I get that vmalloc area on a 64-bit kernel is 32 TB - which is huge. So while one can imagine a very fragmented KVA where 16 MB allocation are not possible, it seems a very remote outcome. Hence my feeling about missing something important about KVA on Linux kernels... The point about TLB usage is very valid, thanks for mentioning it.