openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Performance issues with sync=disabled + compress=zstd #16371

Open RubenKelevra opened 1 month ago

RubenKelevra commented 1 month ago

System information

Type Version/Name
Distribution Name CachyOS (Arch Linux with optimizations)
Distribution Version rolling
Kernel Version 6.9.x and 6.10.x
Architecture x86_64
OpenZFS Version 2.2.4

Describe the problem you're observing

I've dug a bit through the open issues here, but this doesn't seem to be described yet.

I run ArchLinux (in CachyOS variant) on a ZFS root on a single pool with a fast 1 TB NVMe drive below.

The computer itself is an average ultrabook from Samsung with an 11th-gen Intel processor (integrated GPU) and 16 GB of memory.

CachyOS comes pre-configured to run compressed swap in memory, and overall I don't see any memory pressure in daily operation: I've got some GBs in the ARC and still 2-3 GB free.

The system feels extremely snappy and powerful, regardless of what I'm doing.

However, this all changes if I do heavy operations on a subvolume which is configured to not sync its write operations.

The rationale behind disabling sync is to run latency-sensitive operations more like on a RAM disk than at NVMe speed.

I currently have e.g. ~/.cache and /tmp configured this way.

So applications can hand their data directly to memory and return, while the NVMe drive and the compression tasks in ZFS catch up.
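For reference, a setup like this is configured per dataset; a minimal sketch (the pool/dataset names below are hypothetical placeholders, not taken from my actual layout):

```shell
# Hypothetical pool/dataset names - adjust to your layout.
# sync=disabled acknowledges writes once they are in memory; they are
# flushed to disk later with the regular transaction group sync.
zfs set sync=disabled rpool/home/cache   # backs ~/.cache
zfs set sync=disabled rpool/tmp          # backs /tmp
zfs get sync rpool/home/cache rpool/tmp  # verify the settings
```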

However, there seems to be an issue with the total amount of work or data ZFS accepts, or tries to finish in concurrent tasks:

Once I start compiling a larger program - in this case a browser - all the compression and write operations seem to bog down the system completely. My cursor froze for a minute or two on several occasions, and the system might even run into a kernel panic.

That's quite weird, because if I set all subvolumes back to sync=standard, this issue goes away.

My suspicion is a combination of memory clogging, due to a high number of accepted write operations that still need to be finished, and the fact that ZFS appears to use 8 threads/processes on a processor with 4 physical cores to get rid of the data by compressing it in parallel tasks.

If I'm right, the fix may be to reduce the maximum compression workload, limiting it to 75% of the physical cores, and to put a tighter limit on how much data is accepted for async writes at a time.

I'm not sure how the limit for accepting async reads/writes currently works, but I think a latency target instead of an amount of data would work nicely for a large variety of systems.

Describe how to reproduce the problem

I hope this is reproducible on other systems as well, by just using ZFS as root and doing heavy write operations with sync=disabled and compression=zstd.

Include any warning/errors/backtraces from the system logs

Sadly, the kernel panics haven't produced any backtraces; ZFS seems to no longer accept new data, and there's no output on the screen.

rincebrain commented 1 month ago

My guess would be: most ultrabooks have CPUs with a very low limit on their power usage, so trying to saturate all cores makes them throttle very badly after exceeding whatever threshold they have for bursting over the power limit.

I don't think limiting it to fewer cores will help here, because it's still going to probably blow those limits and throttle for a while, if I'm right. (I believe you can ask perf to log events for doing things like this, but it's been a while since I did it, so I don't have an easy incantation offhand for you.)

The best suggestion I've got would be to not do that - if I'm right and the problem is spending too long at high CPU load, then the only workaround is limiting how much it does per unit time.

You could, one imagines, lower the dirty data limit, which would, I believe, trigger a flush no matter what sync setting you've got if it's exceeded, but it might be hard to find a good limit for that.
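Assuming the module-parameter route, lowering the dirty data cap might look like this (the 512 MiB value below is only an illustration, not a recommendation):

```shell
# Runtime change (not persistent across reboots):
echo 536870912 > /sys/module/zfs/parameters/zfs_dirty_data_max
# Check the current value:
cat /sys/module/zfs/parameters/zfs_dirty_data_max
```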

RubenKelevra commented 1 month ago

Hey @rincebrain, actually the CPU reports a clock speed above the base clock on all 8 threads if I look at atop.

Anyway, I don't think my cursor should freeze for minutes just because my CPU speed drops.

rincebrain commented 1 month ago

Ultrabooks, generally, are designed to try and avoid hitting their thermal limits, because the tiny form factor means you can only do so much so fast after you do.

I'd suggest you record what the CPU is doing, exactly, and what the performance counters say about how it's stalling, and I would suspect you'll find that it's pinging either heat or power limit counters and clocking down, possibly briefly, possibly not. But even if I'm wrong, finding out what it's doing when it stalls out that way seems like the next step.

You could also try renicing the write taskqs, if you think it's just that work scheduled in the kernel is winning over userland, but the problem with that is, all the other Linux IO stuff runs at -19, last I checked, so you might have weird priority inversion outcomes, like userland CPU usage blocking the zstd processing threads.

In any case, I've given you 3 distinct suggestions on things you can investigate/test.

gmelikov commented 1 month ago

Could you tell us the exact CPU and laptop model? Some Intels have a big+little architecture; I'm interested in whether that may have an effect here.

FWIW, on my laptop, a GPD Win Max 2 with an AMD Ryzen 7840U (8 identical Zen 4 cores, 16 threads) and 64 GB RAM, running Debian testing (6.7 kernel with ZFS 2.2.4), I use a nearly extreme config:

And I don't have any freezes or stalls ever on powersave mode. But yes, this exact laptop has pretty decent heatsink.

One year ago I used an older laptop, an HP Aero 13 with a Ryzen 5800U and 16 GB RAM; the experience was nearly the same. Both could sustain at least a consistent 20 W TDP for the CPU.

RubenKelevra commented 1 month ago

Could you tell us the exact CPU and laptop model? Some Intels have a big+little architecture; I'm interested in whether that may have an effect here.

Sure. It's not a big+little architecture:

i5-1135G7 @ 2.40GHz in Samsung Galaxy Book Flex2 5G


I did some compilation yesterday and noticed that the ARC basically drops all its memory usage, despite there still being 10 GB of memory free.

Screenshot_20240720_202244

You can also see that even without sync disabled, it runs at a load of 46, which is way too much to keep the system responsive. Most of the CPU time is spent in sys instead of usr, where the compilation is happening. This leads me to believe that ZFS is trying to do too much work at the same time.

The CPU, however, is still at 2.35 GHz of its 2.4 GHz base clock after an hour of load. So cooling isn't the issue here - I suppose.

Btw: How many threads is each of the compression processes using? zstd is multi-threaded, after all. Maybe each process is using 4 threads or so?

And I don't have any freezes or stalls ever on powersave mode. But yes, this exact laptop has pretty decent heatsink.

Yeah, but I don't think the idea is that you need 16 threads and 64 GB of memory to run a computer on ZFS without stalling/freezing.

Your system might simply have enough CPU power and memory not to run into this issue.

RubenKelevra commented 1 month ago

Alright. I found a fix.

I set these kernel parameters:

zfs.zfs_arc_min=3221225472 zfs.zfs_arc_max=8589934592
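Decoded, those two values are simply 3 GiB and 8 GiB expressed in bytes:

```shell
# zfs_arc_min = 3 GiB, zfs_arc_max = 8 GiB:
echo $((3 * 1024 * 1024 * 1024))   # prints 3221225472
echo $((8 * 1024 * 1024 * 1024))   # prints 8589934592
```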

load1 stayed around 11-12 instead of 47, the ARC stayed between 3.2 and 4.7 GB instead of dropping completely, wait stayed at a solid 0, as did iosome and iofull, while sys hovered around 60-80% in total (of 800%).

The system stayed completely usable - and mind you, I set sync=disabled, which either crashed or lagged the system before.

Compilation time went down from 1h 11min to 34m 53s.

Screenshot_20240721_161126

RubenKelevra commented 1 month ago

Hey @behlendorf, I think you did the taskq scheduling in aa9af22cdf8d16c197974c3a478d2053b3bed498 a couple of years back.

Maybe you can take a look at this again. I think there are just too many concurrent compression tasks running, which bogs down the system. Somehow this leads ZFS to believe that it needs to drop the ARC almost completely, which increases the IO further, as data is no longer cached.

The main difference between 2015 and now is probably that compression tasks are more memory-intensive with zstd, and that we don't use HDDs anymore: the CPU is the bottleneck now, not the HDD IO, which leads to the CPU tasks piling up.

Hope I analyzed that correctly.

Btw: How many threads does a single compression task with zstd create? Does it use the common multi-threading settings of zstd, which would be 4 threads here, or is it single-threaded?

amotin commented 1 month ago

For compression, ZFS has a number of z_wr_iss threads covering 80% of CPU cores, controlled by the zio_taskq_batch_pct module parameter. That is usually enough to reach sufficient speed, but at the same time not block the system completely for a few seconds, similar to what you report here.

But in the top outputs provided I instead see a bunch of z_rd_int threads, which handle checksumming and decryption on read completion. They should not handle compression/decompression, which makes me wonder if, aside from a different compression algorithm, you also changed the checksum algorithm or enabled encryption. A few years ago, before 7457b024ba2be2cf742e07239c20a1c3f3fa9c72 (which was long after the mentioned aa9af22cdf8d16c197974c3a478d2053b3bed498), too many z_rd_int threads combined with a too-expensive checksum and/or encryption could also block the system for a bit, but they are now also limited to 80% of CPU cores, so they can block the system only in combination with something else.

amotin commented 1 month ago

So if you think the problem is CPU saturation, then instead of some unmotivated manipulations with ARC you should try reducing zio_taskq_batch_pct before importing the pool.
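Since zio_taskq_batch_pct is evaluated at module load / pool import time, it has to be set before then, e.g. via modprobe.d or the kernel command line. A sketch (the value 50 and the file name are just illustrations):

```shell
# Persist via modprobe.d; takes effect the next time the zfs module loads:
echo "options zfs zio_taskq_batch_pct=50" > /etc/modprobe.d/zfs-taskq.conf
# Equivalent kernel command line form: zfs.zio_taskq_batch_pct=50
```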

RubenKelevra commented 1 month ago

@amotin wrote:

instead of some unmotivated manipulations with ARC

I fail to see how it's unmotivated. The ARC usage dropped to just 430 MB, as you can see in the atop screenshot provided, despite 10.6 GB of memory being available.

I first wanted to see how the system behaves after fixing the obvious issue of ZFS not managing the ARC size properly. But I agree, the different limits of the arc size only mask the real issue, which is task saturation by trying to do too much stuff at the same time.

@amotin wrote:

covering 80% of CPU cores

Physical cores or logical cores?

@amotin wrote:

They should not handle compression/decompression, and it makes me wonder if aside of different compression algorithm you also changed checksum algorithm or enabled encryption.

All my subvolumes use encryption=off and checksum=on.

@amotin wrote:

But in the top outputs provided I see instead a bunch of z_rd_int threads, which handle checksuming and decryption on read completion.

Correct me if I'm wrong, but I think these are processes, not threads.

amotin commented 1 month ago

@RubenKelevra wrote:

But I agree, the different limits of the arc size only mask the real issue, which is task saturation by trying to do too much stuff at the same time.

ARC sizing on Linux is a huge can of worms on its own. If you expect strong unexpected memory pressure on your system and need the ARC to cooperate, assuming the kernel versions you use are sane in reporting the pressure (see https://lore.kernel.org/all/20240711191957.939105-2-yuzhao@google.com/T/#u), I'd recommend setting zfs_arc_shrinker_limit=0 and possibly applying https://github.com/openzfs/zfs/pull/16197 .
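For completeness, that shrinker setting can be applied like any other module parameter (sketch; the modprobe.d file name is arbitrary):

```shell
# zfs_arc_shrinker_limit=0 removes the per-invocation cap on how much
# the kernel shrinker may evict from the ARC.
echo "options zfs zfs_arc_shrinker_limit=0" > /etc/modprobe.d/zfs-arc.conf
# Runtime equivalent (not persistent):
echo 0 > /sys/module/zfs/parameters/zfs_arc_shrinker_limit
```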

covering 80% of CPU cores

Physical cores or logical cores?

Logical.

@amotin wrote:

They should not handle compression/decompression, and it makes me wonder if aside of different compression algorithm you also changed checksum algorithm or enabled encryption.

All my subvolumes use encryption=off and checksum=on.

Thinking again, I may be wrong - decompression may happen in z_rd_int when data is not speculatively prefetched but read on demand. I am just not used to seeing much load there, since lz4 is very fast on decompression (I am not sure about zstd), and the prefetcher can often do the job, though not always.

@amotin wrote:

But in the top outputs provided I see instead a bunch of z_rd_int threads, which handle checksuming and decryption on read completion.

Correct me if I'm wrong, but I think these are processes, not threads.

A process is typically a group of threads sharing the same address space. Since everything in the kernel shares the kernel address space, the distinction doesn't matter. Let's call them kernel execution entities, each consuming one CPU core.

RubenKelevra commented 1 month ago

Thinking again, I may be wrong, decompression may happen in z_rd_int when data are not speculatively prefetched, but read on demand.

I might be wrong about this, but I thought data is always decompressed on demand, as the ARC keeps only the compressed data. At least since 0.7 that should be the case, no?

I am just not used to see much load there, since lz4 is very fast on decompression, while I am not sure about zstd

Rule of thumb: decompression takes about 3 times as many CPU cycles per MB with zstd compared to lz4; memory usage is also higher.

I checked with an 8 MB file (compressed with zstd -3) and decoded it single-threaded; I end up at

2429.28 MB/s

But without turbo boost, I expect the speed to drop to half of that per thread, maybe lower.

But it should be plenty fast, given that my SSD only delivers 3.4 GB/s linear, and a lot less as the blocks get smaller.
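A quick back-of-the-envelope check of that claim, using the numbers above (2429 MB/s per thread, halved to ~1214 MB/s without turbo boost, against 3.4 GB/s of linear SSD reads):

```shell
# Ceiling division: how many ~1214 MB/s decompression threads are needed
# to keep up with a 3400 MB/s SSD?
echo $(( (3400 + 1214 - 1) / 1214 ))   # prints 3
```

So even in the pessimistic case, about 3 of the 8 hardware threads would suffice for decompression alone.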

Going off on a tangent: it's kinda sad that we can't use dictionaries with zstd; that would massively improve compression and decompression speed, far beyond what lz4 does, while improving the compression ratio even further. :)

A process is typically a group of threads sharing the same address space. Since everything in the kernel shares the kernel address space, the distinction doesn't matter. Let's call them kernel execution entities, each consuming one CPU core.

Ah okay. I was just wondering if one process might actually be more than one thread, and thus able to use more than one core.

RubenKelevra commented 1 month ago

ARC sizing on Linux is a huge can of worms on its own. If you expect strong unexpected memory pressure on your system and need ARC to cooperate, assuming the kernel versions you use are sane in reporting the pressure (see https://lore.kernel.org/all/20240711191957.939105-2-yuzhao@google.com/T/#u), I'd recommend to set zfs_arc_shrinker_limit=0 and possibly apply #16197 .

That #16197 piqued my interest. Is this using the (somewhat) new PSI interface - the one that also provides memsome and memfull in atop?

amotin commented 1 month ago

Thinking again, I may be wrong, decompression may happen in z_rd_int when data are not speculatively prefetched, but read on demand.

I might be wrong about this, but I thought data is always decompressed on demand, as the ARC keeps only the compressed data.

You are right, data is decompressed on demand. So if the speculative prefetcher issues a read, the data will first land compressed in ARC, and only some time later may be decompressed by a demanding user thread - or not. But if the read is issued by a user thread, ARC will get the compressed buffer first, immediately decompress it into a dbuf, and only then wake up the user thread waiting for the dbuf.

RJVB commented 1 month ago

Are we talking about building Chromium or Firefox? Even much smaller build jobs can create a significant I/O burden, especially if you're compiling with -g. How much better does the situation get if you use LZ4 compression (and how much does disk space usage grow)?

I observed something similar years ago on the SSHD I was using back then: setting sync=disabled on my compile work/scratch volume was counterproductive, also because ZFS was caching way too much data. Nowadays I use sync=disabled for most of my root-on-ZFS pool, except for the datasets that see regular, heavy building. Those have checksum=off, though, which supposedly removes a tiny bit of overhead.

RubenKelevra commented 1 month ago

@RJVB wrote:

Are we talking about building Chromium or Firefox?

Nah, Ladybird is the browser I'm building.

@RJVB wrote:

esp. if you're compiling with -g

Nope, I'm not:

CMAKE_BUILD_TYPE=Release

Full building instructions are here.

@RJVB wrote:

How much better does the situation get if you use LZ4 compression (and how much does disk space usage grow)?

I'm not interested in using LZ4.


Apart from that, I found that a simple rg "$string" / 2>/dev/null will cause a freeze on the system without the boot parameters.

I'm now using zfs.zfs_arc_min=3221225472 zfs.zfs_arc_max=8589934592 zfs.zfs_dirty_data_max_max=536870912 zfs.zio_taskq_batch_pct=13 as boot parameters.
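The same settings could also be persisted via modprobe.d instead of the kernel command line; a sketch of the equivalent form, with the same values (the file name is arbitrary):

```shell
# Module options applied when the zfs module is loaded:
cat > /etc/modprobe.d/zfs-tuning.conf <<'EOF'
options zfs zfs_arc_min=3221225472
options zfs zfs_arc_max=8589934592
options zfs zfs_dirty_data_max_max=536870912
options zfs zio_taskq_batch_pct=13
EOF
```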