ublue-os / bazzite

Bazzite is a cloud native image built upon Fedora Atomic Desktops that brings the best of Linux gaming to all of your devices - including your favorite handheld.
https://bazzite.gg
Apache License 2.0

Zram can be configured more optimally by using lz4 instead of zstd1 #1570

Open ahydronous opened 2 months ago

ahydronous commented 2 months ago

Describe the bug

Zram is currently configured to use zstd1, which is suboptimal

What did you expect to happen?

I've spent an inordinate amount of time optimizing zram on my system.

Benchmarks on zstd-1 vs lz4

https://www.reddit.com/r/Fedora/comments/mzun99/new_zram_tuning_benchmarks/

An explanation of vm.swappiness

https://docs.kernel.org/admin-guide/sysctl/vm.html#swappiness
https://stackoverflow.com/questions/72544562/what-is-vm-swappiness-a-percentage-of

Overcommitting memory (zram being bigger than RAM size) is good

https://issuetracker.google.com/issues/227605780

Gathered wisdom

IOPS benchmark on Samsung 970 EVO Plus 1TB

- lz4:      2 030 000 (!)
- zstd1:    820 000
- 970 EVO:  15 300

Compression ratios on mixed data

- lz4:    2.1
- zstd1:  2.9

This is very relevant for the Deck: below 12GB of RAM is roughly where, in a lot of scenarios, the extra effective memory from zstd1 starts to outweigh the latency benefits of lz4. Valve probably has a lot of profiled data, but as far as I've been able to tell, even the heaviest games don't go much over 4GB of VRAM.
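
To make the memory side of that trade-off concrete, here is a rough sketch using the compression ratios listed above. It assumes a hypothetical 4 GiB of physical RAM ends up holding compressed pages, and it ignores zram metadata overhead and incompressible pages, so treat the numbers as an upper bound rather than a measurement.

```python
# Ratios taken from the "Compression ratios on mixed data" list above.
RATIOS = {"lz4": 2.1, "zstd1": 2.9}


def uncompressed_capacity_gib(ram_for_zram_gib: float, ratio: float) -> float:
    """Uncompressed data that fits when this much physical RAM holds compressed pages.

    Simplification: assumes every page compresses at the average ratio.
    """
    return ram_for_zram_gib * ratio


# 4.0 GiB is an illustrative figure, not a Bazzite default.
for name, ratio in RATIOS.items():
    stored = uncompressed_capacity_gib(4.0, ratio)
    print(f"{name}: 4.0 GiB of RAM holds about {stored:.1f} GiB of swapped pages "
          f"(net gain {stored - 4.0:.1f} GiB)")
```

Under these assumptions zstd1 frees roughly 3 GiB more than lz4 for the same RAM spent, which is the kind of gap that matters more on a low-memory device.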

Swappiness

Swappiness can be derived via formula. On kernel.org, they state

For example, if the random IO against the swap device is on average 2x faster than IO from the filesystem, swappiness should be 133 (x + 2x = 200, 2x = 133.33).

You can reduce that to yx = 200 - x, i.e. x = 200 / (1 + y), where x is the swappiness and y is the filesystem-to-swap IO ratio. With the 970 EVO Plus as the example again, we have the aforementioned read IOPS values. 970 EVO vs. lz4 = 15 300 / 2 030 000 ≈ 0.008, so 0.008 is our ratio. We plug that in, x = 200 / 1.008 ≈ 198.4, and we get vm.swappiness=198.
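
The same arithmetic can be written as a small helper. This is only a sketch of the formula above: the function name is made up for illustration, the result is truncated to an integer, and it is clamped to the 0 to 200 range that the kernel documentation gives for vm.swappiness.

```python
def swappiness_from_iops(fs_iops: float, swap_iops: float) -> int:
    """Solve x + (fs_iops / swap_iops) * x = 200 for x and truncate to an integer."""
    if fs_iops <= 0 or swap_iops <= 0:
        raise ValueError("IOPS values must be positive")
    ratio = fs_iops / swap_iops          # filesystem-to-swap IO ratio (y)
    value = 200 / (1 + ratio)            # x = 200 / (1 + y)
    return max(0, min(200, int(value)))  # keep within the documented range


# kernel.org example: swap is 2x faster than the filesystem.
print(swappiness_from_iops(1, 2))               # 133
# 970 EVO Plus vs. lz4 zram, IOPS values from the benchmark above.
print(swappiness_from_iops(15_300, 2_030_000))  # 198
```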

Page clusters

These are logarithmic: vm.page-cluster=n makes the kernel read up to 2^n consecutive pages from swap in one attempt. With zram, you get noticeable latency improvements with 1 page, i.e. vm.page-cluster=0
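
As a quick illustration of the logarithmic scale, the sketch below lists how many pages each setting reads per swap-in. The 4 KiB page size is an assumption (it is the usual size on x86-64), and the helper name is made up for this example.

```python
PAGE_KIB = 4  # assumption: 4 KiB pages, as on typical x86-64 systems


def pages_per_swap_in(page_cluster: int) -> int:
    """vm.page-cluster is a base-2 logarithm of the swap readahead size in pages."""
    if page_cluster < 0:
        raise ValueError("page_cluster must be non-negative")
    return 1 << page_cluster


for n in range(4):
    pages = pages_per_swap_in(n)
    print(f"vm.page-cluster={n}: {pages} page(s), {pages * PAGE_KIB} KiB per swap-in")
```

The kernel default is 3, i.e. 8 pages, which was chosen with seek-bound disks in mind; zram has no seek cost, so reading a single page avoids decompressing pages that may never be used.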

Writeback device (backing swap partition)

https://www.kernel.org/doc/html/v5.9/admin-guide/blockdev/zram.html#writeback
Remember how I mentioned still needing a swapfile? Here is where it gets slightly more convoluted.

Extra

There is also secondary algorithm recompression, although I have not yet tried this out and it is only in the newer kernels. https://www.kernel.org/doc/html/latest/admin-guide/blockdev/zram.html#recompression

Output of rpm-ostree status

No response

Hardware

No response

Extra information or context

No response

KyleGospo commented 2 months ago

Thanks! Will be digging into this more, but for now: https://github.com/ublue-os/bazzite/commit/5ef67b4290e1bf083fad7beba959b31909c411c7

ahydronous commented 2 months ago

This should help a lot with understanding and tweaking various Virtual Memory settings @KyleGospo : https://gist.github.com/ahydronous/7ceaa00df96ef99131600edd4c2c73f2

fiftydinar commented 2 months ago

Question

What is (more) preferred?

My answer

Focus on lower latency without regression in bandwidth efficiency.

What are the best configuration values?

That generally depends on each PC configuration & usage scenario.

With the current approach, we cannot satisfy every usage scenario & PC configuration, because custom values are statically written only once during boot.

Examples

More ZRAM swappiness is desirable in heavy usage scenarios (bandwidth efficiency), while in light-to-medium usage scenarios you want less ZRAM swappiness (lower latency).

ZSTD ZRAM compression is desirable for low-RAM configurations (bandwidth efficiency), while with sufficient RAM you want LZ4 (lower latency).

etc, feel free to show more examples.
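
One way to picture the RAM-dependent choice is a boot-time helper that reads MemTotal and picks the algorithm. This is a sketch only: the function names are hypothetical, and the 12 GiB threshold is an assumption borrowed loosely from the "<12GB" remark in the opening post, not a value Bazzite uses.

```python
def mem_total_kib(meminfo_text: str) -> int:
    """Extract MemTotal (in KiB) from the contents of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def pick_zram_algorithm(total_kib: int, threshold_gib: float = 12.0) -> str:
    """Prefer zstd (better ratio) on low-RAM machines, lz4 (lower latency) otherwise.

    threshold_gib is an illustrative assumption, not a tested cutoff.
    """
    threshold_kib = threshold_gib * 1024 * 1024
    return "zstd" if total_kib < threshold_kib else "lz4"


# Sample input in the /proc/meminfo format; a 16 GiB machine.
SAMPLE = "MemTotal:       16777216 kB\nMemFree:         1234567 kB\n"
print(pick_zram_algorithm(mem_total_kib(SAMPLE)))  # lz4
```

On a real system the text would come from reading /proc/meminfo. This still only decides once at boot, so it addresses the PC-configuration half of the problem, not the changing usage scenario.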

Implementation

I looked through @ahydronous's gist & I applied all values from there (except swappiness, where I use 180) to my custom image.

Here's how that looks:

Memory tweaks: https://github.com/fiftydinar/gidro-os/blob/b172d940c85cfa7a988010e2598281138674d290/files/0-system/usr/bin/memory-tweaks-gidro

https://github.com/fiftydinar/gidro-os/blob/b172d940c85cfa7a988010e2598281138674d290/files/systemd/system/memory-tweaks-gidro.service

Dirty centisecs: https://github.com/fiftydinar/gidro-os/blob/b172d940c85cfa7a988010e2598281138674d290/files/0-system/usr/bin/dirty-centisecs

https://github.com/fiftydinar/gidro-os/blob/b172d940c85cfa7a988010e2598281138674d290/files/systemd/system/dirty-centisecs.service

You can notice that MaxPerfWiz tries to adjust some dynamic memory values to be as ideal as possible for all configurations, like

This can be improved further.

Tuned can also dynamically change sysctl values depending on some scenarios, so that can also possibly work well.

EPOCHvoyager commented 1 month ago

As Bazzite is now shipping with TuneD, I took the liberty of creating a configuration for it with the values linked above, based on the TuneD balanced profile, for testing purposes. Do note that these values are specific to this hardware setup: 16GB of RAM and a CPU with 16 threads.

/etc/tuned/profiles/balanced-tweaked/tuned.conf

#
# tuned configuration
#

[main]
summary = General non-specialized tuned profile with memory tweaks
include = balanced

[sysctl]
# Values taken from:
# https://gist.github.com/ahydronous/7ceaa00df96ef99131600edd4c2c73f2
# NOTE: Certain values are omitted due to already being Bazzite defaults.
vm.dirty_background_bytes = 209715200
vm.dirty_bytes = 419430400
vm.vfs_cache_pressure = 66
vm.dirty_expire_centisecs = 500
vm.dirty_writeback_centisecs = 250

# Convention says KiB of RAM * 0.01
# From MPW: RAM / num_of_logical_cores * 0.058
# 16777216 KiB / 16 threads * 0.058 = 60817
vm.min_free_kbytes = 60817
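
The two rules named in the comments can be checked with a few lines of Python. This is a sketch of the arithmetic only; the function names are made up, and the 0.058 factor is taken as-is from the "From MPW" comment in the profile.

```python
def min_free_kbytes_convention(ram_kib: int) -> int:
    """Conventional rule from the profile comment: 1% of RAM, in KiB."""
    return int(ram_kib * 0.01)


def min_free_kbytes_mpw(ram_kib: int, logical_cores: int) -> int:
    """MPW rule from the profile comment: RAM / logical cores * 0.058."""
    if logical_cores <= 0:
        raise ValueError("logical_cores must be positive")
    return int(ram_kib / logical_cores * 0.058)


RAM_KIB = 16 * 1024 * 1024  # 16 GiB expressed in KiB

print(min_free_kbytes_convention(RAM_KIB))  # 167772
print(min_free_kbytes_mpw(RAM_KIB, 16))     # 60817
print(min_free_kbytes_mpw(RAM_KIB, 8))      # 121634
```

The value 60817 set in the profile is what the MPW rule gives for 16 GiB of RAM and 16 logical cores; with 8 cores the same rule gives 121634.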

tduck973564 commented 1 month ago

Would it be possible to use a TuneD profile to adapt things like the algorithm and swappiness based on system specs?

EPOCHvoyager commented 1 month ago

> Would it be possible to use a TuneD profile to adapt things like the algorithm and swappiness based on system specs?

I believe the way dynamic tuning works in TuneD might be coded into it directly; I can't seem to find any documentation on configuring it. From what I gather, it's mainly meant for things like changing the CPU governor, i.e. settings that change with system load.