hakavlad / nohang

A sophisticated low memory handler for Linux
MIT License

Interactions with MGLRU #122

Open · LifeIsStrange opened this issue 2 years ago

LifeIsStrange commented 2 years ago

Hi @hakavlad, I have always much appreciated your various works on improving memory usage, system responsiveness (latency), and throughput. That is why I was wondering what your thoughts and possible concerns are about the MGLRU work? https://www.phoronix.com/scan.php?page=news_item&px=MGLRU-July-2022-Performance

As you can see from this slide: https://www.phoronix.net/image.php?id=2022&image=mglru_v6 Are all of their claims true? I doubt they compared their solution against combinations of your userspace solutions. Also, I was wondering whether you'd be interested in developing or contributing to kernel-space solutions to the memory problem? Maybe you have feedback for them, or competing ideas? Most importantly, I was wondering about the interactions, and whether MGLRU will make nohang and your other solutions (which ones?) obsolete on newer kernels with MGLRU. Overall this seems like a great development and should improve the situation on the Linux desktop :) (as unfortunately distros were too mediocre to integrate your solutions for the mainstream)

Note: crazy that they claim no known regressions https://openbenchmarking.org/embed.php?i=2206299-NE-MGLRUBENC78&sha=82868b37d91bc50bf73b05b09e80a54e5877c702&p=2

hakavlad commented 2 years ago

Hi @LifeIsStrange Thanks for the questions! You have asked many good questions, and a detailed answer to all of them would require writing an entire article.

LifeIsStrange commented 2 years ago

Well, feel free to either: 1) not spend the energy to answer, as you do not owe me anything (and I actually owe you for many avoided complete system freezes induced by IntelliJ IDEA :)), or 2) give a quick and approximate/non-perfect answer, or 3) write an entire article, LOL. Either way I wish you all the best!

hakavlad commented 2 years ago

I will try to give responses gradually, as far as my energy allows.

LifeIsStrange commented 2 years ago

Any time, no worries, it's no big deal :) If you knew the number of conversations I ghosted on GitHub over the years 😅 (out of my own laziness or by simply forgetting about them :))

hakavlad commented 2 years ago

I was wondering what your thoughts and possible concerns are about the MGLRU work?

MGLRU is a revolutionary change. Revolutionary, because it changes the basics: the basics of vmscan and VM shrinking.

I can confirm that MGLRU improves performance under memory pressure.

I've seen this in at least one of my experiments: I filled a list with random numbers until about half (or more) of the swap space (swap on zram) was full, then overwrote the elements of the list with new values [1]. This is a CPU-bound task that runs under memory pressure. In such swap-intensive and energy-intensive tasks I indeed saw 15-30% less energy consumption by kswapd0 and a corresponding speedup of the task [2][3]. I could not achieve a similar increase in performance with the classic LRU; the le9 patch and userspace tools do not achieve such a result on the same tasks. I think this is the unique and most valuable effect that MGLRU provides. All other things being equal, MGLRU lets you squeeze the maximum out of the hardware. That is why MGLRU can be considered a more promising basis than the classic LRU.
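Roughly, the load looked like this (a minimal Python sketch of the idea, not the actual mem-load script [1]; the 50% threshold and chunk size are illustrative):

```python
import random

def swap_used_fraction():
    """Read the fraction of swap in use from /proc/meminfo (Linux-specific;
    assumes swap, e.g. zram, is actually configured)."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":")
            info[key] = int(value.split()[0])          # values are reported in kB
    total = info.get("SwapTotal", 0)
    return 0.0 if total == 0 else 1.0 - info["SwapFree"] / total

data = []
while swap_used_fraction() < 0.5:                      # grow until ~half of swap is used
    data.extend(random.random() for _ in range(1_000_000))

for i in range(len(data)):                             # CPU-bound pass under memory pressure
    data[i] = random.random()
```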

To be continued (I have not announced the concerns yet).

  1. https://github.com/hakavlad/nohang-extra/blob/master/mem-load/mem-load (I'm too lazy to make this script production-ready)
  2. https://www.linux.org.ru/forum/general/16321096?cid=16663844 Here I reported the results of testing MGLRU; it is a whole thread.
  3. I did not report this result on LKML, because the script is not production-ready and I am generally too lazy.

LifeIsStrange commented 2 years ago

Thanks, great answer :) Reduced energy consumption is very meaningful, especially if this gets used on Android.

BTW, this Russian forum looks very interesting, thanks for sharing! This Google-translated comment is interesting:

However, this does not change the fact that there are loads with which MGLRU cannot achieve the result achievable with the old LRU. These are cache-bound loads under memory shortage, where maximum performance can be achieved with swappiness=200 on the old LRU. But, as is known in narrow circles, swappiness handling in MGLRU is different and looks completely broken (with MGLRU, swappiness changes have little effect on how well the cache is retained when memory is short).

@hakavlad A cache is a performance-critical algorithm that is relatively simple to change and has a tremendous impact. It's unfortunate that the choice here is either LRU or MGLRU. The Java library Caffeine is the #1 state-of-the-art general-purpose cache because of its innovative algorithm: https://github.com/ben-manes/caffeine/wiki/Efficiency The algorithm is called W-TinyLFU https://github.com/ben-manes/caffeine/wiki/Efficiency#wikipedia and has disruptively better efficiency https://github.com/ben-manes/caffeine/wiki/Efficiency#wikipedia Someone might or might not add the concept of generations (as in MGLRU) to a W-TinyLFU variant, maybe? Anyway, not benchmarking MGLRU against W-TinyLFU seems to me like a huge missed opportunity for performance and efficiency. If you agree, I'd appreciate it if you shared my comment (in Russian?) in reply to that Russian comment and called on those people to discuss it or contribute an implementation on LKML? The difference with MGLRU might be very significant and have real-world impact. I won't do it myself as I have never contributed to linux.

http://highscalability.com/blog/2016/1/25/design-of-a-modern-cache.html https://arxiv.org/abs/1512.00727v2

I'm pretty sure someone could very easily make a C implementation out of this C++ implementation https://github.com/mandreyel/w-tinylfu/blob/master/wtinylfu.hpp and then adapt the C port's API for use in the kernel. That someone could be you :)
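For anyone unfamiliar, the core idea of W-TinyLFU can be sketched in a few dozen lines (a toy Python illustration of the concept, not Caffeine's or wtinylfu.hpp's actual implementation; the real designs pack 4-bit counters and use a segmented-LRU main region):

```python
import hashlib
from collections import OrderedDict

class FrequencySketch:
    """Count-min-style frequency sketch with saturating counters and
    periodic halving ("aging"), as in the TinyLFU paper. Sizes here are
    illustrative, not tuned."""
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]
        self.additions, self.sample_size = 0, 10 * width

    def _slots(self, key):
        for i in range(self.depth):
            h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield i, int.from_bytes(h, "little") % self.width

    def record(self, key):
        for i, j in self._slots(key):
            self.table[i][j] = min(self.table[i][j] + 1, 15)
        self.additions += 1
        if self.additions >= self.sample_size:        # aging: halve all counters
            self.table = [[c // 2 for c in row] for row in self.table]
            self.additions //= 2

    def estimate(self, key):
        return min(self.table[i][j] for i, j in self._slots(key))


class WTinyLFU:
    """Toy W-TinyLFU: a tiny LRU "window" in front of a larger LRU "main"
    region; the frequency sketch arbitrates admission from window to main."""
    def __init__(self, capacity=100, window_pct=0.01):
        self.window = OrderedDict()                   # recency-biased front
        self.main = OrderedDict()                     # frequency-biased body
        self.window_cap = max(1, int(capacity * window_pct))
        self.main_cap = capacity - self.window_cap
        self.sketch = FrequencySketch()

    def get(self, key):
        self.sketch.record(key)
        for region in (self.window, self.main):
            if key in region:
                region.move_to_end(key)               # mark as recently used
                return region[key]
        return None

    def put(self, key, value):
        self.sketch.record(key)
        if key in self.main:                          # already admitted: update in place
            self.main[key] = value
            self.main.move_to_end(key)
            return
        self.window[key] = value
        self.window.move_to_end(key)
        if len(self.window) <= self.window_cap:
            return
        cand_key, cand_val = self.window.popitem(last=False)   # window victim
        if len(self.main) < self.main_cap:
            self.main[cand_key] = cand_val
            return
        victim_key = next(iter(self.main))            # main region's LRU victim
        # TinyLFU admission: keep whichever key the sketch says is hotter
        if self.sketch.estimate(cand_key) > self.sketch.estimate(victim_key):
            del self.main[victim_key]
            self.main[cand_key] = cand_val
```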

LifeIsStrange commented 2 years ago

@mandreyel friendly ping: if you think my last comment makes sense, you might be able to make an extremely significant contribution to the Linux kernel, one that would increase Linux throughput under memory/swap pressure and simultaneously reduce energy consumption for billions of users, in theory by:

1. merely porting your W-TinyLFU to C11, OR porting this Facebook implementation,
2. then porting it to the kernel as a replacement for its LRU or MGLRU,
3. benchmarking it, e.g. as here: https://www.linux.org.ru/forum/general/16321096?cid=16741404
4. profit.

@ben-manes sorry I couldn't resist pinging you :) cf my last comment. TL;DR this is about replacing the linux kernel LRU

ben-manes commented 2 years ago

A few years back, I looked into Linux's page cache (mm/workingset.c) and tried to simulate its policy, DClock. The policy's configuration is a guess based on the total system memory and is pretty much arbitrary, judging from discussions with the authors. They believe anyone who has paging issues can add more memory, so they only cared about and optimized for the case where the page cache is already 90%+ effective. The simulations show that there is healthy room for improvement.

The authors of MGLRU say that "Multigeneration lru is not related to mm/workingset.c. The filename might be a bit misleading because it doesn't really track resident set size (in memory), only a recent eviction history." I'm not sure what the relationship is between the two policies. The Linux code is scattered and confusing.

LifeIsStrange commented 2 years ago

Very interesting @ben-manes !

I'm not sure what the relationship is between the two policies

I unfortunately have no clue :m, and we are not alone. BTW, according to your simulation LIRS seems to perform as well as TinyLFU. Since DClock's design is inspired by LIRS, I wonder why they made DClock the default when it significantly underperforms LIRS? ARC has (had?) patent issues, but is that the case for LIRS? Anyway, Linux has become patent-proof: https://openinventionnetwork.com/

ben-manes commented 2 years ago

LIRS is an excellent algorithm, but the papers are very confusing, making it difficult to write a correct implementation. It is not a concurrent policy, but their ClockPro algorithm is. There are LWN articles on their attempts at ClockPro, but it seems a lack of a champion and code complexity led them to invent their own inspired policy. Code maintainability is more important than performance, so their choices make sense given what they were comfortable with. In none of their attempts did they seem to simulate beforehand, so code dumps from transcribing the reference material do seem very spooky.

Caffeine avoids the concurrency problems by using a lesser-known trick of replay buffers (also from the LIRS authors). That makes it much simpler to reason about and optimize the algorithms in a single-threaded context. The choice of W-TinyLFU vs LIRS was mostly a toss-up where maintainability made the choice: LIRS was very confusing to write/debug and harder to fix its deficiencies, whereas improving TinyLFU appeared to be straightforward engineering iteration. (Those problems were later fixed by the authors in LIRS2.)
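Very roughly, the replay-buffer idea looks like this (a hedged Python sketch with hypothetical names, not Caffeine's actual classes): readers record accesses into a bounded buffer instead of mutating the shared policy, and a single drainer replays them under a try-lock.

```python
import threading
from collections import deque

class ReplayBufferedCachePolicy:
    """Hypothetical wrapper: readers append accesses to a bounded buffer;
    whoever fills the buffer tries to drain it under a try-lock and replays
    the accesses against the eviction policy single-threaded."""
    def __init__(self, policy, buffer_limit=128):
        self.policy = policy                 # any single-threaded policy with a .touch(key) method (assumed here)
        self.buffer = deque()                # pending access events
        self.buffer_limit = buffer_limit
        self.lock = threading.Lock()         # guards the policy, not the hot read path

    def record_access(self, key):
        self.buffer.append(key)              # cheap: no policy mutation here
        if len(self.buffer) >= self.buffer_limit:
            self.drain()

    def drain(self):
        # Only one thread replays at a time; the others simply skip and keep
        # going, which is why a lossy buffer is acceptable for read events.
        if self.lock.acquire(blocking=False):
            try:
                while self.buffer:
                    self.policy.touch(self.buffer.popleft())
            finally:
                self.lock.release()
```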

The Java simulator has my reimplementations, and the various papers are fun to read. A lot of design choices come down to what the author is comfortable supporting, so if the algorithm is too magical or poorly described then it won't get much attention. Caffeine's approaches are easy to explain, implement, and debug, so it has gotten more adoption.

LifeIsStrange commented 2 years ago

Great comment, as always. Note: feel free to ignore my comment, it's mostly off-topic stuff that I'm interested in, lol. Oh, I'd never heard of LIRS2! I wonder if it can outperform TinyLFU/Caffeine.

off-topic: digression

The code maintainability is more important than its performance

Well, I have to somewhat disagree here. Generally yes, but it shouldn't be a false dilemma: the Linux kernel should receive enough funding/human resources to maintain a SOTA cache implementation (it doesn't even have to be too complex, as Caffeine shows, but it could and should be afforded). This scarcity of engineering talent (very few engineers read academic papers like you do, or are fluent in CPU intrinsics/SIMD) is, as said, artificial. It is a cultural issue and a hiring issue. All the GAFAs depend on the critical hot paths of the kernel, yet the Linux Foundation has a $177 million budget of which only a mere 3% is allocated to the kernel. So, as said, it is an artificial, accidental scarcity that affects end users, but this phenomenon happens everywhere (major advances in NLP are ignored thanks to NIH syndrome; in medicine, potent treatments for many diseases, including the aging process, have been found since the 90s and keep being ignored, etc.).

off-topic: digression. I wonder what you think about how neural-network caches/learned data structures compare to Caffeine.

giga-off-topic: Hope y'all have seen the groundbreaking James Webb pictures :)

ben-manes commented 2 years ago

You can try to read the paper and source code. The paper wasn't detailed enough, and I did not try to port their C++ code to the Java simulator. It would be nice to have that, though, as there should be useful insights to learn from it.

At the time I ran a few of my tests through it, after using the rewriter utility to convert formats. That led to their last commit, as it was originally unusable due to slow pruning. From the email thread with the authors, it matched DS1 and it sounded like I found it competitive in most workloads. Unfortunately it failed on my adaptive stress test. They claimed to have a workload where Caffeine underperformed, but they were bound by an NDA and the holders of the referenced data refused to grant me access.

The ML-based policies are very slow, at best measuring per-hit operations in microseconds rather than a few nanoseconds. They do not test against a wide range of data, do not compare against modern non-ML policies, and have high resource overhead. In general they need to be trained, which makes them brittle as the application evolves. Given the complexity, the lack of code, and the targeting of a non-engineering community, I don't have much of an opinion on their success. In that HN thread I mentioned the RC-Cache paper, which is the closest to TinyLFU and seems promising. In that case I think it would be an NN showing that one can make an admission policy smarter (e.g. latency-aware), and then a simpler algorithmic solution could be found that would likely outperform it. To me, ML caches are a good way to find out whether the computer can brute-force its way to a smarter algorithm and then let engineers design one, but they are not what engineers would deploy directly.

hakavlad commented 2 years ago

call on those people to discuss it or contribute an implementation on LKML?

I can call @yuzhaogoogle.

@yuzhaogoogle What do you think about the possibility of replacing LRU with W-TinyLFU on Linux?

I won't do it myself as I have never contributed to linux.

@LifeIsStrange It's easy to ask any questions at linux-mm and LKML. Feel free to do it!

I'm pretty sure someone could very easily make a C implementation out of this C++ implementation and then adapt the C port's API for use in the kernel. That someone could be you :)

I'm not a programmer. I would like to propose this mission to @yuzhaogoogle.

Probably Yu Zhao can explain why this is a bad or hard to implement idea.

yuzhaogoogle commented 2 years ago

Note: crazy that they claim no known regressions https://openbenchmarking.org/embed.php?i=2206299-NE-MGLRUBENC78&sha=82868b37d91bc50bf73b05b09e80a54e5877c702&p=2

It's not crazy: we set up a fresh environment just to rerun those CPU-intensive benchmarks with the latest kernel/patchset and found no actual regressions.

yuzhaogoogle commented 2 years ago

@yuzhaogoogle What do you think about the possibility of replacing LRU with W-TinyLFU on Linux?

I don't want to discourage anybody, but to be honest, it's unrealistic given the amount of engineering effort required.