phuslu / lru

High performance LRU cache
MIT License

A few questions about your implementation #4

Closed maypok86 closed 5 months ago

maypok86 commented 5 months ago

Hi @phuslu, very interesting work, but some things in your repository raise questions for me.

Throughput benchmarks

  1. No need to run benchmarks on a very busy GitHub runner, as its cores are forced to context-switch frequently.
  2. b.SetParallelism(32) is also unnecessary, since you run more goroutines than the number of available cores and the scheduler ends up spending a lot of CPU time switching between them.
  3. Ristretto is configured incorrectly, which greatly affects its performance. You specify MaxCost: 2 << 30 in its parameters while giving each item a cost of 1, so it will store up to 2 << 30 items, which is far more than one million. The correct configuration is as follows:
     cache, _ := ristretto.NewCache(&ristretto.Config{
         NumCounters: 10 * cachesize, // number of keys to track frequency of (10M).
         MaxCost:     cachesize,      // maximum cost of cache (1M).
         BufferItems: 64,             // number of keys per Get buffer.
     })

    Otter is also configured incorrectly, but it's not critical for now. But it will be soon!🙂

And now, if you run the original benchmarks without a runtime limit on 8 free performance cores with the command go test -run='^$' -cpu=8 -bench . -timeout=0 -benchmem:

goos: darwin
goarch: arm64
pkg: github.com/maypok86/otter/bench
BenchmarkCloudflareGetSet-8     19329662                59.74 ns/op           16 B/op          1 allocs/op
BenchmarkEcacheGetSet-8         30167113                42.04 ns/op            2 B/op          0 allocs/op
BenchmarkLxzanGetSet-8          19655550                64.18 ns/op            0 B/op          0 allocs/op
BenchmarkFreelruGetSet-8        16393722                76.60 ns/op            0 B/op          0 allocs/op
BenchmarkRistrettoGetSet-8      13675732                91.52 ns/op           29 B/op          1 allocs/op
BenchmarkTheineGetSet-8          7859980               154.2 ns/op             0 B/op          0 allocs/op
BenchmarkOtterGetSet-8          19729608                70.02 ns/op            8 B/op          0 allocs/op
BenchmarkPhusluGetSet-8         23397532                52.11 ns/op            0 B/op          0 allocs/op
PASS
ok      github.com/maypok86/otter/bench 17.342s

After the other fixes we get the following:

goos: darwin
goarch: arm64
pkg: github.com/maypok86/otter/bench
BenchmarkCloudflareGetSet-8     18841335                61.23 ns/op           16 B/op          1 allocs/op
BenchmarkEcacheGetSet-8         28833619                39.66 ns/op            2 B/op          0 allocs/op
BenchmarkLxzanGetSet-8          24512806                50.89 ns/op            0 B/op          0 allocs/op
BenchmarkFreelruGetSet-8        14905558                79.83 ns/op            0 B/op          0 allocs/op
BenchmarkRistrettoGetSet-8      22861460                52.04 ns/op           28 B/op          1 allocs/op
BenchmarkTheineGetSet-8          7006892               171.7 ns/op             0 B/op          0 allocs/op
BenchmarkOtterGetSet-8          30895054                47.92 ns/op            7 B/op          0 allocs/op
BenchmarkPhusluGetSet-8         26106796                47.16 ns/op            0 B/op          0 allocs/op
PASS
ok      github.com/maypok86/otter/bench 17.371s

Here you should also make sure that all implementations use the same number of shards, but I'm too lazy to do that as well.

  1. In the benchmarks you use the construct i := int(fastrandn(cachesize)); i <= threshold. Do you really want to always write the same items and read the same items? Because that is exactly what you are doing.
  2. Next are general questions about the key distribution used in the benchmarks. The main one is why a unique-key distribution is used: it is completely unrealistic and spreads contention across shards, which hides implementation problems. And why is the cache only half full? Do you really want to measure the speed of cache misses, which are worthless compared to cache hits? You can read more here. And if you run more CPU-cache-aware benchmarks, the use of sync.Mutex will start to have an impact.
    goos: darwin
    goarch: arm64
    pkg: github.com/maypok86/otter/perf
    BenchmarkCache/Zipf_Theine_reads=100%,writes=0%-8               11608960               107.7 ns/op         9283558 ops/s               0 B/op          0 allocs/op
    BenchmarkCache/Zipf_Ristretto_reads=100%,writes=0%-8            48676160                25.32 ns/op       39493565 ops/s              16 B/op          1 allocs/op
    BenchmarkCache/Zipf_Phuslu_reads=100%,writes=0%-8               32948250                34.86 ns/op       28687550 ops/s               0 B/op          0 allocs/op
    BenchmarkCache/Zipf_Otter_reads=100%,writes=0%-8                287808117                4.196 ns/op     238324807 ops/s               0 B/op          0 allocs/op
    BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-8               10141712               106.3 ns/op         9405869 ops/s               0 B/op          0 allocs/op
    BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-8            27029410                46.52 ns/op       21498335 ops/s              27 B/op          1 allocs/op
    BenchmarkCache/Zipf_Phuslu_reads=75%,writes=25%-8               33284792                33.89 ns/op       29508730 ops/s               0 B/op          0 allocs/op
    BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-8                190370718                6.527 ns/op     153221398 ops/s               2 B/op          0 allocs/op
    BenchmarkCache/Zipf_Theine_reads=50%,writes=50%-8               13183764                88.23 ns/op       11334143 ops/s               0 B/op          0 allocs/op
    BenchmarkCache/Zipf_Ristretto_reads=50%,writes=50%-8            22789605                52.02 ns/op       19224501 ops/s              34 B/op          1 allocs/op
    BenchmarkCache/Zipf_Phuslu_reads=50%,writes=50%-8               32335388                34.77 ns/op       28761246 ops/s               0 B/op          0 allocs/op
    BenchmarkCache/Zipf_Otter_reads=50%,writes=50%-8                106489442               12.22 ns/op       81857267 ops/s               6 B/op          0 allocs/op
    BenchmarkCache/Zipf_Theine_reads=25%,writes=75%-8               13120154                92.32 ns/op       10832209 ops/s               0 B/op          0 allocs/op
    BenchmarkCache/Zipf_Ristretto_reads=25%,writes=75%-8            14231132                70.28 ns/op       14228276 ops/s              60 B/op          1 allocs/op
    BenchmarkCache/Zipf_Phuslu_reads=25%,writes=75%-8               32800737                33.55 ns/op       29807980 ops/s               0 B/op          0 allocs/op
    BenchmarkCache/Zipf_Otter_reads=25%,writes=75%-8                51792225                23.23 ns/op       43054563 ops/s              12 B/op          0 allocs/op
    BenchmarkCache/Zipf_Theine_reads=0%,writes=100%-8               31050308                41.97 ns/op       23827702 ops/s               0 B/op          0 allocs/op
    BenchmarkCache/Zipf_Ristretto_reads=0%,writes=100%-8            13311984               128.6 ns/op         7775102 ops/s             122 B/op          3 allocs/op
    BenchmarkCache/Zipf_Phuslu_reads=0%,writes=100%-8               33116732                34.05 ns/op       29370837 ops/s               0 B/op          0 allocs/op
    BenchmarkCache/Zipf_Otter_reads=0%,writes=100%-8                10602829               120.2 ns/op         8316114 ops/s              64 B/op          1 allocs/op
    PASS
    ok      github.com/maypok86/otter/perf  30.320s

Memory usage benchmarks

  1. Ristretto is misconfigured again.
  2. Apparently you're not storing the key twice, unlike other implementations, so using values that take up less space than the keys doesn't look very fair (it saves about 15 MB).
  3. Also, I didn't find any data structure for deleting expired items. For these benchmarks this saves about 16 bytes per item, or about 15 MB in total, and in real life it will also lead to much higher memory usage than the other implementations. Oh, and not being able to disable the clock goroutine looks sad.
  4. Comparing memory usage with ristretto, theine and otter doesn't look particularly fair, because their hash tables grow many times larger due to cost-based eviction support.
  5. Many implementations use a few extra bytes to increase the hit ratio. For example, theine and otter should use about 8-16 extra bytes per element, but unfortunately this becomes 16-24 bytes due to alignment issues😔.

So your implementation does show good memory utilization, but after fixing the implementation and benchmarks the gap will not be so big.

General questions

  1. Do you understand why your implementation is faster than ristretto? This is important because cutting the eviction policy is a very dangerous operation that will end badly in a large number of cases.
  2. I tried to simulate the hit ratio of this implementation and found some unpleasant surprises. This one looks very questionable, because in real life you will store many more items than you need. Also, here are the results of the hit ratio simulation. Almost everywhere too many items are used, but the most confusing thing is the performance on S3; are there definitely no bugs?
phuslu commented 5 months ago

Thanks for reaching out.

I corrected two obvious mistakes

  1. ristretto config
  2. shards count

For the other points you mentioned, I need to do more study and thinking.

phuslu commented 5 months ago

Throughput benchmarks

  1. Although the GitHub Actions results are not very stable, I think they are trustworthy. We can check all the historical results at https://github.com/phuslu/lru/actions/workflows/benchmark.yml ; they show that the benchmark variance is not big and the performance ranking is stable.
  2. I set b.SetParallelism(32) intentionally and think it's reasonable, because I'd like to simulate a highly concurrent web handler.
  3. Corrected.
  4. You mean I need to call fastrandn(cachesize) again to get another index for reading/writing?
  5. Half full is the initial state. And I suppose/assume the benchmark result will be the same at different fullness levels.
phuslu commented 5 months ago

Memory usage benchmarks

  1. Corrected.
  2. I think this is a phuslu/lru advantage, because (I think) the benchmark code on its surface does not show favoritism toward it.
  3. Yes, but this is a choice/tradeoff in each implementation. (E.g. if I implemented a resizable table & list, the get/set performance would take a hit.)
  4. Similar to 3.
  5. Similar to 3.
phuslu commented 5 months ago

General questions:

  1. I didn't quite get your point. Do you mean that the lack of active/background eviction is not acceptable?
  2. Yes, it's a known issue; I plan to try to optimize the overuse with this trick: https://github.com/golang/go/blob/f719d5cffdb8298eff7a5ef533fe95290e8c869c/src/runtime/rand.go#L223 . For the clock goroutine, I think I can add another atomic uint32 as a stop flag (a rough sketch follows this list), thanks! And as for bugs, I can't say it's bug-free, but this implementation is really simple; code review & audit are welcome.
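
A rough sketch of the stop-flag idea (illustrative only, not the actual phuslu/lru code):

    import (
        "sync/atomic"
        "time"
    )

    // A coarse clock goroutine caches the current time and can be stopped by
    // flipping an atomic flag, so it no longer runs for the process lifetime.
    type clock struct {
        now     atomic.Int64  // cached unix nanoseconds, refreshed periodically
        stopped atomic.Uint32 // set to 1 to stop the refresher goroutine
    }

    func (c *clock) start(interval time.Duration) {
        go func() {
            for c.stopped.Load() == 0 {
                c.now.Store(time.Now().UnixNano())
                time.Sleep(interval)
            }
        }()
    }

    func (c *clock) Stop() { c.stopped.Store(1) }
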
phuslu commented 5 months ago

Oh, I think I partly got your point.

Q: Why choose to implement a fixed-size cache without active/background eviction? A: I'm a fan of the fastcache philosophy, so I'd like to have a "generics" alternative to fastcache with a simple LRU strategy.

maypok86 commented 5 months ago

Throughput benchmarks

1 Maybe, but in the end the results of such runs don't inspire much confidence, especially since even now in GitHub Actions phuslu/lru loses to go-freelru, and the per-operation numbers in those benchmarks are just huge for operations this cheap.

2 We'd all like to test that, but the bottom line is that you physically cannot execute more goroutines in parallel than the number of available CPU cores. So you end up measuring the speed of the scheduler, when what you actually want to see is the speed of the caches.

4 (and 5) I don't think I explained it well. Let's go through it again. First, you will never insert an element with i > threshold. threshold = cachesize / 10, while the cache is pre-filled with all i < cachesize / 2. The end result is that all set operations are updates and half of the get operations are cache misses, which is pretty sad considering that cache misses are cheaper than cache hits, and on updates phuslu/lru simply doesn't spend time maintaining the expiration policy. The second thing I've run into is that using a distribution of unique keys for benchmarks reflects real-world behavior pretty badly. Usually in real life the workload at least remotely resembles a zipf distribution, with hot and cold elements. In your benchmarks, because of the unique-key distribution, there is no difference between using sync.Mutex and sync.RWMutex, because the fast path is always taken. Anyway, the key question here is "Why are you trying to test on unique keys?". Really very interesting, because I can't find any decent explanation. There is also a problem with fastrand: in the first iteration you can get i = 16 and in the second i = 999_999 and so on, which prevents the CPU/compiler from utilizing the full potential of your hardware, but okay.
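
To make the idea concrete, here is a minimal sketch (not code from either benchmark suite; newCache, its Get/Set signatures and the cachesize constant are placeholders) of what a zipf-keyed benchmark could look like:

    package bench

    import (
        "math/rand"
        "strconv"
        "testing"
        "time"
    )

    const cachesize = 1 << 20 // placeholder capacity

    func BenchmarkZipfGetSet(b *testing.B) {
        cache := newCache(cachesize) // placeholder constructor for the cache under test
        b.RunParallel(func(pb *testing.PB) {
            // Each goroutine gets its own generator; zipf skews accesses toward
            // a small set of hot keys, unlike a uniform unique-key distribution.
            r := rand.New(rand.NewSource(time.Now().UnixNano()))
            z := rand.NewZipf(r, 1.0001, 10, cachesize-1)
            for pb.Next() {
                k := strconv.Itoa(int(z.Uint64()))
                if r.Intn(100) < 25 { // assumed 25% writes
                    cache.Set(k, 1)
                } else {
                    cache.Get(k)
                }
            }
        })
    }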

maypok86 commented 5 months ago

Memory benchmarks

2 Yes, it's an advantage, no one is even against that advantage🙃. The only problem is that in your benchmarks keys take up more memory than values, which benefits phuslu/lru and otter, but in real life values usually take up more space than keys.

3 Ok, although that's pretty sad considering Facebook's and Twitter's statements about expiration policies. Oh, and memory usage comparisons don't look fair then.

maypok86 commented 5 months ago

General questions

  1. No, what I meant was that under an attack, or with a cache capacity comparable to the number of shards, it would result in a very small hit ratio. And that will cost a lot of system time.
  2. To be honest, I wouldn't want to try to find the bug in the implementation. I'll just say that on S3 the behavior is very strange: the hit ratio didn't grow for a very long time and then skyrocketed threefold, ending up above the optimal value for this trace and cache capacity. With high probability this indicates that there are too many elements in the cache. Why the hit ratio didn't grow, I don't understand. I didn't really see how cheaprand would help solve this, but okay. The fastcache philosophy is great for off-heap caches, but here you have to realize that the pointerless LRU implementation is still on the heap.
maypok86 commented 5 months ago

Perhaps my questions seem picky, but I really want to make better comparisons and learn something new. I've given up trying to make the memory usage comparisons honest, but I believe we can try to make the throughput benchmarks and hit ratio simulation honest, although the rounding of cache sizes makes that difficult too.

phuslu commented 5 months ago

Throughput benchmarks

1 GitHub Actions is cheap (it triggers on each push) and visible, so I'd like to keep it.

2 I changed the concurrency to 8 (the same as the CPU count).

4 (and 5) Understood. I made a quick/temporary fix by calling fastrand twice, and will try to simplify this case soon. As for why I chose the "unique key test" (I call it random read/write): it's easier to implement and I think it's fair to the major Go LRU caches. I agree that it's not fair/honest to RTO (Ristretto/Theine/Otter); let me add a zipf benchmark later.
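
Roughly, the quick fix looks like this (a sketch only; cache, keys, cachesize and fastrandn stand for the existing bench helpers, and this is just one way to read "calling fastrand twice"):

    func BenchmarkGetSet(b *testing.B) {
        b.RunParallel(func(pb *testing.PB) {
            for pb.Next() {
                // One random draw decides the operation and a second one picks
                // the key, instead of deriving both from a single index.
                if fastrandn(10) == 0 { // ~10% writes
                    cache.Set(keys[fastrandn(cachesize)], 1)
                } else {
                    cache.Get(keys[fastrandn(cachesize)])
                }
            }
        })
    }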

phuslu commented 5 months ago

Memory benchmarks

2 Agreed, but I think I followed the common test styles/cases on the "market".

3 Emmm, this does set up a restricted scenario for phuslu/lru, but it's really each implementation's own choice.

phuslu commented 5 months ago

General questions

  1. Every shards+mutex implementation has this defect/issue (and the lower performance under zipf), but I think this is a choice/tradeoff: shards (with a mutex) are slower than RCU under a zipf distribution, but they have potential memory/GC benefits (if we implement them carefully).
  2. Although "the pointerless LRU implementation is still on the heap", thanks to being pointerless/contiguous/compact, most operations end up as a "memcpy" instead of an implicit memory allocation (see the sketch below). That's why I claim it is GC-friendly.
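
Here is a minimal sketch of what I mean by pointerless (illustrative only, not the actual phuslu/lru layout):

    // Nodes live in one flat slice and link to each other by uint32 index
    // instead of by pointer, so the GC scans a single backing array rather
    // than one pointer-laden node per entry, and relinking a node is just
    // rewriting a few indices.
    type node[K comparable, V any] struct {
        key        K
        value      V
        prev, next uint32 // slot indices into nodes, not pointers
    }

    type shard[K comparable, V any] struct {
        nodes []node[K, V] // pre-allocated, fixed capacity; slot 0 is a sentinel
        index map[K]uint32 // key -> slot
    }

    // moveToFront relinks slot i to sit right after the sentinel, with no allocation.
    func (s *shard[K, V]) moveToFront(i uint32) {
        n := &s.nodes[i]
        s.nodes[n.prev].next = n.next // unlink i from its current position
        s.nodes[n.next].prev = n.prev
        head := &s.nodes[0]
        n.prev, n.next = 0, head.next // link i right after the sentinel
        s.nodes[head.next].prev = i
        head.next = i
    }
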
phuslu commented 5 months ago

Many of the questions you raised are very insightful and give me a lot to learn and study. One of my motivations for spending so much time tweaking/revising this implementation is that I'd like to show that the traditional approach (shards + LRU) still has value if we implement it carefully.

phuslu commented 5 months ago

The initial motivation is here: https://twitter.com/phuslu/status/1743236300977922502, and during the implementation I was greatly influenced by the fastcache philosophy.

phuslu commented 5 months ago

I checked one of your benchmark results here https://github.com/maypok86/benchmarks/blob/main/ratio/ratio.txt ; the Ristretto results are very abnormal. Any reason/root cause?

maypok86 commented 5 months ago

Throughput benchmarks

1 This was actually the main reason I came here at all. Unfortunately GitHub Actions doesn't look cheap, at least because of the huge latency of the benchmarks. Especially since I suspect you were given far fewer than 8 CPU cores, because those are very weak numbers for server CPUs.

2 It is not required and is even harmful, as you can see just by opening the SetParallelism documentation: "SetParallelism sets the number of goroutines used by RunParallel to p*GOMAXPROCS. There is usually no need to call SetParallelism for CPU-bound benchmarks. If p is less than 1, this call will have no effect."

4 (and 5) Yes, such a comparison makes sense in principle for LRU implementations, but then it is the amount of work each implementation does that comes to the fore, and in the end what decides everything is the number of features that slow an implementation down but give users more guarantees and convenience. This is quite controversial, but okay, let's consider it a tradeoff. Although as a user I would then like the README to mention the lack of an expiration policy, for example, because supporting TTL usually implies one. To be honest, I don't really care about RTO performance on such benchmarks; challenging them was not part of my motivation for this issue, especially since on 8 free cores their performance is much better. I was rather interested to know why people choose an unrealistic distribution that hides problems.

maypok86 commented 5 months ago

I checked one of your benchmark results here https://github.com/maypok86/benchmarks/blob/main/ratio/ratio.txt ; the Ristretto results are very abnormal. Any reason/root cause?

This is a fairly well-known issue, which was the main reason for the creation of otter. There are already many issues about ristretto's abysmal performance on its own hit ratio benchmarks (first and second), and unfortunately the authors do not respond to them in any way. If you don't want to run big benchmarks, you can just run this one. Even on such a simple zipf distribution, ristretto's performance is abysmal.

maypok86 commented 5 months ago

Memory benchmarks

2 and 3 A little controversial, but basically okay.

General questions

  1. The problem there isn't zipf workloads; it's the RTOs spending time synchronizing a common eviction policy across all shards while using some tricks to speed up reads. My main argument here was just that you have to be careful with cutting eviction policies, as it can show up in the hit ratio. If this is known, then all is well.
  2. Makes sense. What I rather meant is that when storing strings, for example, you will no longer achieve the compactness of fastcache and will put pressure on the GC.
phuslu commented 5 months ago

Throughput benchmarks

  1. Okay, I partly agree with you, but I really need a "cheap" x86_64 server to run and visualize these benchmarks. (Background: my laptop runs Windows and my Linux server is an Oracle Cloud ARM machine.)
  2. Disabled SetParallelism.
  4. (and 5) Studying and re-writing zipf cases based on your benchmark suite
maypok86 commented 5 months ago

P.S. It would be cool if you tried running hit ratio tests for ristretto. Maybe the author of theine and I have developed schizophrenia😂, but I have never been able to replicate its hit ratio. The hit ratio simulators in ristretto, theine and otter show about the same results, and they correlate quite well with the simulators from caffeine and libcachesim.

phuslu commented 5 months ago

General questions

  1. Thank you for pointing this out. I only vaguely knew there would be a cost: we get benefits by cutting it off, so we have to pay the price somewhere.
  2. I also believe the minor throughput gap between phuslu/lru and other traditional implementations (sharded LRU) comes from the compactness.
phuslu commented 5 months ago

For ristretto, yeah, I will try to verify its hit ratio. It is alarming for me and our internal projects.

maypok86 commented 5 months ago

4 (and 5) Studying and re-writing zipf cases based on your benchmark suite

Actually, it doesn't have to be. The only thing I was trying to point out is that in such benchmarks the amount of work done under sync.Mutex makes a big difference, and you won't even notice the difference compared to sync.RWMutex, for example. And that it is worth stating in the README that there is no expiration policy, because specifying a TTL usually implies that items whose lifetime has expired will be automatically removed from the cache by some algorithm. I was very surprised not to find this in phuslu/lru and go-freelru, and I think it might be a surprise for users too. And I wanted to know the motivation for using a random key distribution, but you explained it, so it's ok.

phuslu commented 5 months ago
  1. I added a "limitation" section to the README to explain the eviction policy.
  2. Added zipf benchmark results, but I'm not sure I did it correctly.
  3. Finished the ristretto hit ratios repo (git lfs), and updated https://github.com/dgraph-io/ristretto/issues/336#issuecomment-1916049911
maypok86 commented 5 months ago

Anyway, I decided to splurge and run benchmarks on a server processor.

Used configuration: AMD EPYC 32 cores (3.4GHz), 32 GB RAM.

Golang version

root@v2310093:~# go version
go version go1.21.6 linux/amd64

Benchmarks were run with the following command: writeratio=10 go test -run='^$' -cpu=4,8,16,32 -bench . -count=5 -timeout=0 | tee bench.txt

The results are as follows.

Well, as I said before, in these benchmarks almost all implementations run at approximately the same speed and the amount of work done under the shard locks matters a great deal, but as the number of available cores increases, contention-free implementations still come out ahead despite the extra work they do.

maypok86 commented 5 months ago
  1. Thank you very much.
  2. Generating zipf values is a very heavy operation; you can see this just by opening the zipf.Uint64 code. Ideally, though, I would like to see benchmarks with thread-local generation, which would increase contention; but again, for comparing LRU implementations it is probably redundant (a rough sketch of what I mean follows this list).
  3. Yay, I'm not crazy.🙃
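
For example, a sketch of keeping zipf.Uint64 out of the measured loop (same placeholder cache/cachesize assumptions and imports as the earlier sketch): pre-generate a key table before the timer starts, so the hot loop only does cheap slice lookups while preserving the skew.

    func BenchmarkZipfPregenerated(b *testing.B) {
        // Pre-generate a shared zipf-distributed key table before the timer starts.
        r := rand.New(rand.NewSource(1))
        z := rand.NewZipf(r, 1.0001, 10, cachesize-1)
        keys := make([]string, 1<<20)
        for i := range keys {
            keys[i] = strconv.Itoa(int(z.Uint64()))
        }
        cache := newCache(cachesize) // placeholder constructor, as above
        b.ResetTimer()
        b.RunParallel(func(pb *testing.PB) {
            i := rand.Intn(len(keys)) // per-goroutine starting offset
            for pb.Next() {
                cache.Get(keys[i])
                if i++; i == len(keys) {
                    i = 0
                }
            }
        })
    }
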
maypok86 commented 5 months ago

Okay, I think we've pretty much covered everything, so that leaves one last question. I still don't get it: is increasing the cache size expected behavior? (This also applies to the author of go-freelru.) Because it seems very cheater-ish. For example, in the hit ratio measurements phuslu/lru was running on the P8 trace with capacity = 400_000, but after the transformations inside phuslu/lru this becomes much larger. These measurements were run with GOMAXPROCS = 16, so the number of shards is 16 * 16 = 256. Now we calculate the size of each shard: 400_000 / 256 = 1562.5. The next power of two is 2^11, i.e. the size of a shard is 2048, and the capacity of the whole cache is 2048 * 256 = 524_288. Exceeding the requested size by that many elements looks critical. All the more so because it makes the hit ratio simulation results unfair, and in real life it will lead to much higher memory consumption.
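
The arithmetic above as a small sketch (the shard-count rule of 16 * GOMAXPROCS and the ceiling division are my assumptions from this thread, not code from the repository):

    // effectiveCapacity estimates how many items a sharded cache really holds
    // when each shard is rounded up to the next power of two.
    func effectiveCapacity(capacity, shards int) int {
        perShard := (capacity + shards - 1) / shards // ceil(capacity / shards)
        size := 1
        for size < perShard {
            size <<= 1 // round the shard size up to a power of two
        }
        return size * shards
    }

    // With shards = 16 * GOMAXPROCS = 256:
    //   effectiveCapacity(400_000, 256)   == 524_288
    //   effectiveCapacity(600_000, 256)   == 1_048_576
    //   effectiveCapacity(3_000_000, 256) == 4_194_304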

maypok86 commented 5 months ago

With capacity = 600_000, real capacity = 1_048_576.

And on DS1 trace, where capacity is always more than a million, things get even worse. For example, when capacity = 3_000_000 the real size is 4_194_304 and so on.

maypok86 commented 5 months ago

Ah, I think I realized: cheaprand was meant as an example of the remainder-of-division trick.
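
For reference, the trick in question reduces a random 32-bit value into [0, n) with a multiply and a shift instead of a modulo; a tiny sketch:

    // fastRange maps a uniformly random 32-bit x into [0, n) without using %.
    func fastRange(x, n uint32) uint32 {
        return uint32((uint64(x) * uint64(n)) >> 32)
    }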

phuslu commented 5 months ago

I found that cheaprand does not work for my case, and I ripped a fastmod PoC here https://go.dev/play/p/rCTblJTtYB1 based on bmkessler/fastdiv.

I plan to try this PoC in my shard_table.go, though I'm afraid it is slower than using mod directly.

phuslu commented 5 months ago

I gave up trying to make the hashtable size fit the shard size (fastmod turned out slow/broken while I was implementing it). Instead I just made the LRU node list fit the needed shard size: https://github.com/phuslu/lru/commit/5c217c3dec43ea879f373d2175df0f18858ee336

  1. The real capacity of my LRU is now limited by the LRU "list", so its results in the hit ratio tests should no longer be cheating.
  2. The extra/additional space in the hashtable should be OK, because wasting a bit of []uint64 (2 x uint32) is acceptable and it does not affect/impact the hit ratio tests.
phuslu commented 5 months ago

After I ran the benchmark on a high-end Linux server with an Intel CPU, I found the results are more stable, as you said, and they look more objective.

But I still think continuous & checkable online results are more acceptable to users than just pasting one-shot & personal data into the README.

maypok86 commented 5 months ago

I gave up trying to make the hashtable size fit the shard size (fastmod turned out slow/broken while I was implementing it). Instead I just made the LRU node list fit the needed shard size: https://github.com/phuslu/lru/commit/5c217c3dec43ea879f373d2175df0f18858ee336

It can still store up to len(shards) more items, but that's already acceptable, thanks.

Oh, that has now fixed the behavior on S3 too, and the behavior is almost exactly the same as a standard LRU (it ended up being a slightly worse version of it).

As usual, here are the results.

phuslu commented 5 months ago

Many thanks for all of these (questions & guidance & tests).

Now hopefully I can claim that I tried my best to work out a correct/effective/scalable(?) LRU cache for Go, one that matches the repository title, "High performance LRU cache" 😄

maypok86 commented 5 months ago

But I still think continuous & checkable online results are more acceptable to users than just pasting one-shot & personal data into the README.

Okay, that's your call and it even makes sense, but it's not really about stability. It's about the fact that the distribution of results there is very different (as I have already written here), and there is almost no gap in favor of phuslu/lru even at 4 cores on the unique-key distribution. As the number of available cores increases, the gap starts to disappear, and eventually at 16 cores phuslu/lru already starts to fall behind the others (there is a very small difference with ecache at >= 8 cores). And I got the same results on an M1 processor.