Thanks for reaching out.
I corrected the two obvious mistakes.
For the other points you mentioned, I need to study and think more.
Throughput benchmarks
b.SetParallelism(32)
I set it intentionally and think it's reasonable, because I'd like to simulate a highly concurrent web handler.
Memory usage benchmarks
General questions:
Oh, I think I got your point a bit now.
Q: Why choose to implement a fixed-size cache without active/background eviction? A: I'm a fan of the fastcache philosophy, so I'd like to build a generics-based alternative to fastcache with a simple LRU strategy.
Throughput benchmarks
1 Maybe, but in the end the results of such runs don't inspire much confidence, especially since phuslu/lru already loses to go-freelru in GitHub Actions, and the numbers in those benchmarks are simply huge for operations like these.
2 We'd all like to test this, but the bottom line is that you can't run more goroutines in parallel than the number of physically available CPU cores. So you end up measuring the speed of the scheduler, even though you actually want to see the speed of the caches.
4 (and 5) I don't think I explained it well. Let's go through it again. First, you will never insert an element with i > threshold. Threshold = cachesize / 10, and the cache is filled with all i < cachesize / 2. The end result is that all set operations are updates and half of the get operations are cache misses, which is pretty sad considering that cache misses are cheaper than cache hits, and on updates phuslu/lru simply doesn't spend time maintaining the expiration policy. The second thing I've run into is that a distribution of unique keys reflects real-world behavior rather poorly. In real life the workload usually at least remotely resembles a zipf distribution with hot and cold elements (see the sketch below). In your benchmarks, because of the unique key distribution, there is no difference between using sync.Mutex and sync.RWMutex, because the fast path is always taken. Anyway, the key question here is: why are you testing on unique keys? I'm genuinely curious, because I can't find a good explanation. There is also a problem with fastrand: in the first iteration you can get i = 16 and in the second i = 999_999 and so on, which prevents the CPU/compiler from exploiting the full potential of your hardware, but okay.
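For illustration, here is a minimal sketch of what I mean by a zipf-shaped key stream, using the standard library's rand.NewZipf (the parameters and sizes are arbitrary, not taken from either benchmark suite):

```go
package bench

import (
	"math/rand"
	"strconv"
	"testing"
)

// BenchmarkZipfGet is a hypothetical sketch: keys follow a zipf distribution,
// so a small set of "hot" keys dominates the traffic, which is much closer to
// real-world workloads than uniformly random (unique) keys.
func BenchmarkZipfGet(b *testing.B) {
	const cachesize = 1 << 20

	b.RunParallel(func(pb *testing.PB) {
		// Per-goroutine generator to avoid contention inside math/rand.
		r := rand.New(rand.NewSource(rand.Int63()))
		zipf := rand.NewZipf(r, 1.1, 1.0, cachesize-1) // s = 1.1, v = 1.0

		for pb.Next() {
			key := strconv.FormatUint(zipf.Uint64(), 10)
			_ = key // cache.Get(key) for the implementation under test goes here
		}
	})
}
```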
Memory benchmarks
2 Yes, it's an advantage, no one is even against that advantage🙃. The only problem is that in your benchmarks keys take up more memory than values, which benefits phuslu/lru and otter, but in real life values usually take up more space than keys.
3 Ok, although that's pretty sad considering facebook and twitter's statements about expiration policies. Oh and memory usage comparisons don't look fair then.
General questions
Perhaps my questions seem picky, but I really want to make better comparisons and learn something new. I've given up trying to make the memory usage comparisons honest, but I believe we can still make the throughput benchmarks and the hit ratio simulation honest, although the rounding of cache sizes makes that difficult too.
Throughput benchmarks
1 GitHub Actions is cheap (it triggers on each push) and visible, so I'd like to keep it.
2 I changed the concurrency to 8 (the same as the CPU count).
4 (and 5) Understood. I made a quick/temporary fix by calling fastrand twice (see the sketch below), and will try to simplify this case soon. As for why I chose the "unique key test" (I call it random read/write): it's easier to implement and I think it's fair to the major Go LRU caches. I agree that it's not fair/honest to RTO (Ristretto/Theine/Otter); let me add a zipf benchmark later.
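Roughly, the temporary fix looks like this (a sketch only: sync.Map stands in for the cache under test and math/rand stands in for fastrandn; the real benchmark code differs):

```go
package bench

import (
	"math/rand"
	"strconv"
	"sync"
	"testing"
)

// Hypothetical sketch of the "call fastrand twice" fix: one draw picks the
// key and a second, independent draw decides whether the iteration is a
// write, so writes are no longer confined to the low key range selected by
// `i <= threshold`.
func BenchmarkReadWriteMix(b *testing.B) {
	const (
		cachesize  = 1 << 20
		writeratio = 10 // percentage of operations that are writes
	)
	keys := make([]string, cachesize)
	for i := range keys {
		keys[i] = strconv.Itoa(i)
	}
	var cache sync.Map // stand-in for the cache under test
	b.ResetTimer()

	b.RunParallel(func(pb *testing.PB) {
		r := rand.New(rand.NewSource(rand.Int63()))
		for pb.Next() {
			i := r.Intn(cachesize)        // draw #1: which key
			if r.Intn(100) < writeratio { // draw #2: read or write
				cache.Store(keys[i], i)
			} else {
				cache.Load(keys[i])
			}
		}
	})
}
```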
Memory benchmarks
2 Agreed, but I think I followed the common test styles/cases on the "market".
3 Emmm, this does set up a restricted scenario for phuslu/lru, but in the end it's everyone's own choice.
General questions
Many of the questions you raised are very insightful and give me things to learn and study. One of the motivations for spending so much time tweaking/revising this implementation is that I'd like to show that the traditional implementation (shards + LRU) still has its value if we implement it carefully.
The initial motivation is here: https://twitter.com/phuslu/status/1743236300977922502 and during the implementation I was greatly influenced by the fastcache philosophy.
I checked one of your benchmark results here https://github.com/maypok86/benchmarks/blob/main/ratio/ratio.txt and the Ristretto results are very abnormal. Any reason/root cause?
Throughput benchmarks
1 This was actually the main reason I opened the issue at all. Unfortunately, GitHub Actions doesn't look cheap, at least because of the huge latencies in the benchmarks. Besides, I suspect you were given far fewer than 8 CPU cores, because those numbers are very weak for server CPUs.
2 It is not required and is even harmful, as you can see just by opening the SetParallelism documentation (see the sketch after this list):
SetParallelism sets the number of goroutines used by RunParallel to p*GOMAXPROCS. There is usually no need to call SetParallelism for CPU-bound benchmarks. If p is less than 1, this call will have no effect.
4 (and 5) Yes, such a comparison makes sense in principle for LRU implementations, but here it is the amount of work each implementation does that comes to the fore, and in the end everything is decided by the number of features that slow an implementation down but give users more guarantees and convenience. This is quite controversial, but okay, let's consider it a tradeoff. Although as a user I would then like to learn from the README about the lack of an expiration policy, for example, because for TTL it is usually implied. To be honest, I don't really care about RTO performance on such benchmarks; challenging them was not part of my motivation for this issue, especially since on 8 free cores their performance is much better. I was rather interested to know why people choose an unrealistic distribution that hides problems.
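Back to point 2: for a CPU-bound cache benchmark a plain RunParallel driven by the -cpu flag is enough; a minimal sketch:

```go
package bench

import "testing"

// Minimal sketch: drop b.SetParallelism and let `go test -cpu=4,8,16,32`
// control the parallelism. RunParallel already starts GOMAXPROCS goroutines,
// so the benchmark measures the cache rather than the scheduler.
func BenchmarkGetParallel(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			// cache.Get(key) for the implementation under test goes here.
		}
	})
}
```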
I checked one of your benchmark results here https://github.com/maypok86/benchmarks/blob/main/ratio/ratio.txt and the Ristretto results are very abnormal. Any reason/root cause?
This is a fairly well-known issue, and it was the main reason for creating otter. There are already several issues about ristretto's abysmal results on its own hit ratio benchmarks (first and second), and unfortunately the authors do not respond to them in any way. If you don't want to run the big benchmarks, you can just run this one. Even on such a simple zipf distribution ristretto's results are abysmal.
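If it helps, a sanity check can be as small as this (a sketch against ristretto's public API; the capacity, zipf parameters and cost-per-item are my assumptions, not the settings from the benchmarks repository):

```go
package main

import (
	"fmt"
	"math/rand"

	"github.com/dgraph-io/ristretto"
)

// Rough hit-ratio check for ristretto on a zipf key stream.
func main() {
	const capacity = 10_000

	cache, err := ristretto.NewCache(&ristretto.Config{
		NumCounters: 10 * capacity, // ristretto recommends ~10x the item count
		MaxCost:     capacity,      // every item is given cost 1 below
		BufferItems: 64,
	})
	if err != nil {
		panic(err)
	}

	r := rand.New(rand.NewSource(1))
	zipf := rand.NewZipf(r, 1.1, 1.0, capacity*100)

	var hits, misses int
	for i := 0; i < 1_000_000; i++ {
		key := zipf.Uint64()
		if _, ok := cache.Get(key); ok {
			hits++
		} else {
			misses++
			// Sets are buffered asynchronously; for a stricter measurement
			// one could call cache.Wait() after each Set.
			cache.Set(key, key, 1)
		}
	}
	fmt.Printf("hit ratio: %.2f%%\n", 100*float64(hits)/float64(hits+misses))
}
```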
Memory benchmarks
2 and 3 A little controversial, but basically okay.
General questions
P.S. It would be cool if you tried running the hit ratio tests for ristretto. Maybe the author of theine and I have both developed schizophrenia 😂, but I have never been able to reproduce ristretto's hit ratio numbers. Meanwhile, the hit ratio simulators in ristretto, theine and otter show about the same results, and they correlate quite well with the simulators from caffeine and libcachesim.
General questions
For ristretto, yeah, I will try to verify its hit rate. That's alarming for me and our internal projects.
4 (and 5) I'm studying and rewriting the zipf cases based on your benchmark suite.
Actually, it doesn't have to be. The only thing I was trying to point out is that in such benchmarks the amount of work done under sync.Mutex makes a big difference, and you won't even notice a difference compared to sync.RWMutex, for example. Also, it is worth stating in the README that there is no expiration policy, because specifying a TTL usually implies that items with an expired lifetime will be automatically removed from the cache by some algorithm. I was very surprised not to find this in phuslu/lru and go-freelru, and I think it might surprise users too. And I wanted to know the motivation for using a random key distribution, but you explained it, so it's ok.
Anyway, I decided to splurge and run benchmarks on a server processor.
Used configuration: AMD EPYC 32 cores (3.4GHz), 32 GB RAM.
Golang version
root@v2310093:~# go version
go version go1.21.6 linux/amd64
Benchmarks were run with the following command:
writeratio=10 go test -run='^$' -cpu=4,8,16,32 -bench . -count=5 -timeout=0 | tee bench.txt
The results are as follows.
Well, as I said before, in these benchmarks almost all implementations run at approximately the same speed and the amount of work done under the shard locks matters a lot, but as the number of available cores increases, the contention-free implementations still come out ahead despite doing more work.
Okay, I think we've pretty much covered everything, so that leaves one last question. I still don't get it: is increasing the cache size expected behavior? (This also applies to the author of go-freelru.) Because it looks very much like cheating. For example, in the hit ratio measurements phuslu/lru was run on the P8 trace with capacity = 400_000, but after the transformations inside phuslu/lru the real capacity becomes much larger. These measurements were run with GOMAXPROCS = 16, so the number of shards is 16 * 16 = 256. Now we calculate the size of each shard: 400_000 / 256 = 1562.5, and the next power of two is 2^11 = 2048, so each shard holds 2048 elements and the capacity of the whole cache is 2048 * 256 = 524_288. Exceeding the requested size by that many elements looks critical, all the more so because it makes the hit ratio simulation results unfair, and in real life it will lead to much higher memory consumption.
With capacity = 600_000, real capacity = 1_048_576.
And on the DS1 trace, where the capacity is always more than a million, things get even worse: for example, with capacity = 3_000_000 the real size is 4_194_304, and so on.
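The arithmetic above can be reproduced in a few lines (a back-of-the-envelope sketch, not the library's actual sizing code):

```go
package main

import (
	"fmt"
	"math/bits"
)

// realCapacity mirrors the calculation in the comment above: the requested
// capacity is split across the shards and each shard is rounded up to the
// next power of two, so the real capacity can be far larger than requested.
// 256 shards corresponds to GOMAXPROCS = 16 here.
func realCapacity(requested, shards int) int {
	perShard := (requested + shards - 1) / shards // ceil(requested / shards)
	rounded := 1 << bits.Len(uint(perShard-1))    // round up to a power of two
	return rounded * shards
}

func main() {
	fmt.Println(realCapacity(400_000, 256))   // 524288
	fmt.Println(realCapacity(600_000, 256))   // 1048576
	fmt.Println(realCapacity(3_000_000, 256)) // 4194304
}
```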
Ah, I think I get it now: cheaprand was meant as an example of the remainder-of-division trick.
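If I understand it correctly, that's the multiply-shift reduction the runtime's cheaprandn uses, something like this sketch:

```go
// Lemire-style multiply-shift reduction: maps a uniformly distributed 32-bit
// value x into [0, n) with one multiplication and a shift instead of %.
// The result is not x % n, but for picking a shard from a hash it is just as
// uniform and avoids the division.
func reduce(x, n uint32) uint32 {
	return uint32((uint64(x) * uint64(n)) >> 32)
}

// hypothetical usage: shard := shards[reduce(hash, uint32(len(shards)))]
```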
I found that cheaprand does not work for my case, so I wrote a fastmod POC here https://go.dev/play/p/rCTblJTtYB1 based on bmkessler/fastdiv.
I plan to try this PoC in my shard_table.go, though I'm afraid it is slower than using mod directly.
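For reference, the core of the fastmod idea behind bmkessler/fastdiv (Lemire's trick) that I'm experimenting with looks roughly like this; a sketch, not the playground code:

```go
package shardtable

import "math/bits"

// fastMod implements Lemire's fastmod for 32-bit operands: precompute m once
// per divisor d, then every x % d costs two multiplications instead of a
// hardware division.
type fastMod struct {
	d uint32
	m uint64 // ceil(2^64 / d)
}

func newFastMod(d uint32) fastMod {
	return fastMod{d: d, m: ^uint64(0)/uint64(d) + 1}
}

func (f fastMod) mod(x uint32) uint32 {
	lo := f.m * uint64(x) // low 64 bits of m*x (wrapping multiply)
	hi, _ := bits.Mul64(lo, uint64(f.d))
	return uint32(hi) // == x % d
}
```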
I gave up trying to make the hashtable size fit the shard size (fastmod turned out slow/broken while I was implementing it). Instead I just make the LRU node list fit the needed shard size: https://github.com/phuslu/lru/commit/5c217c3dec43ea879f373d2175df0f18858ee336
After I ran the benchmark on a high-end Linux server with an Intel CPU, I found the results are more stable, as you said, and they look more objective.
But I still think continuous & checkable online results are more acceptable to users than just pasting one-shot, personal data into the README.
I gave up trying to make the hashtable size fit the shard size (fastmod turned out slow/broken while I was implementing it). Instead I just make the LRU node list fit the needed shard size: https://github.com/phuslu/lru/commit/5c217c3dec43ea879f373d2175df0f18858ee336
It can still store len(shards) more elements there, but that's already acceptable, thanks.
Oh, that has now fixed the behavior on S3 too, and the behavior is now almost exactly the same as a standard LRU (it ended up being a slightly worse version of it).
As usual, here are the results.
Many thanks for all of this (questions & guidance & tests).
Now hopefully I can claim that I tried my best to work out a correct/effective/scalable(?) LRU cache for Go that matches the repository title, "High performance LRU cache" 😄
But I still think continuous & checkable online results are more acceptable to users than just pasting one-shot, personal data into the README.
Okay, it's your decision, and it even makes sense, but it's not only about stability. It's about the fact that the distribution of results there is very different (as I already wrote here), and there is almost no gap in favor of phuslu/lru even at 4 cores on the unique key distribution. As the number of available cores increases the gap starts to disappear, and eventually at 16 cores phuslu/lru already starts to fall behind the others (there is a very small difference from ecache at >= 8 cores). And I got the same results on an M1 processor.
Hi @phuslu, very interesting work, but some things in your repository raise questions for me.
Throughput benchmarks
b.SetParallelism(32)
is also unnecessary, since you run more goroutines than the number of available cores, and the scheduler ends up spending a lot of CPU time switching between them. Otter is also configured incorrectly, but that's not critical for now. But it will be soon! 🙂
And now, if you run the original benchmarks without a runtime limit on 8 free performance cores with the command
go test -run='^$' -cpu=8 -bench . -timeout=0 -benchmem
After other fixes we get the following
Here you should make sure that all implementations use the same number of shards, but I'm too lazy to do it anymore.
i := int(fastrandn(cachesize)); i <= threshold
Do you really want to always write the same items and read the same items? Because that is exactly what you are doing.
Memory usage benchmarks
So your implementation does show good memory utilization, but after fixing the implementation and benchmarks the gap will not be so big.
General questions