rockdaboot opened 10 months ago
Thank you for your comment! @rockdaboot
I understand now. You mean that the cache size doesn't have to be the same, but the memory usage should be the same, right?
(1) I understand it as follows: higher hit rates potentially lead to unfair evaluations. I think you're right about this; a cache that holds more elements to get better hit rates typically consumes more memory :)
(2) However, I was wondering about your thoughts on the Otter cache specifically. It uses 8-10x more memory than all the other algorithms.
I thought that there is a trade-off between memory, HITRATE, and QPS.
Thank you for your comment! @rockdaboot
I understand now. You mean that the cache size doesn't have to be the same, but the memory usage should be the same, right?
For comparisons, yes.
For example, I know that some implementations (especially when using hashmaps/hashtables) tend to round the number of items stored (size or cap) up to the next power of two. (This is done to avoid costly divisions.) Now imagine you create two caches with 5000 entries. One of them allocates 5000 entries, the other one 8192. The second one will have a huge advantage. Additionally, many hashmaps/hashtables have a "load factor", I have seen .75 or .8. So if you create such a cache with N entries, internally roughly N/0.75 entries are allocated but only N are used (which reduces collisions).
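As a small illustration (hypothetical numbers, not taken from any particular cache library), both effects can be shown in a few lines of Go:

```go
package main

import "fmt"

// nextPowerOfTwo rounds n up to the next power of two, as many
// hash-table implementations do so they can replace modulo with a bit mask.
func nextPowerOfTwo(n int) int {
	p := 1
	for p < n {
		p <<= 1
	}
	return p
}

func main() {
	requested := 5000

	// Power-of-two sizing: a cache asked to hold 5000 entries
	// may really allocate room for 8192.
	fmt.Println(nextPowerOfTwo(requested)) // 8192

	// Load factor: with a load factor of 0.75, storing N entries
	// needs roughly N/0.75 allocated slots.
	loadFactor := 0.75
	fmt.Println(int(float64(requested) / loadFactor)) // 6666
}
```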
To work around this, let's view a cache as a black box with a certain size in bytes (memory usage) and ignore any internals. Now you can create all test candidates with roughly the same memory usage and compare QPS and HITRATE. With this, you eliminate one of three variables, which makes comparison easier.
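A rough sketch of what I mean by measuring the black box (my own example, not the benchmark code; a plain map stands in for the cache under test):

```go
package main

import (
	"fmt"
	"runtime"
)

// heapAlloc returns the current heap allocation after forcing a GC,
// so that garbage from previous work doesn't distort the measurement.
func heapAlloc() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

func main() {
	before := heapAlloc()

	// Stand-in for "create and fill the cache candidate".
	cache := make(map[int][16]byte, 5000)
	for i := 0; i < 5000; i++ {
		cache[i] = [16]byte{}
	}

	after := heapAlloc()
	fmt.Printf("cache memory usage: %d KiB (entries kept alive: %d)\n",
		(after-before)/1024, len(cache))
}
```

Tune each candidate's entry count until these byte budgets roughly match, and only then compare QPS and HITRATE.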
And regarding "otter": I didn't look into why it eats so much memory. It's either a bug or a design flaw. I'd ignore it for now or open an issue to ask what is going on.
Hi, I certainly don't know where the comments about otter come from, but let me tell you what I think about these benchmarks.
The benchmarks in the README don't really tell the whole truth. To achieve great scalability, otter uses sets of buffers, and unfortunately they consume memory; it also needs memory for goroutines to maintain the BP-Wrapper technique (creating goroutines could be avoided, but for now it isn't). In benchmarks with very small cache sizes, these can affect the results.
results:
itemSize=500000, workloads=7500000, cacheSize=0.10%, zipf's alpha=0.99, concurrency=1
CACHE | HITRATE | MEMORY | QPS | HITS | MISSES
-----------------+---------+---------+---------+---------+----------
otter | 47.49% | 0.68MiB | 5415162 | 3561582 | 3938418
sieve | 47.42% | 0.11MiB | 7183908 | 3556217 | 3943783
tinylfu | 47.38% | 0.10MiB | 7142857 | 3553132 | 3946868
s3-fifo | 47.16% | 0.20MiB | 3614458 | 3537110 | 3962890
slru | 46.48% | 0.11MiB | 5863956 | 3486209 | 4013791
s4lru | 46.15% | 0.11MiB | 9566327 | 3461183 | 4038817
two-queue | 45.49% | 0.16MiB | 3544423 | 3411961 | 4088039
clock | 37.34% | 0.10MiB | 6345178 | 2800767 | 4699233
lru-hashicorp | 36.59% | 0.08MiB | 6019262 | 2744181 | 4755819
lru-groupcache | 36.59% | 0.11MiB | 6019262 | 2744181 | 4755819
But otter is more focused on reducing the size of the metadata required for each item in the cache. And as the cache size increases, otter shows great results (note that otter also supports an expiration policy and a cost-based eviction policy, which also consume memory, and most caches in these benchmarks can't do that).
results:
itemSize=500000, workloads=7500000, cacheSize=1.00%, zipf's alpha=0.99, concurrency=1
CACHE | HITRATE | MEMORY | QPS | HITS | MISSES
-----------------+---------+---------+----------+---------+----------
tinylfu | 63.94% | 0.99MiB | 7708119 | 4795377 | 2704623
otter | 63.60% | 1.68MiB | 5255781 | 4769837 | 2730163
s3-fifo | 63.59% | 1.73MiB | 3930818 | 4768904 | 2731096
sieve | 63.33% | 1.00MiB | 8081897 | 4750114 | 2749886
slru | 62.89% | 1.03MiB | 6732496 | 4716703 | 2783297
s4lru | 62.68% | 1.06MiB | 10135135 | 4701098 | 2798902
two-queue | 61.99% | 1.52MiB | 4045307 | 4649213 | 2850787
clock | 56.12% | 0.86MiB | 7411067 | 4208674 | 3291326
lru-hashicorp | 55.37% | 1.02MiB | 6996269 | 4153014 | 3346986
lru-groupcache | 55.37% | 1.05MiB | 7095553 | 4153014 | 3346986
And with a larger cache size, which is also present in the benchmarks, the following happens:
results:
itemSize=500000, workloads=7500000, cacheSize=10.00%, zipf's alpha=0.99, concurrency=1
CACHE | HITRATE | MEMORY | QPS | HITS | MISSES
-----------------+---------+----------+----------+---------+----------
tinylfu | 80.43% | 10.10MiB | 8493771 | 6032209 | 1467791
otter | 79.86% | 10.97MiB | 6323777 | 5989814 | 1510186
s3-fifo | 79.84% | 18.69MiB | 4702194 | 5987895 | 1512105
sieve | 79.69% | 10.09MiB | 9590793 | 5976378 | 1523622
s4lru | 78.98% | 10.29MiB | 10259918 | 5923748 | 1576252
two-queue | 78.77% | 15.16MiB | 4826255 | 5907726 | 1592274
slru | 78.68% | 9.93MiB | 6561680 | 5901224 | 1598776
clock | 76.44% | 9.06MiB | 10000000 | 5732671 | 1767329
lru-hashicorp | 75.82% | 10.65MiB | 9090909 | 5686182 | 1813818
lru-groupcache | 75.82% | 10.67MiB | 8178844 | 5686182 | 1813818
I have already written about the calculation of hit ratio and QPS in these benchmarks on Hacker News.
@rockdaboot actually, nobody cares how many entries are allocated in the hash table, and it doesn't give any advantage, because it is the eviction policy that limits the number of items in the cache: if the size is exceeded, it deletes the selected items.
I will also note that the README in golang-fifo says that
Both S3-FIFO and SIEVE have O(n) time complexity for cache offloading, which occurs only when all objects are hit, meaning a perfect (100%) cache hit percentage.
This is not true; a hit ratio of 50% or even less is more than sufficient.
In general, the queue used as a write buffer eats more than half of the memory (with concurrency = 16, memory consumption increases to 0.98 MiB), and about 0.2 MiB is probably eaten by the read buffers that maintain high speed even under huge contention. That is more evidence that you should not mix benchmarks.
@rockdaboot actually, nobody cares how many entries are allocated in the hash table and it doesn't give any advantage
It depends on the use case. E.g. in one case we have strict memory and CPU constraints for a host agent. Internally, it uses many different cache instances (some bigger ones, lots of small ones, a mix of single-threaded and concurrent situations). So we have to find a balance between total memory usage, HITRATE and QPS. Using a cache that increases memory usage based on the number of goroutines does not sound like a good choice for this case.
In other words, generalized statements like "nobody cares" IMO don't help.
So how can we improve the benchmarks? You mentioned one thing, separating benchmarks for different metrics. That sounds good to me. Additionally, I'd like to suggest not using a flat table for the results, as the information density is low and it is hard for a human to compare. A diagram with one curve per cache implementation would improve that. It would still need different types of diagrams to give the reader some value (e.g. X: concurrency and Y: QPS; X: nItems and Y: total memory usage; X: memory usage and Y: HITRATE; etc.).
Besides the input parameters that we already have, we should also allow different key and value types. The value creation time and memory overhead also play a role with regard to QPS (which is a function of HITRATE).
Sounds complex, but a generalized benchmark is unfair to either one or the other cache implementation.
@rockdaboot It really depends on what exactly you need; in all likelihood a regular LRU or ARC cache will suffice.
There are a few things that need to be defined right away.
I will also note one extremely non-obvious thing, which took me a very long time to realize. Running throughput benchmarks on an incompletely filled cache is always a road to nowhere, because such benchmarks penalize caches with a high hit ratio: a cache miss always spends much less CPU time than a cache hit, so you can easily draw the wrong conclusion from the results. Running benchmarks on some other trace leads to similar problems, because there are traces on which one eviction policy will have excellent results but will be inferior to some other policy on other traces, and how to analyze that is absolutely unclear. Therefore, in my opinion, the best thing is to test the system with a set of different microbenchmarks and draw a conclusion about its performance from them.
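As a rough illustration (a hypothetical Cache interface and synthetic trace, not code from any of the libraries discussed), a throughput microbenchmark should first warm the cache to steady state and only then start the timed loop:

```go
package cachebench

import "testing"

// Cache is a hypothetical minimal interface for the cache under test.
type Cache interface {
	Set(key string, value int)
	Get(key string) (int, bool)
}

// benchmarkSteadyState replays the trace once so the cache is full and
// the eviction policy has reached steady state, then measures only the
// steady-state phase. Without the warm-up, the early all-miss phase
// favours caches that do less work per lookup.
func benchmarkSteadyState(b *testing.B, c Cache, trace []string) {
	for _, k := range trace {
		if _, ok := c.Get(k); !ok {
			c.Set(k, 1)
		}
	}
	b.ResetTimer() // exclude the warm-up from the measurement

	for i := 0; i < b.N; i++ {
		k := trace[i%len(trace)]
		if _, ok := c.Get(k); !ok {
			c.Set(k, 1)
		}
	}
}
```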
Thanks for leaving the discussion here. Here's another question from me for @maypok86.
I will also note that the README in golang-fifo says that Both S3-FIFO and SIEVE have O(n) time complexity for cache offloading, which occurs only when all objects are hit, meaning a perfect (100%) cache hit percentage. This is not true, and 50% or even less is more than sufficient.
Can you give me an example of when SIEVE has O(n) time complexity in the eviction phase? I thought that it only happens if the frequencies of all entries are equal to 1.
Showing results only on a Zipf distribution with the same parameters, not even changing the size, and saying you are better is very strange. Especially since the advertised SIEVE is very unstable and simply can't handle a huge number of workloads. Any simple attack breaks it.
I'm convinced by that statement.
results:
itemSize=500000, workloads=7500000, cacheSize=10.00%, zipf's alpha=0.99, concurrency=1
CACHE | HITRATE | MEMORY | QPS | HITS | MISSES
-----------------+---------+----------+----------+---------+----------
tinylfu | 80.43% | 10.10MiB | 8493771 | 6032209 | 1467791
otter | 79.86% | 10.97MiB | 6323777 | 5989814 | 1510186
s3-fifo | 79.84% | 18.69MiB | 4702194 | 5987895 | 1512105
sieve | 79.69% | 10.09MiB | 9590793 | 5976378 | 1523622
s4lru | 78.98% | 10.29MiB | 10259918 | 5923748 | 1576252
two-queue | 78.77% | 15.16MiB | 4826255 | 5907726 | 1592274
slru | 78.68% | 9.93MiB | 6561680 | 5901224 | 1598776
clock | 76.44% | 9.06MiB | 10000000 | 5732671 | 1767329
lru-hashicorp | 75.82% | 10.65MiB | 9090909 | 5686182 | 1813818
lru-groupcache | 75.82% | 10.67MiB | 8178844 | 5686182 | 1813818
Thank you for sharing the results. It's impressive that otter uses almost the same memory as the others.
I'll also include this in README.md, or use other visualization tools to show the tendency better.
So how can we improve the benchmarks? You mentioned one thing, separating benchmarks for different metrics. That sounds good to me. Additionally, I'd like to suggest not using a flat table for the results, as the information density is low and it is hard for a human to compare. A diagram with one curve per cache implementation would improve that. It would still need different types of diagrams to give the reader some value (e.g. X: concurrency and Y: QPS; X: nItems and Y: total memory usage; X: memory usage and Y: HITRATE; etc.).
Thank you for your opinions :) I'm convinced by the points you both made.
I'll consider adopting go-echarts to visualize the results better.
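For instance, a minimal go-echarts sketch (my own, using the HITRATE numbers already posted above; the exact chart layout is just an assumption) with one curve per cache implementation could look like this:

```go
package main

import (
	"os"

	"github.com/go-echarts/go-echarts/v2/charts"
	"github.com/go-echarts/go-echarts/v2/opts"
)

func main() {
	line := charts.NewLine()
	line.SetGlobalOptions(charts.WithTitleOpts(opts.Title{
		Title: "HITRATE vs. cache size (zipf alpha=0.99)",
	}))

	// X axis: cache size as a fraction of itemSize; Y: HITRATE (%),
	// values taken from the tables in this thread.
	line.SetXAxis([]string{"0.10%", "1.00%", "10.00%"}).
		AddSeries("otter", []opts.LineData{{Value: 47.49}, {Value: 63.60}, {Value: 79.86}}).
		AddSeries("tinylfu", []opts.LineData{{Value: 47.38}, {Value: 63.94}, {Value: 80.43}}).
		AddSeries("sieve", []opts.LineData{{Value: 47.42}, {Value: 63.33}, {Value: 79.69}})

	f, err := os.Create("hitrate.html")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := line.Render(f); err != nil {
		panic(err)
	}
}
```

This renders an interactive HTML chart; the same pattern works for QPS or memory usage on the Y axis.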
just I am already wildly annoyed by benchmarks that put a particular implementation in the best light
I admit my fault in this case.
I will change the README.md to include benchmark results that are as neutral as possible in the near future.
@scalalang2
Can you give me an example of when SIEVE has O(n) time complexity in the eviction phase? I thought that it only happens if the frequencies of all entries are equal to 1.
The main thing to realize here is that O(n) is an asymptotic estimate of complexity. If you need n/10 operations to evict, it's still O(n), simply by the definition of big O. That is, a much smaller number of elements with a frequency of 1 is sufficient.
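To make this concrete, here is a stripped-down sketch of the eviction scan (my own simplified ring-buffer illustration, not the golang-fifo code): the hand clears visited bits until it finds an unvisited entry, so a single eviction can inspect a large share of the n entries even when far fewer than 100% of the objects were hit.

```go
package main

import "fmt"

type entry struct {
	key     string
	visited bool
}

// evict moves the hand through the ring, giving visited entries a
// second chance (clearing their bit), and returns the index of the
// first unvisited entry plus how many entries were inspected.
func evict(entries []entry, hand int) (victim, scanned int) {
	n := len(entries)
	for {
		scanned++
		if !entries[hand].visited {
			return hand, scanned
		}
		entries[hand].visited = false
		hand = (hand + 1) % n
	}
}

func main() {
	// Only half of the entries have been hit since the last scan,
	// yet a single eviction still walks across all of them.
	entries := make([]entry, 8)
	for i := 0; i < 4; i++ {
		entries[i].visited = true
	}
	victim, scanned := evict(entries, 0)
	fmt.Println("evicted index:", victim, "entries scanned:", scanned)
	// Output: evicted index: 4 entries scanned: 5
}
```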
@scalalang2 Thanks so much for the link to go-echarts, I've been looking for a long time for a go package could draw charts nicely without doing a lot of work and go-echarts looks very cool.
The main thing to realize here is that O(n) is an asymptotic estimate of complexity. If you need n/10 operations to evict, it's still O(n), simply by the definition of big O. That is, a much smaller number of elements with a frequency of 1 is sufficient.
Great, I got the point 👍
When comparing caches, the amount of memory used by the caches should be roughly equal.
Currently, this is not the case, and some of the cache algorithms that use x times more memory look like they are the best with regard to HITRATE. That is not a fair comparison.