scalalang2 / go-cache-benchmark

Cache benchmark for web cache workloads in golang.

Make comparisons fair #4

Open rockdaboot opened 10 months ago

rockdaboot commented 10 months ago

When comparing caches, the amount of memory used by each cache should be roughly equal.

Currently, this is not the case, and some of the cache algorithms that use several times more memory look like the best with regard to HITRATE. That is not a fair comparison.

scalalang2 commented 10 months ago

Thank you for your comment! @rockdaboot

I understand now. You mean that the cache size doesn't have to be the same, but the memory usage should be the same, right?

(1) I understand it as follows: higher hit rates can potentially lead to unfair evaluations. I think you're right about this; a cache that holds more elements for a better hit rate typically consumes more memory :)

(2) However, I was wondering about your thoughts on the Otter cache specifically. It uses 8-10x more memory than all the other algorithms.

I thought there was a trade-off between MEMORY, HITRATE, and QPS.

rockdaboot commented 10 months ago

Thank you for your comment! @rockdaboot

I understand now. You mean that the cache size doesn't have to be the same, but the memory usage should be the same, right?

For comparisons, yes.

For example, I know that some implementations (especially those using hashmaps/hashtables) tend to round the number of items stored (size or cap) up to the next power of two. (This is done to avoid costly divisions.) Now imagine you create two caches with 5000 entries. One of them allocates 5000 entries, the other one 8192. The second one will have a huge advantage. Additionally, many hashmaps/hashtables have a "load factor"; I have seen 0.75 or 0.8. So if you create such a cache with N entries, internally N/0.75 slots are allocated but only N are used (this reduces collisions).
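
As a minimal sketch of that sizing behavior (illustrative only, not taken from any of the benchmarked caches):

```go
package main

import "fmt"

// nextPowerOfTwo returns the smallest power of two >= n.
func nextPowerOfTwo(n int) int {
	p := 1
	for p < n {
		p <<= 1
	}
	return p
}

// allocatedSlots sketches how a power-of-two hash table with a load factor
// might size itself for a requested capacity.
func allocatedSlots(requested int, loadFactor float64) int {
	minSlots := int(float64(requested)/loadFactor) + 1
	return nextPowerOfTwo(minSlots)
}

func main() {
	// A cache created for 5000 entries may really reserve 8192 slots.
	fmt.Println(allocatedSlots(5000, 0.75)) // 8192
}
```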

To work around this, let's view a cache as a black box with a certain size in bytes (memory usage) and ignore any internals. Now you can create all test candidates with roughly the same memory usage and compare QPS and HITRATE. With this, you eliminate one of three variables, which makes comparison easier.
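
A minimal sketch of that black-box measurement in Go, assuming runtime.ReadMemStats plus forced GCs is precise enough at these cache sizes:

```go
package main

import (
	"fmt"
	"runtime"
)

// retainedHeap reports how many additional heap bytes stay allocated after
// build runs. build should create and fill the cache and return it, so the
// cache remains reachable while we measure.
func retainedHeap(build func() any) uint64 {
	runtime.GC()
	var before runtime.MemStats
	runtime.ReadMemStats(&before)

	cache := build()

	runtime.GC()
	var after runtime.MemStats
	runtime.ReadMemStats(&after)
	runtime.KeepAlive(cache)
	return after.HeapAlloc - before.HeapAlloc
}

func main() {
	bytes := retainedHeap(func() any {
		m := make(map[int]int, 5000) // stand-in for a real cache under test
		for i := 0; i < 5000; i++ {
			m[i] = i
		}
		return m
	})
	fmt.Printf("%.2f MiB\n", float64(bytes)/(1<<20))
}
```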

And regarding "otter": I didn't look into why it eats so much memory. It's either a bug or a design flaw. I'd ignore it for now, or open an issue to ask what is going on.

maypok86 commented 10 months ago

Hi, I certainly don't know where the comments about otter come from, but let me tell you what I think about these benchmarks.

The benchmarks in the README don't really tell the truth. To achieve great scalability, otter uses buffer sets, and unfortunately they consume memory; goroutines are also needed to maintain the bp-wrapper technique, and they take memory too (creating goroutines could be avoided, but for now that's how it is). And in benchmarks with very small cache sizes, this overhead is noticeable.
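
For context, a heavily simplified sketch of the bp-wrapper batching idea (an assumed shape for illustration, not otter's actual code; otter's real buffers are striped and lossy, which is where the extra memory goes):

```go
package cache

import "sync"

// readBuffer is a toy bp-wrapper style buffer: reads are recorded here
// instead of touching the eviction policy directly, and the whole batch is
// applied under a single policy-lock acquisition when the buffer fills.
// The fixed capacity of each buffer (times the number of stripes) is
// exactly the memory overhead discussed above.
type readBuffer struct {
	mu    sync.Mutex
	keys  []uint64
	drain func(batch []uint64) // applies the batch to the eviction policy
}

func newReadBuffer(capacity int, drain func([]uint64)) *readBuffer {
	return &readBuffer{keys: make([]uint64, 0, capacity), drain: drain}
}

func (b *readBuffer) record(key uint64) {
	b.mu.Lock()
	b.keys = append(b.keys, key)
	if len(b.keys) < cap(b.keys) {
		b.mu.Unlock()
		return
	}
	batch := b.keys
	b.keys = make([]uint64, 0, cap(batch)) // fresh buffer keeps memory resident
	b.mu.Unlock()
	b.drain(batch) // one lock acquisition amortized over the whole batch
}
```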

results:
itemSize=500000, workloads=7500000, cacheSize=0.10%, zipf's alpha=0.99, concurrency=1

      CACHE      | HITRATE | MEMORY  |   QPS   |  HITS   | MISSES
-----------------+---------+---------+---------+---------+----------
  otter          | 47.49%  | 0.68MiB | 5415162 | 3561582 | 3938418
  sieve          | 47.42%  | 0.11MiB | 7183908 | 3556217 | 3943783
  tinylfu        | 47.38%  | 0.10MiB | 7142857 | 3553132 | 3946868
  s3-fifo        | 47.16%  | 0.20MiB | 3614458 | 3537110 | 3962890
  slru           | 46.48%  | 0.11MiB | 5863956 | 3486209 | 4013791
  s4lru          | 46.15%  | 0.11MiB | 9566327 | 3461183 | 4038817
  two-queue      | 45.49%  | 0.16MiB | 3544423 | 3411961 | 4088039
  clock          | 37.34%  | 0.10MiB | 6345178 | 2800767 | 4699233
  lru-hashicorp  | 36.59%  | 0.08MiB | 6019262 | 2744181 | 4755819
  lru-groupcache | 36.59%  | 0.11MiB | 6019262 | 2744181 | 4755819

But otter is more focused on reducing the size of the metadata required for each item in the cache, and as the cache size increases, otter shows great results (note that otter also supports an expiration policy and a cost-based eviction policy, which consume memory as well, and most caches in these benchmarks can't do that).

results:
itemSize=500000, workloads=7500000, cacheSize=1.00%, zipf's alpha=0.99, concurrency=1

      CACHE      | HITRATE | MEMORY  |   QPS    |  HITS   | MISSES
-----------------+---------+---------+----------+---------+----------
  tinylfu        | 63.94%  | 0.99MiB |  7708119 | 4795377 | 2704623
  otter          | 63.60%  | 1.68MiB |  5255781 | 4769837 | 2730163
  s3-fifo        | 63.59%  | 1.73MiB |  3930818 | 4768904 | 2731096
  sieve          | 63.33%  | 1.00MiB |  8081897 | 4750114 | 2749886
  slru           | 62.89%  | 1.03MiB |  6732496 | 4716703 | 2783297
  s4lru          | 62.68%  | 1.06MiB | 10135135 | 4701098 | 2798902
  two-queue      | 61.99%  | 1.52MiB |  4045307 | 4649213 | 2850787
  clock          | 56.12%  | 0.86MiB |  7411067 | 4208674 | 3291326
  lru-hashicorp  | 55.37%  | 1.02MiB |  6996269 | 4153014 | 3346986
  lru-groupcache | 55.37%  | 1.05MiB |  7095553 | 4153014 | 3346986

And with the larger cache size that is also present in the benchmarks, the following happens:

results:
itemSize=500000, workloads=7500000, cacheSize=10.00%, zipf's alpha=0.99, concurrency=1

      CACHE      | HITRATE |  MEMORY  |   QPS    |  HITS   | MISSES
-----------------+---------+----------+----------+---------+----------
  tinylfu        | 80.43%  | 10.10MiB |  8493771 | 6032209 | 1467791
  otter          | 79.86%  | 10.97MiB |  6323777 | 5989814 | 1510186
  s3-fifo        | 79.84%  | 18.69MiB |  4702194 | 5987895 | 1512105
  sieve          | 79.69%  | 10.09MiB |  9590793 | 5976378 | 1523622
  s4lru          | 78.98%  | 10.29MiB | 10259918 | 5923748 | 1576252
  two-queue      | 78.77%  | 15.16MiB |  4826255 | 5907726 | 1592274
  slru           | 78.68%  | 9.93MiB  |  6561680 | 5901224 | 1598776
  clock          | 76.44%  | 9.06MiB  | 10000000 | 5732671 | 1767329
  lru-hashicorp  | 75.82%  | 10.65MiB |  9090909 | 5686182 | 1813818
  lru-groupcache | 75.82%  | 10.67MiB |  8178844 | 5686182 | 1813818

I have already written about the calculation of hit ratio and QPS in these benchmarks on Hacker News.

  1. Showing results only on a zipf distribution with the same parameters, without even changing the size, and claiming you are better is very strange. Especially since the advertised SIEVE is very unstable and simply can't handle a huge number of workloads; any simple attack breaks it.
  2. Checking QPS with these same benchmarks is strange, because a very large amount of CPU time is wasted on unnecessary work (e.g. the benchmarks also measure starting goroutines, synchronizing them, and calculating the hit ratio).
  3. Creating artificially large contention for otter, which adapts to an increasing workload, also doesn't look like a good idea. As a result, otter has to handle many times more workload than the others.

@rockdaboot actually, nobody cares how many entries are allocated in the hash table, and it doesn't give any advantage, because it is the eviction policy that limits the number of items in the cache: if the size is exceeded, it deletes the selected items.

maypok86 commented 10 months ago

I will also note that the README in golang-fifo says that

Both S3-FIFO and SIEVE have O(n) time complexity for cache offloading, which occurs only when all objects are hit, meaning a perfect (100%) cache hit percentage.

This is not true, and 50% or even less is more than sufficient.

maypok86 commented 10 months ago

In general, the queue used as a write buffer eats more than half of the memory (with concurrency = 16, memory consumption increases to 0.98 MiB), and about 0.2 MiB is probably eaten by the read buffers that keep the speed high even under huge contention. That is yet more proof that you should not mix benchmarks.

rockdaboot commented 10 months ago

@rockdaboot actually, nobody cares how many entries are allocated in the hash table and it doesn't give any advantage

It depends on the use case. E.g. in one case we have strict memory and CPU constraints for a host agent. Internally, it uses many different cache instances (some bigger ones, lots of small ones, a mix of single-threaded and concurrent usage). So we have to find a balance between total memory usage, HITRATE, and QPS. A cache that increases its memory usage based on the number of goroutines does not sound like a good choice for this case.

In other words, IMO generalized statements like "nobody cares" don't help.

So how can we improve the benchmarks? You mentioned one thing: separating benchmarks for different metrics. That sounds good to me. Additionally, I'd suggest not using a flat table for the results, as its information density is low while being hard for a human to compare. A diagram with one curve per cache implementation would improve that. It still needs different types of diagrams to give the reader some value (e.g. X: concurrency, Y: QPS; X: nItems, Y: total memory usage; X: memory usage, Y: HITRATE; etc.).

Besides the input parameters that we already have, we should also allow different key and value types. The value creation time and memory overhead also play a role with regard to QPS (which is a function of HITRATE).

Sounds complex, but a generalized benchmark is unfair to either one or the other cache implementation.

maypok86 commented 10 months ago

@rockdaboot It really depends on what exactly you need; in all likelihood a regular LRU or ARC cache will suffice.

There are a few things that need to be defined right away.

  1. First off, I apologize for the somewhat harsh statements; I am just already wildly annoyed by benchmarks that put a particular implementation in the best light and trample the rest in the mud. More often than not they are simply wrong. For example, the theine benchmarks check the speed of cache misses (https://github.com/Yiling-J/go-cache-benchmark-plus/blob/main/benchmark_test.go#L22), and there the author runs benchmarks on a busy GitHub runner with parallelism clearly exceeding the available number of cores (there are other problems there too, but this alone doesn't inspire any confidence). It also doesn't say why exactly that cache is faster than ristretto, and this is a very important point: it cuts the eviction policy, which usually hurts the hit ratio a lot. The memory consumption benchmarks are also sad; although they are more honest than the benchmarks in this repository, there are secrets behind that implementation's performance too. Its lru implementation allocates the map once with the right size, while others have to grow it several times, and it does not spend memory on data structures for removing expired elements, deleting them only when they are directly requested. I am not even talking about the strange hit ratio of ristretto...
  2. Memory consumption of all caches using the bp-wrapper technique can indeed grow (in a very limited way), but it happens for other reasons. It's not the hash table (which I just wrote about) or an increase in the number of goroutines; it's that in these benchmarks the QPS from the different goroutines (competing with each other) hitting one cache increases dramatically, so the cache starts spreading items across buffers and unblocking the eviction policy just to update a buffer. Outside of high-load scenarios you won't see much difference, and applications running such workloads are more than willing to pay the cost; PostgreSQL, Kafka, and many others do this. Perhaps this is not suitable for your case; then it is worth pinning down the QPS, the required memory limits, and the required hit ratio (ideally you should know the type of workload as well).
  3. Yes, splitting the benchmarks looks much more honest. One problem is that it can be very hard to write good benchmarks for memory consumption. Hash tables sometimes grow at different moments, which makes different caches look better at different times, and I don't know an easy way to get rid of this problem.
  4. As QPS benchmarks, I haven't seen anything better than the caffeine benchmarks yet (very angry 🙃). The benchmark's overhead is thread-local work: an index increment, an array lookup, a loop, and an operation counter, allowing you to utilize the full potential of your hardware. Also, the benchmark uses a pre-populated cache and a zipf distribution for the access requests. That creates hot spots, as some entries are used much more often than others, just like a real cache would experience. A common mistake is a uniform distribution, which evenly spreads the contention and would imply that random replacement is the ideal policy. A skewed distribution means that locks suffer higher contention, which exacerbates contention as a problem, while also benefiting contention-free implementations that can better utilize the CPU cache. A huge amount of effort (including by the authors of caffeine and s3-fifo) has gone into achieving good results on these benchmarks, so I would suggest using my port (a rough sketch of the loop follows this list).
  5. And to check the hit ratio there are already simulators in Go (ristretto, theine, and otter definitely have them, and they show roughly the same results).
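
A rough Go sketch of what such a benchmark loop can look like (the Cache interface, sizes, and seed are hypothetical stand-ins for illustration, not the actual port):

```go
package cachebench

import (
	"math/rand"
	"testing"
)

// Cache is a hypothetical stand-in for the implementations under test.
type Cache interface {
	Get(key uint64) (uint64, bool)
	Set(key, value uint64)
}

// runZipfGet is a caffeine-style throughput benchmark helper: the zipf trace
// is precomputed and the cache is pre-populated, so the timed loop contains
// only an index increment, an array lookup, and the cache call, with no
// goroutine startup or hit-ratio bookkeeping inside it.
func runZipfGet(b *testing.B, newCache func(size int) Cache) {
	const traceLen = 1 << 16 // power of two so we can mask instead of mod
	keys := make([]uint64, traceLen)
	z := rand.NewZipf(rand.New(rand.NewSource(1)), 1.01, 1, 1<<20)
	for i := range keys {
		keys[i] = z.Uint64()
	}

	c := newCache(1 << 14)
	for _, k := range keys { // warm the cache before timing
		c.Set(k, k)
	}

	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		i := rand.Intn(traceLen) // per-goroutine offset into the shared trace
		for pb.Next() {
			c.Get(keys[i&(traceLen-1)])
			i++
		}
	})
}
```
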
maypok86 commented 10 months ago

I will also note one extremely non-obvious thing, which took me a very long time to realize. Running throughput benchmarks on an incompletely filled cache is always a road to nowhere, because such benchmarks penalize caches with a high hit ratio: a cache miss always spends much less CPU time than a cache hit. As a result, you can easily draw the wrong conclusion from such benchmarks. Running benchmarks on some other trace leads to similar problems, because there are traces on which one eviction policy has excellent results while being inferior to another policy on different traces, and how to analyze that is absolutely unclear. Therefore, in my opinion, the best approach is to test the system with a set of different microbenchmarks and draw conclusions about its performance from all of them together.

scalalang2 commented 10 months ago

Thanks for the discussion here. Here's another question for @maypok86.

I will also note that the README in golang-fifo says that Both S3-FIFO and SIEVE have O(n) time complexity for cache offloading, which occurs only when all objects are hit, meaning a perfect (100%) cache hit percentage. This is not true, and 50% or even less is more than sufficient.

Can you give me an example of when SIEVE has O(N) time complexity in the eviction phase? I thought it only happens if the frequencies of all entries are equal to 1.

Showing results only on a zipf distribution with the same parameters, without even changing the size, and claiming you are better is very strange. Especially since the advertised SIEVE is very unstable and simply can't handle a huge number of workloads; any simple attack breaks it.

I'm convinced by that statement.

results:
itemSize=500000, workloads=7500000, cacheSize=10.00%, zipf's alpha=0.99, concurrency=1

      CACHE      | HITRATE |  MEMORY  |   QPS    |  HITS   | MISSES
-----------------+---------+----------+----------+---------+----------
  tinylfu        | 80.43%  | 10.10MiB |  8493771 | 6032209 | 1467791
  otter          | 79.86%  | 10.97MiB |  6323777 | 5989814 | 1510186
  s3-fifo        | 79.84%  | 18.69MiB |  4702194 | 5987895 | 1512105
  sieve          | 79.69%  | 10.09MiB |  9590793 | 5976378 | 1523622
  s4lru          | 78.98%  | 10.29MiB | 10259918 | 5923748 | 1576252
  two-queue      | 78.77%  | 15.16MiB |  4826255 | 5907726 | 1592274
  slru           | 78.68%  | 9.93MiB  |  6561680 | 5901224 | 1598776
  clock          | 76.44%  | 9.06MiB  | 10000000 | 5732671 | 1767329
  lru-hashicorp  | 75.82%  | 10.65MiB |  9090909 | 5686182 | 1813818
  lru-groupcache | 75.82%  | 10.67MiB |  8178844 | 5686182 | 1813818

Thank you for sharing the results. It's impressive that otter uses almost the same memory as the others.

I'll also include this in the README.md, or use other visualization tools to show the tendency better.

scalalang2 commented 10 months ago

So how can we improve the benchmarks? You mentioned one thing: separating benchmarks for different metrics. That sounds good to me. Additionally, I'd suggest not using a flat table for the results, as its information density is low while being hard for a human to compare. A diagram with one curve per cache implementation would improve that. It still needs different types of diagrams to give the reader some value (e.g. X: concurrency, Y: QPS; X: nItems, Y: total memory usage; X: memory usage, Y: HITRATE; etc.).

Thank you for your opinion :) I'm convinced by the opinions you guys left.

I'll consider adopting go-echarts to visualize the results better.
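
For example, a minimal go-echarts sketch plotting the otter and sieve hit rates from the tables above (the output file name is arbitrary):

```go
package main

import (
	"os"

	"github.com/go-echarts/go-echarts/v2/charts"
	"github.com/go-echarts/go-echarts/v2/opts"
)

func main() {
	line := charts.NewLine()
	line.SetGlobalOptions(charts.WithTitleOpts(opts.Title{
		Title: "HITRATE vs. cache size (zipf alpha=0.99)",
	}))
	// Hit rates taken from the result tables in this thread.
	line.SetXAxis([]string{"0.10%", "1.00%", "10.00%"}).
		AddSeries("otter", []opts.LineData{{Value: 47.49}, {Value: 63.60}, {Value: 79.86}}).
		AddSeries("sieve", []opts.LineData{{Value: 47.42}, {Value: 63.33}, {Value: 79.69}})

	f, err := os.Create("hitrate.html")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	line.Render(f) // writes a self-contained HTML chart
}
```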

I am just already wildly annoyed by benchmarks that put a particular implementation in the best light

I admit my fault in this case.

I will change the README.md to include benchmark results that are as neutral as possible in the near future.

maypok86 commented 10 months ago

@scalalang2

Can you give me an example of when SIEVE has O(N) time complexity in the eviction phase? I thought it only happens if the frequencies of all entries are equal to 1.

The main thing to realize here is that O(n) is an asymptotic estimate of complexity. If you need n/10 operations to evict, it's still O(n), simply by the definition of big O. That is, a much smaller number of elements with a frequency of 1 is sufficient.
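
To make that concrete, a simplified slice-based sketch of the SIEVE hand scan (hypothetical; real implementations use a linked list):

```go
package main

import "fmt"

// entry is a cache slot with SIEVE's "visited" bit, set on a cache hit.
type entry struct {
	key     string
	visited bool
}

// evict walks the hand from newer to older entries, clearing visited bits
// (the "second chance"), and returns the index of the first unvisited entry.
// The cost is the length of the run of visited entries in front of the hand,
// so even a cache where only half the entries were hit can pay O(n) per
// eviction; a 100% hit ratio is not required.
func evict(queue []entry, hand int) int {
	for queue[hand].visited {
		queue[hand].visited = false
		hand--
		if hand < 0 {
			hand = len(queue) - 1 // wrap around
		}
	}
	return hand
}

func main() {
	q := []entry{{"a", true}, {"b", true}, {"c", false}, {"d", true}}
	fmt.Println("evict index:", evict(q, len(q)-1)) // scans "d", evicts "c"
}
```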

maypok86 commented 10 months ago

@scalalang2 Thanks so much for the link to go-echarts. I've been looking for a long time for a Go package that could draw charts nicely without a lot of work, and go-echarts looks very cool.

scalalang2 commented 10 months ago

The main thing to realize here is that O(n) is an asymptotic estimate of complexity. If you need n/10 operations to evict, it's still O(n), simply by the definition of big O. That is, a much smaller number of elements with a frequency of 1 is sufficient.

Great, I got the point 👍