maypok86 / otter

A high performance cache for Go
https://maypok86.github.io/otter/

sieve #37

Closed. mskorina closed this 10 months ago

mskorina commented 10 months ago

Hi @maypok86 :)

What do you think about SIEVE?) https://junchengyang.com/publication/nsdi24-SIEVE.pdf

I found a repo by scalalang2.

He released a SIEVE implementation, and there is a comparison with otter in his benchmark repo: https://github.com/scalalang2/go-cache-benchmark (SIEVE behaves strangely with concurrency > 1, so it needs retesting with the current version of otter). There is also an implementation in https://github.com/1a1a11a/libCacheSim

maypok86 commented 10 months ago

Hi, now this is going to be a very interesting conversation.

First, about SIEVE. Yes, it is simple and shows good results on zipf-distributed workloads, but it is not at all robust across different workloads. It is neither scan- nor loop-resistant, and on many workloads (block I/O, for example) it performs much worse than LRU. There was also a big discussion about this on Hacker News recently. Unfortunately, most researchers don't show the vulnerabilities of their algorithms, and that can be a disaster in production.
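For anyone who hasn't read the paper: SIEVE is a FIFO queue with a single "visited" bit per entry and a lazily moving hand that evicts the first unvisited entry it finds, clearing visited bits along the way. Here is a minimal single-threaded sketch of that idea; the API and names are invented for illustration, this is not otter's or libCacheSim's code:

```go
package sieve

// node is one entry in SIEVE's FIFO queue. The only per-entry policy
// metadata is a single "visited" bit.
type node[K comparable, V any] struct {
	key        K
	value      V
	visited    bool
	prev, next *node[K, V] // prev points toward the head (newer), next toward the tail (older)
}

type Cache[K comparable, V any] struct {
	capacity   int
	items      map[K]*node[K, V]
	head, tail *node[K, V] // head = newest, tail = oldest
	hand       *node[K, V] // next eviction candidate; nil means "start from the tail"
}

func New[K comparable, V any](capacity int) *Cache[K, V] {
	return &Cache[K, V]{capacity: capacity, items: make(map[K]*node[K, V], capacity)}
}

// Get marks the entry as visited; hits never touch the queue itself.
func (c *Cache[K, V]) Get(key K) (V, bool) {
	if n, ok := c.items[key]; ok {
		n.visited = true
		return n.value, true
	}
	var zero V
	return zero, false
}

// Set inserts new entries at the head of the FIFO queue, evicting first if
// the cache is full.
func (c *Cache[K, V]) Set(key K, value V) {
	if n, ok := c.items[key]; ok {
		n.value = value
		n.visited = true
		return
	}
	if len(c.items) >= c.capacity {
		c.evict()
	}
	n := &node[K, V]{key: key, value: value, next: c.head}
	if c.head != nil {
		c.head.prev = n
	}
	c.head = n
	if c.tail == nil {
		c.tail = n
	}
	c.items[key] = n
}

// evict moves the hand from the tail toward the head, clearing visited bits,
// and removes the first entry whose visited bit is not set.
func (c *Cache[K, V]) evict() {
	n := c.hand
	if n == nil {
		n = c.tail
	}
	for n.visited {
		n.visited = false
		n = n.prev
		if n == nil { // passed the head: wrap around to the oldest entry
			n = c.tail
		}
	}
	c.hand = n.prev // the hand stays where the scan stopped
	if n.prev != nil {
		n.prev.next = n.next
	} else {
		c.head = n.next
	}
	if n.next != nil {
		n.next.prev = n.prev
	} else {
		c.tail = n.prev
	}
	delete(c.items, n.key)
}
```

Hits do nothing but set a bit, which is where SIEVE's simplicity and speed come from; the discussion above is about how that single bit behaves on workloads that don't look like a zipf trace.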

Now about memory consumption (which looks very large here compared to the others). There is one main point to understand. Otter uses buffers for acceleration, and they consume memory: usually less than 1 MB, but it can be more depending on the parallelism of your application (on a 64-core machine it should not exceed 3-4 MB). This could be fixed, but it would take a very big effort, and I don't think anyone is that afraid of losing a couple of MB these days. Instead, otter focuses on reducing the extra memory required to store a single item, and I think I've managed to achieve good results there.
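To make the buffer point concrete, the general shape of the technique (in the spirit of BP-Wrapper, which Caffeine popularized) looks roughly like this: reads are recorded into small fixed-size striped buffers that the policy drains in batches, so the buffers' memory scales with the number of CPUs rather than with the number of cached entries, and a full buffer simply drops the event instead of blocking the reader. The sketch below is illustrative only, with invented names and sizes, not otter's actual code:

```go
package cache

import (
	"runtime"
	"sync/atomic"
)

const stripeCapacity = 128 // must be a power of two for the index mask below

type readEvent struct {
	key uint64 // a real cache would store a pointer to the cached node instead
}

// stripe is a small bounded ring. Producers reserve a slot with a CAS on head
// and publish by storing the event; a single consumer advances tail.
type stripe struct {
	head, tail atomic.Uint64
	slots      [stripeCapacity]atomic.Pointer[readEvent]
}

type readBuffers struct {
	stripes []*stripe
}

// newReadBuffers allocates one stripe per logical CPU, so total buffer memory
// is proportional to parallelism, not to the number of cached entries.
func newReadBuffers() *readBuffers {
	rb := &readBuffers{stripes: make([]*stripe, runtime.GOMAXPROCS(0))}
	for i := range rb.stripes {
		rb.stripes[i] = &stripe{}
	}
	return rb
}

// record tries to note that key was read. It returns false, dropping the
// event, when the chosen stripe is full: the policy misses one frequency
// update, which is far cheaper than making every reader wait on a lock.
func (rb *readBuffers) record(key uint64) bool {
	s := rb.stripes[key%uint64(len(rb.stripes))]
	for {
		head := s.head.Load()
		if head-s.tail.Load() >= stripeCapacity {
			return false // buffer full: lose the read
		}
		if s.head.CompareAndSwap(head, head+1) {
			s.slots[head&(stripeCapacity-1)].Store(&readEvent{key: key})
			return true
		}
	}
}

// drain is meant to be called from a single policy goroutine. It applies the
// buffered read events to the eviction policy and frees the slots for reuse.
func (rb *readBuffers) drain(apply func(key uint64)) {
	for _, s := range rb.stripes {
		for tail := s.tail.Load(); tail < s.head.Load(); tail++ {
			slot := &s.slots[tail&(stripeCapacity-1)]
			ev := slot.Load()
			if ev == nil {
				break // slot reserved but not yet published by its producer
			}
			apply(ev.key)
			slot.Store(nil)
			s.tail.Store(tail + 1)
		}
	}
}
```

The second property, dropping events when a stripe is full, is what the hit ratio discussion further down comes back to.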

Now let's try to look at the benchmark results.

The README uses a very small cache size, which makes the size of the buffers noticeable (and hides one of the main drawbacks of the scalalang2 implementation 🙃).

results:
itemSize=500000, workloads=7500000, cacheSize=0.10%, zipf's alpha=0.99, concurrency=1

      CACHE      | HITRATE | MEMORY  |   QPS   |  HITS   | MISSES
-----------------+---------+---------+---------+---------+----------
  otter          | 47.49%  | 0.68MiB | 5415162 | 3561582 | 3938418
  sieve          | 47.42%  | 0.11MiB | 7183908 | 3556217 | 3943783
  tinylfu        | 47.38%  | 0.10MiB | 7142857 | 3553132 | 3946868
  s3-fifo        | 47.16%  | 0.20MiB | 3614458 | 3537110 | 3962890
  slru           | 46.48%  | 0.11MiB | 5863956 | 3486209 | 4013791
  s4lru          | 46.15%  | 0.11MiB | 9566327 | 3461183 | 4038817
  two-queue      | 45.49%  | 0.16MiB | 3544423 | 3411961 | 4088039
  clock          | 37.34%  | 0.10MiB | 6345178 | 2800767 | 4699233
  lru-hashicorp  | 36.59%  | 0.08MiB | 6019262 | 2744181 | 4755819
  lru-groupcache | 36.59%  | 0.11MiB | 6019262 | 2744181 | 4755819
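For perspective: cacheSize=0.10% of 500,000 items is only 500 entries, so the fixed buffer overhead described above accounts for most of otter's 0.68 MiB here and dwarfs the cached data itself.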

But as soon as the cache size increases to at least 10%, the results flip. Interestingly, this size is present in the benchmark code but for some reason is not included in the README.

results:
itemSize=500000, workloads=7500000, cacheSize=10.00%, zipf's alpha=0.99, concurrency=1

      CACHE      | HITRATE |  MEMORY  |   QPS    |  HITS   | MISSES
-----------------+---------+----------+----------+---------+----------
  tinylfu        | 80.43%  | 10.10MiB |  8493771 | 6032209 | 1467791
  otter          | 79.86%  | 10.97MiB |  6323777 | 5989814 | 1510186
  s3-fifo        | 79.84%  | 18.69MiB |  4702194 | 5987895 | 1512105
  sieve          | 79.69%  | 10.09MiB |  9590793 | 5976378 | 1523622
  s4lru          | 78.98%  | 10.29MiB | 10259918 | 5923748 | 1576252
  two-queue      | 78.77%  | 15.16MiB |  4826255 | 5907726 | 1592274
  slru           | 78.68%  | 9.93MiB  |  6561680 | 5901224 | 1598776
  clock          | 76.44%  | 9.06MiB  | 10000000 | 5732671 | 1767329
  lru-hashicorp  | 75.82%  | 10.65MiB |  9090909 | 5686182 | 1813818
  lru-groupcache | 75.82%  | 10.67MiB |  8178844 | 5686182 | 1813818

That's why I'm focused on reducing the metadata size for each item: every extra byte there is multiplied by the number of items in the cache and becomes a problem. This is why scalalang2's s3-fifo consumes a huge amount of memory (it is caused by a poor data structure for storing history; a sketch of a compact alternative follows the next table). It is also worth remembering that otter spends memory to support expiration and cost-based eviction policies, which almost none of the caches in these benchmarks do. And even with that extra memory, otter starts to overtake the other implementations as the cache size increases.

results:
itemSize=500000, workloads=7500000, cacheSize=40.00%, zipf's alpha=0.99, concurrency=1

      CACHE      | HITRATE |  MEMORY  |   QPS    |  HITS   | MISSES
-----------------+---------+----------+----------+---------+---------
  tinylfu        | 89.85%  | 39.97MiB |  6366723 | 6738578 | 761422
  sieve          | 89.77%  | 40.08MiB |  8591065 | 6732517 | 767483
  s3-fifo        | 89.60%  | 60.07MiB |  5098572 | 6720153 | 779847
  otter          | 89.57%  | 39.04MiB |  7338552 | 6717478 | 782522
  clock          | 89.23%  | 33.68MiB | 10053619 | 6692243 | 807757
  two-queue      | 89.17%  | 56.20MiB |  5681818 | 6687639 | 812361
  lru-hashicorp  | 88.94%  | 40.05MiB |  8361204 | 6670716 | 829284
  lru-groupcache | 88.94%  | 40.08MiB |  6925208 | 6670716 | 829284
  s4lru          | 88.23%  | 40.20MiB |  7411067 | 6617037 | 882963
  slru           | 87.27%  | 38.83MiB |  5617978 | 6545284 | 954716
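A quick aside on the "data structure for storing history" remark: S3-FIFO keeps a ghost queue of recently evicted keys so that a returning key can be admitted into the main queue. The history does not need full entries; a bounded ring of key hashes is enough. A rough illustrative sketch (not otter's or scalalang2's actual code):

```go
package cache

// ghostQueue is a compact FIFO history of recently evicted keys, storing only
// 64-bit key hashes instead of full entries, so its memory is bounded by
// capacity regardless of key and value sizes. For brevity, hash value 0 is
// treated as "empty slot" and duplicate hashes are not handled specially.
type ghostQueue struct {
	hashes   []uint64
	index    map[uint64]struct{}
	capacity int
	next     int // position in the ring to overwrite next
}

func newGhostQueue(capacity int) *ghostQueue {
	return &ghostQueue{
		hashes:   make([]uint64, capacity),
		index:    make(map[uint64]struct{}, capacity),
		capacity: capacity,
	}
}

// add records the hash of an evicted key, displacing the oldest record once
// the ring is full.
func (g *ghostQueue) add(hash uint64) {
	if old := g.hashes[g.next]; old != 0 {
		delete(g.index, old)
	}
	g.hashes[g.next] = hash
	g.index[hash] = struct{}{}
	g.next = (g.next + 1) % g.capacity
}

// contains reports whether a key with this hash was evicted recently.
func (g *ghostQueue) contains(hash uint64) bool {
	_, ok := g.index[hash]
	return ok
}
```

Each history record is then a fixed handful of bytes regardless of key and value size; keeping whole keys or entire evicted entries around is what makes a history balloon.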

Now the hardest thing to understand is the hit ratio drop when concurrency increases. The main reason otter (and caffeine too) is so fast is that it allows reads to be lost as contention grows (this is exactly what happens when read buffers like the ones sketched above fill up; and in fact many eviction policies throw information away anyway, e.g. s3-fifo does not distinguish between frequencies greater than three). And the scalalang2 benchmarks create exactly that kind of contention.

results:
itemSize=500000, workloads=7500000, cacheSize=10.00%, zipf's alpha=0.99, concurrency=8

      CACHE      | HITRATE |  MEMORY  |   QPS    |  HITS   | MISSES
-----------------+---------+----------+----------+---------+----------
  tinylfu        | 80.44%  | 10.10MiB |  3132832 | 6032650 | 1467350
  s3-fifo        | 79.83%  | 18.70MiB |  2926258 | 5987184 | 1512816
  sieve          | 79.70%  | 10.09MiB |  5403458 | 5977349 | 1522651
  s4lru          | 78.98%  | 10.28MiB |  3276540 | 5923600 | 1576400
  two-queue      | 78.78%  | 15.16MiB |  2883506 | 5908763 | 1591237
  slru           | 78.68%  | 9.93MiB  |  3038898 | 5901365 | 1598635
  otter          | 76.96%  | 11.40MiB | 11867089 | 5771795 | 1728205
  clock          | 76.44%  | 9.07MiB  |  3696402 | 5733309 | 1766691
  lru-hashicorp  | 75.83%  | 10.65MiB |  2580867 | 5687354 | 1812646
  lru-groupcache | 75.83%  | 10.68MiB |  3322995 | 5687011 | 1812989

Although these benchmarks cannot really be used to measure throughput, because they spend a lot of time spawning goroutines, synchronizing the zipf generator, and so on, you can still see a huge difference in throughput between otter and the other caches. In effect, these benchmarks do not just measure hit ratio: they ask otter to handle requests at by far the highest rate of all the caches, with 8 cores available, while keeping the hit ratio as high as possible. And otter fulfilled that request, withstanding many times more load than the others at the cost of two or three percentage points of hit ratio. Keep in mind that in a normal situation you will see no difference in hit ratio at all, because there won't be heavy contention; but when the load increases, the other caches simply won't keep up. This approach is used in a huge number of high-load systems (Kafka, Neo4j, Dgraph, Cassandra, Druid and other, more closed projects) with excellent results.
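To put numbers on that trade-off: in the concurrency=8 run above, otter served about 11.9M QPS against roughly 2.6-5.4M for the others, while its hit ratio went from 79.86% at concurrency=1 to 76.96%, a drop of about 2.9 percentage points.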

Besides the higher memory consumption, the scalalang2 implementation has a worst-case time complexity of O(n), while otter's S3-FIFO version is O(1).

mskorina commented 10 months ago

Thank you for your detailed response)

maypok86 commented 10 months ago

Anyway, I did some research, and it turned out that the queue used as the write buffer eats more than half of the memory (at concurrency = 16 its consumption grows to 0.98 MiB), and about 0.2 MiB is probably taken by the read buffers that keep throughput high even under huge contention (because of the contention the buffers are full). That is one more proof that you can't mix different kinds of benchmarks.