moka-rs / moka

A high performance concurrent caching library for Rust

Consider switching to TinyUFO #411

Open zaidoon1 opened 7 months ago

zaidoon1 commented 7 months ago

See https://crates.io/crates/TinyUFO. I think we would see a good perf boost from this.

ben-manes commented 7 months ago

Actually, the policy they use is called W-TinyLFU, and it is documented on this wiki as a future feature. Their implementation has a few gaps that would ideally be addressed.

tatsuya6502 commented 7 months ago

Hi. Thank you for the information. I reviewed the design of TinyUFO and S3-FIFO. I also compared the performance of a TinyUFO implementation (TinyUFO@0.1.0) against moka@0.12.6 using my mokabench tool with the ARC-S3 trace dataset.

Here is my first thought:

  1. We can learn from its design and could apply some of its ideas to moka.
  2. However, we do not have to rush, as the performance gain might be small and we have very limited development bandwidth.

Before looking at the mokabench results, let me clarify moka's current development priorities.

(Ordered from highest to lowest priority)

  1. Provide high cache hit rate.
  2. Easy to use.
    • e.g. No lock is needed in the user code.
  3. Convenient features.
    • e.g. Expirations, atomic insert (get_with, and_compute_with, etc.), eviction listener, iterator, cache statistics. (See the short usage sketch after this list.)
  4. Small memory footprint.
  5. Short compile time and small binary size.
  6. Small performance overhead compared to a concurrent hash table.
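To give an idea of what 2 and 3 mean in practice, here is a minimal usage sketch of the sync cache. The key/value types, capacity, and TTL are only illustrative:

```rust
use std::time::Duration;
use moka::sync::Cache;

fn main() {
    // Build a bounded cache with a time-to-live expiration.
    let cache: Cache<String, String> = Cache::builder()
        .max_capacity(10_000)
        .time_to_live(Duration::from_secs(600))
        .build();

    // Atomic "get or insert": the init closure runs at most once per key,
    // even under concurrent calls, and no user-side lock is needed.
    let value = cache.get_with("user:42".to_string(), || {
        // e.g. load the value from a database on a cache miss
        "Alice".to_string()
    });
    assert_eq!(value, "Alice");

    // Plain reads and writes.
    cache.insert("user:43".to_string(), "Bob".to_string());
    assert_eq!(cache.get(&"user:43".to_string()), Some("Bob".to_string()));
}
```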

As for 1 (hit rate), I believe TinyUFO will provide a similar hit rate to W-TinyLFU with a fixed window size. A high hit rate is critical for application performance because the cache miss penalty (the extra latency to get the data from a slower medium) is much greater than the latency to read the data from the cache.
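To make that concrete, here is a rough back-of-the-envelope model. The 800 ns cache read and the 5 ms miss penalty are only assumed, illustrative numbers:

```rust
fn main() {
    // Average access latency =
    //     hit_rate * cache_read_latency + (1 - hit_rate) * miss_penalty
    let cache_read_ns = 800.0_f64; // assumed cache read latency
    let miss_penalty_ns = 5_000_000.0_f64; // assumed 5 ms to fetch from a slower backend

    for hit_rate in [0.70_f64, 0.42] {
        let avg_ns = hit_rate * cache_read_ns + (1.0 - hit_rate) * miss_penalty_ns;
        println!(
            "hit rate {:>3.0}% -> avg ~{:.2} ms per access",
            hit_rate * 100.0,
            avg_ns / 1_000_000.0
        );
    }
    // Roughly 1.50 ms at a 70% hit rate vs. 2.90 ms at a 42% hit rate;
    // a few hundred nanoseconds of per-operation overhead barely moves these numbers.
}
```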

From the mokabench results I am going to show, the average latency per cache operation (read or write) is listed in the last column of the table below: roughly 694 to 848 ns for moka and 407 to 719 ns for TinyUFO.

As for examples of cache miss penalties, here is a quote from the book Systems Performance: Enterprise and the Cloud, 2nd Edition by Brendan Gregg:

[Screenshot of the book's table of example system latencies]

The "latency" column shows example system latencies, and the "scaled" column shows the latencies scaled to an imaginary system in which a CPU cycle takes one full second.

As you can see, on that scale, moka's roughly 700 to 850 ns per cache operation corresponds to about 42 minutes, which is almost nothing compared to the scaled latencies of accessing an HDD or the Internet.
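Here is that back-of-the-envelope scaling applied to the measured cache latencies, assuming ~0.3 ns per CPU cycle (a 3.3 GHz CPU) as in the book's table:

```rust
fn main() {
    // On the book's scale, one CPU cycle (~0.3 ns) is stretched to one second.
    let ns_per_cycle = 0.3_f64;
    let seconds_per_ns = 1.0 / ns_per_cycle;

    // moka's measured average latency per cache operation (from the table
    // below) ranges roughly from 694 to 848 ns.
    for latency_ns in [694.0_f64, 848.0] {
        let scaled_minutes = latency_ns * seconds_per_ns / 60.0;
        println!("{latency_ns} ns -> ~{scaled_minutes:.0} minutes on the scaled clock");
    }
    // Roughly 39 to 47 minutes, while the book lists far larger scaled
    // latencies (on the order of months and years) for HDD and Internet access.
}
```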

So, in general, 1 (hit rate) is much more important than 6 (small performance overhead). And by design, TinyUFO and W-TinyLFU (with a fixed window size) will provide comparable hit rates.

As for 4 (memory footprint), TinyUFO can do better than W-TinyLFU, as they use queues and doubly linked lists respectively. A queue can be array-based and have a smaller memory footprint than a doubly linked list. However, a queue is not a suitable data structure for implementing entry expiration efficiently, so moka may have to continue using doubly linked lists to support expirations. This is a trade-off between 4 and 3 (convenient features).
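To illustrate why, here is a rough sketch (not moka's actual code) of an arena-backed doubly linked list: because each cache entry can keep a handle (an index) to its own node, an expired entry can be unlinked from anywhere in the list in O(1), whereas an array-based queue only supports cheap removal at its ends:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Node {
    prev: Option<usize>,
    next: Option<usize>,
}

struct List {
    nodes: Vec<Node>,
    head: Option<usize>,
    tail: Option<usize>,
}

impl List {
    /// Unlink the node at `idx` in O(1); no scan is needed because the
    /// cache entry keeps the index of its own node.
    fn unlink(&mut self, idx: usize) {
        let Node { prev, next } = self.nodes[idx];
        match prev {
            Some(p) => self.nodes[p].next = next,
            None => self.head = next,
        }
        match next {
            Some(n) => self.nodes[n].prev = prev,
            None => self.tail = prev,
        }
    }
}

fn main() {
    // Three entries linked as 0 <-> 1 <-> 2.
    let mut list = List {
        nodes: vec![
            Node { prev: None, next: Some(1) },
            Node { prev: Some(0), next: Some(2) },
            Node { prev: Some(1), next: None },
        ],
        head: Some(0),
        tail: Some(2),
    };

    // Entry 1 has expired: remove it without scanning the list.
    list.unlink(1);
    assert_eq!(list.nodes[0].next, Some(2));
    assert_eq!(list.nodes[2].prev, Some(0));
    assert_eq!((list.head, list.tail), (Some(0), Some(2)));
}
```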

There is a trade-off between 3 and 6 (small performance overhead) too. So moka, which prioritizes 3, will generally perform less well than other cache implementations that offer simpler functionality. But this will not be a problem in real-world applications, as 1 (hit rate) has a much greater impact on application performance than 6.

OK. Having said that, here are the mokabench results.

| Cache | Max Capacity | Estimated Number of Keys for LFU | Clients | Inserts | Reads | Hit Ratio | Memory Usage (RSS) | Duration | Avg. Latency per Operation |
|---|---|---|---|---|---|---|---|---|---|
| Moka Sync Cache | 100,000 | n/a | 1 | 14,694,850 | 16,407,702 | 10.439% | 572,852k | 21.595 s | 694 ns |
| TinyUFO | 100,000 | 100,000 | 1 | 15,685,418 | 16,407,702 | 4.402% | 570,560k | 13.077 s | 407 ns |
| TinyUFO | 100,000 | 1,000,000 | 1 | 15,074,058 | 16,407,702 | 8.128% | 633,272k | 21.706 s | 689 ns |
| Moka Sync Cache | 400,000 | n/a | 1 | 9,439,274 | 16,407,702 | 42.470% | 725,132k | 21.926 s | 848 ns |
| TinyUFO | 400,000 | 400,000 | 1 | 11,456,282 | 16,407,702 | 30.177% | 723,656k | 17.174 s | 616 ns |
| TinyUFO | 400,000 | 4,000,000 | 1 | 10,948,476 | 16,407,702 | 33.272% | 994,580k | 19.670 s | 719 ns |
| Moka Sync Cache | 800,000 | n/a | 1 | 4,868,066 | 16,407,702 | 70.331% | 927,520k | 15.246 s | 716 ns |
| TinyUFO | 800,000 | 800,000 | 1 | 5,597,449 | 16,407,702 | 65.885% | 929,512k | 12.203 s | 555 ns |
| TinyUFO | 800,000 | 8,000,000 | 1 | 5,597,708 | 16,407,702 | 65.884% | 1.4g | 13.530 s | 614 ns |

It seems the current TinyUFO implementation can be improved for a better hit ratio and a smaller memory footprint. But even if it is improved, I am afraid that the performance gain from adopting it might be small.