maypok86 / otter

A high performance cache for Go
https://maypok86.github.io/otter/
Apache License 2.0

Some potential bugs and improvements #17

Closed: 1a1a11a closed this issue 9 months ago

1a1a11a commented 10 months ago
  1. If my understanding is correct, the current implementation has a bug that may evict more objects than needed. I believe it should return at this line, instead of continuing to evict, because an object has already been evicted from the main queue (a sketch follows this list): https://github.com/maypok86/otter/blob/main/internal/s3fifo/small.go#L62

  2. As you have mentioned (https://github.com/Yiling-J/theine-go/issues/29), it is better to store a fingerprint instead of the full object in the ghost. Storing data in the ghost defeats the purpose of using a ghost queue...

  3. The size of the ghost queue does not need to be accurate, so you can estimate it as the total cache size divided by the mean object size. VMware found that the ghost queue size has little impact when it is between 40% and 200% of the main queue size (in terms of number of objects).

  4. There are multiple if statements that seem unnecessary, e.g., https://github.com/maypok86/otter/blob/main/internal/s3fifo/small.go#L55
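
A minimal sketch of the early-return fix from item 1 (and the fingerprint-only ghost queue from item 2), using hypothetical, simplified types that only stand in for otter's internal s3fifo structures:

```go
// Hypothetical stand-ins for otter's internal s3fifo types.
type node struct {
	key  uint64 // fingerprint of the key, not the full object
	freq int
}

type queue struct{ items []*node }

func (q *queue) len() int         { return len(q.items) }
func (q *queue) popFront() *node  { n := q.items[0]; q.items = q.items[1:]; return n }
func (q *queue) pushBack(n *node) { q.items = append(q.items, n) }

type s3fifo struct {
	small, main *queue
	mainCap     int
	ghost       map[uint64]struct{} // fingerprints only, no values (item 2)
}

// evictFromSmall frees exactly one slot. The key point from item 1: once a
// promotion causes an eviction from the main queue, return immediately
// instead of continuing to evict from the small queue.
func (c *s3fifo) evictFromSmall() {
	for c.small.len() > 0 {
		n := c.small.popFront()

		if n.freq > 0 {
			// Accessed while in the small queue: promote to the main queue.
			n.freq = 0
			c.main.pushBack(n)
			if c.main.len() > c.mainCap {
				// Simplified main-queue eviction; a slot has already been
				// freed, so stop here (the early return from item 1).
				c.main.popFront()
				return
			}
			continue
		}

		// Not accessed: remember only its fingerprint in the ghost queue.
		c.ghost[n.key] = struct{}{}
		return
	}
}
```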

maypok86 commented 10 months ago

I think I've already fixed the first two problems in the dev branch https://github.com/maypok86/otter/tree/dev

  1. Interesting idea, I'll have to try it. So far I've limited the size of the ghost queue to the number of items in the small and main queues, and it works fine in principle, except for the S3 and DS1 traces.

  2. Hmmm, that sounds very true, I probably just made up my own non-existent cases 🙂

P.S. I tried debugging and the if is really not needed.

1a1a11a commented 10 months ago

Awesome work! :) About scalability: if you want to improve it, one easy change is to make read, write, and eviction concurrent using atomics.

maypok86 commented 10 months ago

@1a1a11a Speed is otter's best feature by a huge margin so far, so for me the higher priorities right now are:

  1. Add proper support for deleting arbitrary items
  2. Improve expiration policy
  3. Improve hit ratio on S3, DS1 and similar traces (I even agree to be a guinea pig 😁).
  4. Think about whether O(n) worst-case eviction is really so terrible
1a1a11a commented 10 months ago

Makes sense!

  1. If you are interested in an advanced expiration technique, we have designed Segcache, which removes expired objects within one second after expiration (more details are here https://www.usenix.org/conference/nsdi21/presentation/yang-juncheng). But more likely a simple strategy would suffice because short-lived data will be removed at the small queue boundary.
  2. If your target audience and workload are web applications (not serving page cache traffic), I would not worry too much about S3 and DS1. If you are interested, we can also find a way to collaborate on this.
  3. The worst-case O(n) is indeed a concern. I like Ben's suggestion of evicting a random object once some number (e.g., 10 or 20) of objects have been re-inserted (a rough sketch follows this list). This number will probably depend on your tail latency SLA.
  4. Love to chat more if you are interested! :)
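
A rough sketch of the bound from item 3, reusing the hypothetical types from the earlier sketch. The threshold is illustrative, and for simplicity the node at the head is evicted once the budget is exhausted rather than a truly random one:

```go
// maxPromotions bounds the worst-case eviction scan (item 3). The value is
// made up; the right number depends on the tail-latency SLA.
const maxPromotions = 20

func (c *s3fifo) evictFromSmallBounded() {
	promoted := 0
	for c.small.len() > 0 {
		n := c.small.popFront()

		if n.freq > 0 && promoted < maxPromotions {
			// Promote as usual while we still have scan budget left.
			n.freq = 0
			c.main.pushBack(n)
			promoted++
			continue
		}

		// Either the node was not accessed, or the promotion budget is
		// exhausted: evict this node to keep the worst case bounded.
		c.ghost[n.key] = struct{}{}
		return
	}
}
```
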
maypok86 commented 10 months ago
  1. The problem with the expiration policy is that I would like a proactive expiration policy that has amortized O(1) complexity and requires less than 16 bytes per object. So far I'm leaning towards the timer wheel, even though it requires 16 bytes per object (prev and next pointers), but this solution makes it easy to build a separate policy for a constant ttl using a regular fifo queue (which is the most common need when using a cache library); a sketch of that idea follows this list. The segcache expiration policy looks suitable, except for the ability to remove an arbitrary element from the expiration policy (I seem to know how to solve this, but in the end it would give only a very small memory footprint advantage), and it is not clear how to build a nice constant-ttl expiration policy from that solution. Right now I've implemented the expiration policy from the https://dl.acm.org/doi/fullHtml/10.1145/3422575.3422797 paper, but it relies too heavily on the randomized algorithm from Redis and doesn't make it easy to implement an expiration policy for a constant ttl.

  2. This is a debatable point. The main target audience of otter really is web applications. But lately more and more database engines are being written in Go, and they need out-of-the-box cache library solutions even more, because unlike typical web applications they can't use redis or memcached. For example, ristretto was written specifically for the dgraph and badger databases, and it is also being used by vitess and spicedb.

  3. Sounds like a simple and efficient solution, yes.
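
A minimal sketch of the constant-ttl idea from item 1: with a single fixed ttl, insertion order equals expiration order, so a plain fifo queue gives amortized O(1) proactive expiration without per-object prev/next pointers. All names here are illustrative, not otter's API:

```go
import "time"

// expEntry is a hypothetical cache entry tracked by the expiration policy.
type expEntry struct {
	key       string
	expiresAt time.Time
}

// constTTLPolicy: because every entry lives exactly ttl, the FIFO order of
// insertion is also the order of expiration.
type constTTLPolicy struct {
	ttl   time.Duration
	queue []*expEntry // oldest entry first
}

func (p *constTTLPolicy) add(key string) *expEntry {
	e := &expEntry{key: key, expiresAt: time.Now().Add(p.ttl)}
	p.queue = append(p.queue, e)
	return e
}

// expire pops entries from the head until the first one that has not yet
// expired; each entry is touched at most once, so the cost is amortized O(1).
func (p *constTTLPolicy) expire(now time.Time, onExpire func(*expEntry)) {
	for len(p.queue) > 0 && !p.queue[0].expiresAt.After(now) {
		onExpire(p.queue[0])
		p.queue = p.queue[1:]
	}
}
```

Removing an arbitrary entry before its ttl elapses would still need something like a tombstone or an index into the queue, which is where a variable-ttl policy gets more complicated.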

maypok86 commented 10 months ago

Although I doubt such DBs would have traffic similar to DS1 or S3 traces...

maypok86 commented 9 months ago

I think I've addressed all the comments. It is better to discuss S3-FIFO itself in another issue.

1a1a11a commented 9 months ago

About the expiration policy, I might have missed something. It looks like you would need to redistribute all items (under a lock) every time the timing wheel turns. Is that correct?

maypok86 commented 9 months ago

Eh, I got too caught up in preparing the first otter release and chasing the dream of hundreds of millions of qps.

A standard timer wheel implementation would require this, but there are variants that guarantee amortized O(1) time complexity.

I plan to use a hashed hierarchical timing wheel.

- The original paper
- An article from the Kafka developers on how they use this data structure
- An article comparing different types of timing wheels

1a1a11a commented 9 months ago

Hi @maypok86, thank you for sharing these blogs! I am aware of hashed and hierarchical timing wheels. But they are two different timing wheels (the hashed timing wheel and the hierarchical timing wheel) with different tradeoffs. What do you mean by a hashed hierarchical timing wheel?

My guess is that most people go for the hierarchical timing wheel because the per-tick time complexity of the hashed timing wheel is too high. But as I mentioned earlier, the problem with the hierarchical timing wheel is high tail latency: when an upper-level wheel moves by one tick, it requires locking. For example, when the minute wheel ticks by 1, all events belonging to the next minute are redistributed to the second wheel. It is worse when the day wheel ticks, because all objects expiring in the next day are moved to the hour wheel, and this requires holding the lock for a very long time.

maypok86 commented 9 months ago

Oh @1a1a11a, I apologize for the confusion. I was of course referring to the hierarchical timer wheel; "hashed" just refers to the fact that this wheel doesn't need sorting and uses hashing instead of comparisons.

The day wheel expiration thing is something to think about. Sounds like something really nasty, especially since I see setting ttl to 1 day quite often in my work.😞
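
For illustration, here is a sketch of what "hashing instead of comparing" can look like in a two-level hierarchical wheel, including the cascade step that causes the tail-latency concern discussed above. Everything here is hypothetical (integer ticks of one second, two 60-bucket levels, every ttl under an hour), not otter's implementation:

```go
// timerEntry is a hypothetical expiring cache entry.
type timerEntry struct {
	key      string
	expireAt int64 // absolute expiration time in ticks (1 tick = 1 second)
}

// wheelLevel is one level of the hierarchy. Entries are placed by hashing the
// expiration time into a bucket; no sorting or comparisons inside a bucket.
type wheelLevel struct {
	span    int64 // how many base ticks one bucket covers
	buckets [][]*timerEntry
}

func (w *wheelLevel) index(expireAt int64) int {
	return int((expireAt / w.span) % int64(len(w.buckets)))
}

func (w *wheelLevel) add(e *timerEntry) {
	i := w.index(e.expireAt)
	w.buckets[i] = append(w.buckets[i], e)
}

// take removes and returns the bucket that corresponds to tick t.
func (w *wheelLevel) take(t int64) []*timerEntry {
	i := w.index(t)
	b := w.buckets[i]
	w.buckets[i] = nil
	return b
}

// timingWheel has a seconds level and a minutes level; a real implementation
// adds more levels (hours, days) with the same structure.
type timingWheel struct {
	now     int64
	seconds *wheelLevel // 60 buckets of 1 tick
	minutes *wheelLevel // 60 buckets of 60 ticks
}

func newTimingWheel() *timingWheel {
	return &timingWheel{
		seconds: &wheelLevel{span: 1, buckets: make([][]*timerEntry, 60)},
		minutes: &wheelLevel{span: 60, buckets: make([][]*timerEntry, 60)},
	}
}

func (t *timingWheel) add(e *timerEntry) {
	if e.expireAt-t.now < 60 {
		t.seconds.add(e)
	} else {
		t.minutes.add(e)
	}
}

// tick advances time by one second. On a minute boundary the whole minute
// bucket is cascaded into the seconds level; this redistribution is the step
// that has to run under a lock and is the source of the tail-latency concern.
func (t *timingWheel) tick(onExpire func(*timerEntry)) {
	t.now++
	if t.now%60 == 0 {
		for _, e := range t.minutes.take(t.now) {
			t.seconds.add(e) // cascade: rehash one minute's worth of entries
		}
	}
	for _, e := range t.seconds.take(t.now) {
		onExpire(e)
	}
}
```
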

1a1a11a commented 9 months ago

Maybe some approximations can be done to address the problem. Or we can use Memcached's approach, which uses a dedicated thread to move 100 (number is made up) entries at a time, then releases the lock and sleeps for some time before the next pass.

But I think even the hourly timing wheel will be an issue. Assuming a 256 GB cache with a mean object size of 256 bytes, 256 GB / 256 B = 1 billion objects are stored. If we assume expiration times are evenly distributed over a week, then about 1 billion / (7 * 24) ≈ 6 million objects expire each hour, so each time the hourly wheel ticks you need to process roughly 6 million objects, which would take tens to hundreds of milliseconds.
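
A sketch of the incremental approach described above: a background goroutine drains the already-due entries in small batches, releasing the lock and sleeping between batches so it never holds the lock across millions of removals. The batch size, the pause, and the types are made up (timerEntry reuses the earlier sketch):

```go
import (
	"sync"
	"time"
)

// incrementalExpirer processes already-due entries a small batch at a time.
type incrementalExpirer struct {
	mu        sync.Mutex
	due       []*timerEntry // entries whose bucket has already come due
	batchSize int           // e.g. 100; a made-up number, tuned to the latency SLA
	pause     time.Duration // how long to sleep between batches
	remove    func(*timerEntry)
}

func (e *incrementalExpirer) run() {
	for {
		e.mu.Lock()
		n := e.batchSize
		if n > len(e.due) {
			n = len(e.due)
		}
		for _, entry := range e.due[:n] {
			e.remove(entry)
		}
		e.due = e.due[n:]
		e.mu.Unlock()

		// Yield so foreground requests are not blocked for a whole tick.
		time.Sleep(e.pause)
	}
}
```
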

1a1a11a commented 9 months ago

Hi @maypok86, it was great to see the Hacker News post attract so much attention! I was reviewing the figures in the readme. The first thing that strikes me is the weird results on Zipf workloads; do you mind explaining how they were generated? On my end, I am seeing S3-FIFO outperform ARC at almost all sizes (although not by a large margin). This uses a Zipf 1.0 trace of 0.72 million objects and 10 million requests.

ARC cache size     1000, 10000000 req, miss ratio 0.4970
ARC cache size     5000, 10000000 req, miss ratio 0.3872
ARC cache size    10000, 10000000 req, miss ratio 0.3405
ARC cache size    50000, 10000000 req, miss ratio 0.2344
ARC cache size   100000, 10000000 req, miss ratio 0.1900
ARC cache size   500000, 10000000 req, miss ratio 0.0905
ARC cache size   100000, 10000000 req, miss ratio 0.1900
ARC cache size   200000, 10000000 req, miss ratio 0.1470
S3FIFO-0.1000-2 cache size     1000, 10000000 req, miss ratio 0.4952
S3FIFO-0.1000-2 cache size     5000, 10000000 req, miss ratio 0.3851
S3FIFO-0.1000-2 cache size    10000, 10000000 req, miss ratio 0.3380
S3FIFO-0.1000-2 cache size    50000, 10000000 req, miss ratio 0.2320
S3FIFO-0.1000-2 cache size   100000, 10000000 req, miss ratio 0.1884
S3FIFO-0.1000-2 cache size   500000, 10000000 req, miss ratio 0.0922
S3FIFO-0.1000-2 cache size   100000, 10000000 req, miss ratio 0.1884
S3FIFO-0.1000-2 cache size   200000, 10000000 req, miss ratio 0.1465
maypok86 commented 9 months ago

Hi @1a1a11a, I wasn't even worried about that :). The main thing for me was that S3-FIFO showed a good result.

Actually, there might be something wrong here, because I just took the zipf parameters from the ristretto benchmarks and didn't check them in any way. Here are the parameters of the zipf distribution, here is the description of these parameters, and this run uses 1,000,000 requests.

I tested otter along with a similar s3-fifo implementation https://github.com/scalalang2/golang-fifo and they seem to have about the same results.

Zipf:
        Capacity: 500
                ARC ratio: 24.19
                Otter ratio: 24.29
                S3FIFO ratio: 24.33
        Capacity: 1000
                ARC ratio: 28.60
                Otter ratio: 28.42
                S3FIFO ratio: 28.48
        Capacity: 2000
                ARC ratio: 32.94
                Otter ratio: 32.64
                S3FIFO ratio: 32.76
        Capacity: 5000
                ARC ratio: 38.12
                Otter ratio: 37.65
                S3FIFO ratio: 37.88
        Capacity: 10000
                ARC ratio: 41.92
                Otter ratio: 41.10
                S3FIFO ratio: 41.14
        Capacity: 20000
                ARC ratio: 44.94
                Otter ratio: 43.93
                S3FIFO ratio: 43.90
        Capacity: 40000
                ARC ratio: 47.42
                Otter ratio: 46.38
                S3FIFO ratio: 46.36
        Capacity: 80000
                ARC ratio: 49.83
                Otter ratio: 48.80
                S3FIFO ratio: 48.90
1a1a11a commented 9 months ago

Hi @maypok86, thank you for the explanations! How was the miss ratio of ARC obtained?

maypok86 commented 9 months ago

@1a1a11a in the usual way. If the item was in the cache, we increment the number of hits, otherwise we increment the number of misses. And then we calculate the hit ratio using the formula: 100 * (hits / (hits+misses))
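
For reference, a sketch of that measurement using Go's standard Zipf generator. The Cache interface and the Zipf parameters here are illustrative only, not the exact ones from the ristretto benchmarks or otter's API:

```go
import "math/rand"

// Cache is a minimal hypothetical interface; otter's real API differs.
type Cache interface {
	Get(key uint64) (uint64, bool)
	Set(key uint64, value uint64)
}

// hitRatio replays a Zipf-distributed key trace against a cache and returns
// 100 * hits / (hits + misses), the formula described above.
func hitRatio(c Cache, requests int) float64 {
	// Illustrative parameters: skew s=1.01, v=1, about one million distinct keys.
	z := rand.NewZipf(rand.New(rand.NewSource(1)), 1.01, 1, 1_000_000)

	var hits, misses float64
	for i := 0; i < requests; i++ {
		key := z.Uint64()
		if _, ok := c.Get(key); ok {
			hits++
		} else {
			misses++
			c.Set(key, key)
		}
	}
	return 100 * hits / (hits + misses)
}
```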

1a1a11a commented 9 months ago

Oh, my comment may have been confusing; I was wondering where the ARC implementation comes from.

maypok86 commented 9 months ago

Oh, it uses the best-known implementation, from hashicorp: https://github.com/hashicorp/golang-lru/blob/main/arc/arc.go

maypok86 commented 9 months ago

The LRU implementation is also taken from the same repository.

1a1a11a commented 9 months ago

Thank you! The implementation matches the libCacheSim results. :)