Closed 1a1a11a closed 9 months ago
I think I've already fixed the first two problems in the dev branch https://github.com/maypok86/otter/tree/dev
Interesting idea, I'll have to try it. So far I've limited the size of the ghost queue to the number of items in the small and main queues, and in principle it works fine, were it not for the S3 and DS1 traces.
Hmmm, that sounds very true, I probably just made up my own non-existent cases 🙂
P.S. I tried debugging, and the `if` really isn't needed
Awesome work! :) About scalability, I think if you want to improve scalability, one easy change is to make read, write, and eviction concurrent using atomics.
@1a1a11a Speed is otter's best feature by a huge margin so far, so for me the higher priority for now is
Makes sense!
The problem with the expiration policy is that I would like a proactive expiration policy with amortized O(1) complexity that requires less than 16 bytes per object. So far I'm leaning towards the timer wheel, even though it requires 16 bytes per object (prev and next pointers); this solution also makes it easy to build a separate policy for a constant ttl using a regular fifo queue (which is the most common need when using a cache library). The segcache expiration policy looks suitable, except that it doesn't support removing a random element from the expiration policy (I think I know how to solve this, but in the end the memory footprint advantage would be very small), and it is not clear how to derive a nice constant-ttl expiration policy from it. Right now I've implemented the expiration policy from the https://dl.acm.org/doi/fullHtml/10.1145/3422575.3422797 paper, but it relies too heavily on Redis's randomized algorithm and doesn't make it easy to implement an expiration policy for a constant ttl.
This is an ambiguous point. The main target audience of otter really is web applications. But lately more and more database engines are written in Go, and they need out-of-the-box cache library solutions even more, because unlike normal web applications they can't use redis or memcached. For example, ristretto was written specifically for the dgraph and badger databases, and it is also used by vitess and spicedb.
Sounds like a simple and efficient solution, yes.
Although I doubt such DBs would have traffic similar to DS1 or S3 traces...
I think I've corrected all the comments. It is better to talk about S3-FIFO in another issue.
About the expiration policy, I might have missed something. It looks like you would need to re-distribute all items (under a lock) every time the timing wheel turns. Is that correct?
Eh, I got too caught up in preparing the first otter release and reaching for the stars: hundreds of millions of qps.
A standard timer wheel implementation would require this, but there are variants that guarantee amortized O(1) time complexity.
I plan to use a hashed hierarchical timing wheel.
- Original paper
- Article from the Kafka developers on how they use this data structure
- Article comparing different types of timing wheels
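The "hashed" part of the structure can be sketched in a few lines: instead of keeping timers sorted, a hierarchical wheel locates a timer's bucket with integer division and modulo on its deadline. This is a simplified illustration (it ignores the current-time offset a real wheel tracks); `wheelSize` is an arbitrary illustrative constant, not a value from any real implementation.

```go
package main

import "fmt"

// wheelSize is the number of buckets per level; 64 is illustrative.
const wheelSize = 64

// locate returns the wheel level and bucket index for a timer that is
// `delta` ticks in the future. Level 0 covers deltas in [0, 64),
// level 1 covers [64, 64*64), and so on: each level up is 64x coarser.
// No sorting or comparisons are needed, just division and modulo.
func locate(delta uint64) (level, bucket uint64) {
	for delta >= wheelSize {
		delta /= wheelSize // move one level up at 64x coarser resolution
		level++
	}
	return level, delta
}

func main() {
	fmt.Println(locate(10)) // level 0, bucket 10
	fmt.Println(locate(65)) // level 1, bucket 1
}
```

When a coarse-level bucket comes due, its timers are redistributed into finer levels with the same computation, which is exactly the redistribution step discussed below.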
Hi @maypok86, thank you for sharing these blogs! I am aware of hashed and hierarchical timing wheels. But they are two different timing wheels (hashed timing wheel and hierarchical timing wheel) with different tradeoffs. What do you mean by hashed hierarchical timing wheel?
My guess is that most people go for the hierarchical timing wheel because the per-tick time complexity of the hashed timing wheel is too high. But as I mentioned earlier, the problem is that the hierarchical timing wheel causes high tail latency: when an upper-level timing wheel moves by one tick, it requires locking. For example, when the minute wheel ticks by 1, all events belonging to the next minute are redistributed to the second wheel. It is worse when the day wheel ticks, because all objects expiring in the next day are moved to the hour wheel, and this requires holding the lock for a very long time.
Oh @1a1a11a, I apologize for the confusion. I was of course referring to the hierarchical timer wheel, "hashed" just refers to the fact that this wheel doesn't need sorting and uses hashing instead of comparing.
The day wheel expiration thing is something to think about. Sounds like something really nasty, especially since I see setting ttl to 1 day quite often in my work.😞
Maybe some approximations can be done to address the problem. Or we could use Memcached's approach: a dedicated thread moves 100 (number is made up) entries at a time, then releases the lock and sleeps for some time before the next operation.
But I think even the hourly timing wheel will be an issue. Assume a 256GB cache with a mean object size of 256 bytes: 256GB / 256B = ~1 billion objects are stored. If we assume expiration times are evenly distributed over a week, then 1 billion / (7 * 24) ≈ 6 million objects expire each hour, so each time the hourly wheel ticks you need to process about 6 million objects, which would take tens to hundreds of milliseconds.
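The back-of-the-envelope arithmetic above checks out (using binary gigabytes, 256GB is 2^30 objects, so slightly over a literal billion):

```go
package main

import "fmt"

func main() {
	// Numbers from the discussion: 256GB cache, 256B mean object size,
	// expiry spread evenly over one week (7 * 24 hourly ticks).
	const cacheBytes = 256 << 30 // 256 GiB
	const objBytes = 256
	objects := cacheBytes / objBytes // 2^30 objects
	perHour := objects / (7 * 24)
	fmt.Println(objects, perHour) // prints 1073741824 6391320
}
```

So roughly 6.4 million objects per hourly tick with binary units, or just under 6 million with a round decimal billion; either way, millions of objects processed under one lock acquisition.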
Hi @maypok86, it was great to see the hackernews post attract so much attention! I was reviewing the figures in the readme. The first thing that strikes me is the weird results on the Zipf workloads; do you mind explaining how they were generated? On my end, I see S3-FIFO outperforming ARC at almost all sizes (although not by a large margin). This uses a Zipf 1.0 trace with 0.72 million objects and 10 million requests.
ARC cache size 1000, 10000000 req, miss ratio 0.4970
ARC cache size 5000, 10000000 req, miss ratio 0.3872
ARC cache size 10000, 10000000 req, miss ratio 0.3405
ARC cache size 50000, 10000000 req, miss ratio 0.2344
ARC cache size 100000, 10000000 req, miss ratio 0.1900
ARC cache size 200000, 10000000 req, miss ratio 0.1470
ARC cache size 500000, 10000000 req, miss ratio 0.0905
S3FIFO-0.1000-2 cache size 1000, 10000000 req, miss ratio 0.4952
S3FIFO-0.1000-2 cache size 5000, 10000000 req, miss ratio 0.3851
S3FIFO-0.1000-2 cache size 10000, 10000000 req, miss ratio 0.3380
S3FIFO-0.1000-2 cache size 50000, 10000000 req, miss ratio 0.2320
S3FIFO-0.1000-2 cache size 100000, 10000000 req, miss ratio 0.1884
S3FIFO-0.1000-2 cache size 200000, 10000000 req, miss ratio 0.1465
S3FIFO-0.1000-2 cache size 500000, 10000000 req, miss ratio 0.0922
Hi @1a1a11a, I wasn't even worried about that :). The main thing for me was that S3-FIFO showed a good result.
Actually, there might be something wrong here, because I just took the zipf parameters from the ristretto benchmarks and didn't verify them in any way. Here are the parameters of the zipf distribution, and here is the description of those parameters; this run uses 1000000 requests.
I tested otter along with a similar s3-fifo implementation https://github.com/scalalang2/golang-fifo and they seem to have about the same results.
Zipf:
Capacity: 500
ARC ratio: 24.19
Otter ratio: 24.29
S3FIFO ratio: 24.33
Capacity: 1000
ARC ratio: 28.60
Otter ratio: 28.42
S3FIFO ratio: 28.48
Capacity: 2000
ARC ratio: 32.94
Otter ratio: 32.64
S3FIFO ratio: 32.76
Capacity: 5000
ARC ratio: 38.12
Otter ratio: 37.65
S3FIFO ratio: 37.88
Capacity: 10000
ARC ratio: 41.92
Otter ratio: 41.10
S3FIFO ratio: 41.14
Capacity: 20000
ARC ratio: 44.94
Otter ratio: 43.93
S3FIFO ratio: 43.90
Capacity: 40000
ARC ratio: 47.42
Otter ratio: 46.38
S3FIFO ratio: 46.36
Capacity: 80000
ARC ratio: 49.83
Otter ratio: 48.80
S3FIFO ratio: 48.90
Hi @maypok86, Thank you for the explanations! How was the miss ratio of ARC obtained?
Oh, my comment may be confusing; I was wondering where the ARC implementation is.
Oh, it uses the most famous implementation from hashicorp https://github.com/hashicorp/golang-lru/blob/main/arc/arc.go
The LRU implementation is also taken from the same repository.
Thank you! The implementation matches the libCacheSim results. :)
If my understanding is correct, the current implementation has a bug that may evict more objects than needed. I believe it should `return` at this line instead of continuing to evict, because an object has already been evicted from the main queue: https://github.com/maypok86/otter/blob/main/internal/s3fifo/small.go#L62

As you have mentioned (https://github.com/Yiling-J/theine-go/issues/29), it is better to store a fingerprint instead of the full object in the ghost. Storing data in the ghost defeats the purpose of using a ghost queue...
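The shape of the bug being described can be sketched as follows. This is not otter's actual code (the real logic lives in `internal/s3fifo/small.go` and also promotes into the main queue's own eviction path); all types and names here are made up to illustrate the one key point: once a victim has been evicted, the loop should `return` instead of continuing.

```go
package main

import "fmt"

// node is a hypothetical queue entry with an access-frequency counter.
type node struct {
	key  string
	freq int
}

type cache struct {
	small, main []node
	evicted     []string
}

// evictFromSmall scans the small FIFO queue: entries that were touched
// (freq > 0) are promoted to the main queue; the first cold entry found
// is evicted, and the function returns immediately, which is the fix,
// since continuing would evict more objects than needed.
func (c *cache) evictFromSmall() {
	for len(c.small) > 0 {
		n := c.small[0]
		c.small = c.small[1:]
		if n.freq > 0 {
			c.main = append(c.main, n) // promote warm entries
			continue
		}
		c.evicted = append(c.evicted, n.key)
		return // stop after one eviction
	}
}

func main() {
	c := &cache{small: []node{{"a", 1}, {"b", 0}, {"c", 0}}}
	c.evictFromSmall()
	fmt.Println(c.evicted) // prints [b]: "a" is promoted, "c" stays queued
}
```

Without the `return`, the same call would also evict "c", which is the over-eviction the comment points out.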
The size of the ghost queue does not need to be accurate, so you can estimate it as the total cache size divided by the mean object size. VMware found that the ghost queue size has little impact anywhere between 40% and 200% of the main queue size (in terms of #objects).
There are multiple `if` statements that seem unnecessary, e.g., https://github.com/maypok86/otter/blob/main/internal/s3fifo/small.go#L55