Open andrii0lomakin opened 8 years ago
This is very nicely thought out. If you don't mind, I have a few minor questions.
1. Will the "operations buffer" be used for read+write or only reads? In Caffeine the read buffer (ring buffers) is lossy, so if it fills to capacity the additions are discarded instead of waiting for free space. This is okay because cache access patterns follow the 80/20 rule, so most accesses are to the same hot entries. Since we're capturing enough information to know which are hot, we can prefer to minimize latency instead of being strict in our recording phase.
However writes cannot be lost, but they are a much rarer event. We use a dedicated buffer for writes instead of blocking on the exclusive lock. For efficiency we're using JCTools' MpscChunkedArrayQueue. When a write is recorded we prioritize draining the buffers, as an eviction might be required (a rough sketch of this read/write buffering follows the list below).
2. Do you plan on revisiting the 2Q policy and considering alternatives? TinyLFU is very simple to implement and provides a superior hit rate.
3. What are the pros/cons of a custom implementation vs. using Caffeine directly? I can imagine some lifecycle differences, a desire to own the code, etc. Hopefully you can take advantage of pulling code out of Caffeine where advantageous.
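A rough illustration of the buffering described in point 1, using plain JDK queues as stand-ins for Caffeine's ring buffers and JCTools' MpscChunkedArrayQueue; all class and method names here are illustrative, not Caffeine's internals:

```java
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;

final class BufferedPolicy<K> {
  // Lossy read buffer: if it is full, offer() fails and the access is simply dropped.
  private final Queue<K> readBuffer = new ArrayBlockingQueue<>(128);
  // Write buffer: writes are rare but must never be lost, so keep it unbounded here.
  private final Queue<Runnable> writeBuffer = new ConcurrentLinkedQueue<>();

  void recordRead(K key) {
    readBuffer.offer(key);        // best effort; hot keys will be seen again anyway
  }

  void recordWrite(Runnable task) {
    writeBuffer.add(task);        // never dropped
    drainBuffers();               // prioritize draining: an eviction may be required
  }

  private void drainBuffers() {
    K key;
    while ((key = readBuffer.poll()) != null) {
      // update the eviction policy with the recorded access
    }
    Runnable task;
    while ((task = writeBuffer.poll()) != null) {
      task.run();                 // apply the pending write to the policy, then evict if needed
    }
  }
}
```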
Let me know if I can be of help.
Hi @ben-manes, thank you very much for the feedback.
About your questions.
Will the "operations buffer" be used for read+write or only reads?
We plan to use a separate queue to log additions of new entries. It would be great to use MpscChunkedArrayQueue as you suggest, but we try to minimize the number of dependencies, so I am not sure it would be approved.
Do you plan on revisiting the 2Q policy and considering alternatives?
Not yet. We tried LIRS and TinyLFU: the LIRS results were even worse than 2Q, and for TinyLFU the cache hit rate was the same as for 2Q. Maybe we missed something. We are going to create a tool to gather traces of cache accesses from production deployments and will rerun our tests.
What are the pros/cons of a custom implementation vs. using Caffeine directly?
Maybe I am wrong, but Caffeine does not support acquire/release semantics. I mean that if I cache#get() data and they are in use, they may still be removed from the cache. But in our design the cache is the single source of pages for all DB components; it also tracks dirty pages and flushes them to disk when they are "released" back to the cache. So pages cannot be evicted while they are acquired from the cache and used by components.
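Purely for illustration, a hypothetical sketch of such an acquire/release contract (the names do not match OrientDB's actual API):

```java
interface CachePage {
  boolean isDirty();
}

interface ReadCache {
  /** Pins the page: it cannot be evicted until release() is called. */
  CachePage acquire(long fileId, long pageIndex);

  /** Unpins the page; a dirty page becomes eligible for flushing and later eviction. */
  void release(CachePage page);
}
```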
We are going to create a tool to gather traces of cache access from production deployments and will rerun our tests again.
This would be great. I'd appreciate it if you can supply traces so that I can digest them with my simulator. It operates on long keys (no values), so it is easy to obfuscate them using a hash.
Caffeine does not support acquire/release semantic
Yes, this is not a use-case it was designed for as it is not a general purpose scenario. It can be emulated in awkward ways, which is good enough for limited cases but perhaps not yours.
The first way is to use a CacheWriter to capture the entry when it is incorrectly evicted. The writer would add it into a map and asynchronously reinsert it. A CacheLoader would load from the map prior to the system-of-record. This latter part avoids a race, but assumes you operate only by get and not getIfPresent.
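For illustration, a rough sketch of this CacheWriter approach, assuming the Caffeine 2.x API (CacheWriter was removed in Caffeine 3.x); Page, isAcquired(), and loadFromDisk() are hypothetical placeholders:

```java
import com.github.benmanes.caffeine.cache.CacheWriter;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import com.github.benmanes.caffeine.cache.RemovalCause;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class PinnedPageCache {
  private final ConcurrentMap<Long, Page> pinned = new ConcurrentHashMap<>();
  private final LoadingCache<Long, Page> cache = Caffeine.newBuilder()
      .maximumSize(10_000)
      .writer(new CacheWriter<Long, Page>() {
        @Override public void write(Long key, Page value) {}
        @Override public void delete(Long key, Page value, RemovalCause cause) {
          if (cause.wasEvicted() && value.isAcquired()) {
            pinned.put(key, value);   // capture the wrongly evicted entry
            // ... then asynchronously re-insert it and remove it from the side map
          }
        }
      })
      // The loader consults the side map before the system of record, which closes
      // the race as long as callers use get() rather than getIfPresent().
      .build(key -> {
        Page captured = pinned.remove(key);
        return (captured != null) ? captured : loadFromDisk(key);
      });

  Page get(long pageIndex) { return cache.get(pageIndex); }

  private Page loadFromDisk(Long key) { return new Page(); }
  static final class Page { boolean isAcquired() { return false; } }
}
```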
The second way is to use zero weights to disqualify the entry from size eviction. However, that has to be done carefully, as weighing occurs outside of any critical section, so you are prone to race conditions on rapid acquire/release. You'd have to lock externally around the write operation (e.g. Striped) to avoid a stale weight.
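A minimal sketch of the zero-weight approach, assuming a hypothetical acquired flag on the page; on release the entry must be re-put (ideally under an external lock such as Striped) so its weight is recomputed:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

final class ZeroWeightExample {
  static final class Page { volatile boolean acquired; }

  // An acquired page weighs 0, so size-based eviction skips it.
  final Cache<Long, Page> cache = Caffeine.newBuilder()
      .maximumWeight(65_536)
      .weigher((Long key, Page page) -> page.acquired ? 0 : 1)
      .build();

  void release(long pageIndex, Page page) {
    page.acquired = false;
    cache.put(pageIndex, page);   // re-weigh the entry so it becomes evictable again
  }
}
```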
But the fact that it might be hacked on may not be good enough, so a custom implementation is very reasonable.
The only open question I see is when to free direct memory pointers once they are removed during cache entry eviction. Because of the step
If the requested page has the given state, it should be deleted from the cache (in this way we help the eviction process to remove it), but the content of this page is copied and inserted back into the cache with the "released" state, unless the page has already been inserted by another thread. This trick is needed to keep thread safety guarantees during page eviction.
we may access a page which has already been reclaimed to the pool of pages. Here we may use one of three strategies (we will check later which one is best): quiescent-state-based reclamation, epoch-based reclamation, or safe memory reclamation.
You might also leverage phantom references, either to always defer or as a failsafe, to release native resources when the page is garbage collected.
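A minimal sketch of such a phantom-reference failsafe; the page tracking and freeNative() names are hypothetical:

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class PageReclaimer {
  static final class PageRef extends PhantomReference<Object> {
    final long nativePointer; // captured eagerly; the referent is gone when dequeued
    PageRef(Object page, long nativePointer, ReferenceQueue<Object> queue) {
      super(page, queue);
      this.nativePointer = nativePointer;
    }
  }

  private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
  private final Set<PageRef> refs = ConcurrentHashMap.newKeySet(); // keep references reachable

  void track(Object page, long nativePointer) {
    refs.add(new PageRef(page, nativePointer, queue));
  }

  void drain() { // call periodically or from a background thread
    PageRef ref;
    while ((ref = (PageRef) queue.poll()) != null) {
      refs.remove(ref);
      freeNative(ref.nativePointer);
    }
  }

  private void freeNative(long pointer) { /* return the memory to the page pool */ }
}
```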
Latest log from YCSB tests
2016-12-23 09:35:25:273 2410 sec: 90633290 operations; 54936.6 current ops/sec; est completion in 4 minutes [READ: Count=549369, Max=287743, Min=22, Avg=144.16, 90=318, 99=491, 99.9=909, 99.99=2747]
2016-12-23 09:35:42:196 2426 sec: 90872329 operations; 14124.26 current ops/sec; est completion in 4 minutes [READ: Count=239034, Max=12075007, Min=23, Avg=312.36, 90=318, 99=487, 99.9=772, 99.99=2703]
2016-12-23 09:35:45:273 2430 sec: 91052631 operations; 58615.73 current ops/sec; est completion in 3 minutes [READ: Count=180305, Max=12075007, Min=20, Avg=469.78, 90=318, 99=497, 99.9=803, 99.99=2835]
2016-12-23 09:35:55:273 2440 sec: 91598434 operations; 54580.3 current ops/sec; est completion in 3 minutes [READ: Count=545810, Max=270335, Min=23, Avg=145.09, 90=318, 99=480, 99.9=674, 99.99=2493]
2016-12-23 09:36:05:273 2450 sec: 92137313 operations; 53887.9 current ops/sec; est completion in 3 minutes [READ: Count=538872, Max=210943, Min=23, Avg=146.95, 90=319, 99=489, 99.9=746, 99.99=10191]
2016-12-23 09:36:15:273 2460 sec: 92690012 operations; 55269.9 current ops/sec; est completion in 3 minutes [READ: Count=552697, Max=251007, Min=21, Avg=143.29, 90=316, 99=482, 99.9=686, 99.99=2443]
2016-12-23 09:36:25:273 2470 sec: 93236418 operations; 54640.6 current ops/sec; est completion in 3 minutes [READ: Count=546407, Max=329983, Min=23, Avg=144.91, 90=318, 99=484, 99.9=688, 99.99=2595]
2016-12-23 09:36:35:273 2480 sec: 93760678 operations; 52426 current ops/sec; est completion in 2 minutes [READ: Count=524263, Max=361983, Min=21, Avg=151.11, 90=317, 99=482, 99.9=683, 99.99=2645]
2016-12-23 09:36:45:273 2490 sec: 94291130 operations; 53045.2 current ops/sec; est completion in 2 minutes [READ: Count=530450, Max=348159, Min=23, Avg=149.32, 90=317, 99=484, 99.9=695, 99.99=2413]
2016-12-23 09:36:55:273 2500 sec: 94841547 operations; 55041.7 current ops/sec; est completion in 2 minutes [READ: Count=550417, Max=235135, Min=20, Avg=143.49, 90=317, 99=483, 99.9=691, 99.99=2609]
2016-12-23 09:37:05:435 2510 sec: 95398626 operations; 54814.42 current ops/sec; est completion in 2 minutes [READ: Count=557081, Max=244479, Min=23, Avg=144.45, 90=317, 99=482, 99.9=669, 99.99=2505]
2016-12-23 09:37:15:273 2520 sec: 95939999 operations; 55034.36 current ops/sec; est completion in 1 minutes [READ: Count=541371, Max=317695, Min=22, Avg=143.89, 90=318, 99=481, 99.9=663, 99.99=1798]
2016-12-23 09:37:25:273 2530 sec: 96468361 operations; 52836.2 current ops/sec; est completion in 1 minutes [READ: Count=528366, Max=327935, Min=23, Avg=149.92, 90=318, 99=484, 99.9=690, 99.99=2633]
2016-12-23 09:37:35:273 2540 sec: 97000192 operations; 53183.1 current ops/sec; est completion in 1 minutes [READ: Count=531829, Max=322815, Min=23, Avg=148.94, 90=317, 99=478, 99.9=678, 99.99=2557]
2016-12-23 09:37:45:503 2550 sec: 97556236 operations; 54348.94 current ops/sec; est completion in 1 minutes [READ: Count=556041, Max=251647, Min=23, Avg=143.44, 90=317, 99=485, 99.9=684, 99.99=2393]
2016-12-23 09:37:55:273 2560 sec: 98096151 operations; 55268.2 current ops/sec; est completion in 50 seconds [READ: Count=539916, Max=257919, Min=23, Avg=145.59, 90=318, 99=486, 99.9=691, 99.99=2551]
2016-12-23 09:38:05:273 2570 sec: 98636877 operations; 54072.6 current ops/sec; est completion in 36 seconds [READ: Count=540729, Max=263679, Min=23, Avg=146.46, 90=318, 99=482, 99.9=686, 99.99=2215]
2016-12-23 09:38:15:272 2580 sec: 99169976 operations; 53309.9 current ops/sec; est completion in 22 seconds [READ: Count=533094, Max=611839, Min=23, Avg=148.58, 90=318, 99=484, 99.9=700, 99.99=2567]
2016-12-23 09:38:32:321 2597 sec: 99388106 operations; 12794.3 current ops/sec; est completion in 16 seconds [READ: Count=218130, Max=12304383, Min=23, Avg=398.29, 90=316, 99=480, 99.9=694, 99.99=2527]
2016-12-23 09:38:35:273 2600 sec: 99562485 operations; 59091.49 current ops/sec; est completion in 12 seconds [READ: Count=174377, Max=12304383, Min=17, Avg=416.08, 90=317, 99=496, 99.9=723, 99.99=1660]
2016-12-23 09:38:45:273 2610 sec: 99986822 operations; 42433.7 current ops/sec; est completion in 1 seconds [READ: Count=424335, Max=285183, Min=20, Avg=139.32, 90=308, 99=460, 99.9=650, 99.99=1772][CLEANUP: Count=7, Max=3, Min=1, Avg=1.86, 90=2, 99=3, 99.9=3, 99.99=3]
So as you can see, read performance is generally high, but because of the presence of a global lock on cache eviction we have periodic slowdowns. This problem should be fixed by the proposed change.
Implemented in versions 3.0.15 and 3.1.
Oh cool. Can you point me to the code? Curious to see what you built.
Thanks! Glancing through it, a few ideas that you might find interesting.
Otherwise looks great :)
@ben-manes. Super cool. Thank you very much for the feedback.
Guys, I will reopen the issue to incorporate further improvements.
Here are some fun examples of the adaptivity correcting the configuration. Corda is recency-skewed and starts with a small window (LFU). Loop is frequency-skewed and starts with a large window (LRU).
Reference:
https://github.com/orientechnologies/orientdb-labs/blob/master/OEP_8.md
Summary:
Asynchronous read cache with state machine
Goals:
Non-Goals:
Success metrics:
Motivation:
To implement thread safety guarantees inside the read cache we use an implementation of group locks which is close to Google's Striped. But this approach has a couple of disadvantages:
An alternative approach is proposed to overcome these disadvantages. It is proposed to:
The proposed design is an adaptation of the design of the Caffeine framework, which has excellent scalability characteristics.
Description:
The current workflow of the 2Q cache looks like the following:
As an alternative, the following design is proposed:
Lock-free operations buffer.
To gather statistics, we will use a ring buffer implemented as a plain array. However, this buffer will be represented not as a single array but as an array of arrays. Each of the arrays will be used by a subset of threads to minimize contention between threads. If the threshold on one of those arrays is reached, all of the arrays will be emptied by one of the threads, which acquires the lock via a tryLock operation to prevent contention between threads. The lock mentioned above is used only during buffer flush and is not used while logging statistics into the buffers. Pointers inside the buffer will be maintained without CAS operations; as a result, a few operations will be lost, but this will not cause significant changes in the overall statistics. There is a limit on the amount of records flushed at once per thread: each thread flushes no more than 2 * threshold elements from the buffer.
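A minimal sketch of this scheme follows; the stripe count, capacity, threshold, and the onAccess() hook are illustrative assumptions, not the actual implementation:

```java
import java.util.concurrent.locks.ReentrantLock;

final class StripedAccessBuffer<K> {
  private static final int STRIPES = 16;       // power of two; each subset of threads uses one stripe
  private static final int CAPACITY = 256;     // per-stripe ring size
  private static final int THRESHOLD = CAPACITY / 2;

  private final Object[][] buffers = new Object[STRIPES][CAPACITY];
  private final int[] writeIndex = new int[STRIPES];      // plain writes: losing a few records is acceptable
  private final ReentrantLock drainLock = new ReentrantLock();

  /** Records an access; may silently drop the record under contention. */
  void record(K key) {
    int stripe = (int) (Thread.currentThread().getId() & (STRIPES - 1));
    int index = writeIndex[stripe];              // non-atomic read-modify-write by design
    buffers[stripe][index & (CAPACITY - 1)] = key;
    writeIndex[stripe] = index + 1;
    if ((index & (CAPACITY - 1)) >= THRESHOLD && drainLock.tryLock()) {
      try {
        drainAll();                              // only one thread drains; the rest skip
      } finally {
        drainLock.unlock();
      }
    }
  }

  private void drainAll() {
    for (int stripe = 0; stripe < STRIPES; stripe++) {
      for (int i = 0; i < CAPACITY; i++) {
        Object key = buffers[stripe][i];
        if (key != null) {
          buffers[stripe][i] = null;
          onAccess(key);                         // feed the eviction policy (hypothetical hook)
        }
      }
    }
  }

  private void onAccess(Object key) { /* update the 2Q / policy state */ }
}
```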
State machine.
To achieve thread safety guarantees, a state machine with the following states will be introduced:
Eviction process
The removal process is performed during the operations buffer flush (when the state of the eviction policy is updated). During this process:
Regarding the ETA: we have already implemented a similar algorithm (with a different state machine) in the file auto-close functionality. So the worst-case ETA is 10 days, but it will probably be smaller.
Alternatives:
There is no other proposal for cache lock models at the moment.
Risks and assumptions:
In the case of an incorrect or poorly tested implementation, there is a risk of data corruption.
Impact matrix