orientechnologies / orientdb-labs

OrientDB Labs hosts the latest development version of OrientDB.
Apache License 2.0

[OEP 8] Asynchronous read cache with state machine #8

Open andrii0lomakin opened 7 years ago

andrii0lomakin commented 7 years ago

Reference:

https://github.com/orientechnologies/orientdb-labs/blob/master/OEP_8.md

Summary:

Asynchronous read cache with state machine

Goals:

  1. Remove exclusive locks on read-only benchmarks.
  2. Increase scalability of read cache in general.

Non-Goals:

  1. Improve speed in single thread case.

Success metrics:

  1. Improved throughput of read only YCSB workloads.
  2. Improved performance of all YCSB workloads.

Motivation:

To provide thread-safety guarantees inside the read cache, we use an implementation of group locks close to Google Guava's Striped. But this approach has a couple of disadvantages:

  1. Even in the case of read-only benchmarks, we need to acquire an exclusive lock on a cache page to move the page between the queues of the 2Q cache. The thread states during a YCSB "read only" benchmark can be seen at https://screencloud.net/v/3PZd; all "wait" states of threads are caused by exclusive page locks inside the 2Q cache.
  2. Very often, to evict pages when the cache overflows, we need to acquire an exclusive lock on the whole cache, which badly degrades the scalability of the system.

An alternative approach is proposed to overcome these disadvantages:

  1. Gather statistics about page usage asynchronously. This decreases the precision of the cache, but it avoids locking pages when they are loaded from the cache.
  2. Introduce a state machine to keep the thread-safety guarantees.

The proposed design is an adaptation of the design of the Caffeine framework, which has excellent scalability characteristics.

Description:

The current workflow of the 2Q cache looks like the following:

  1. The system requests a page from the cache.
  2. A "Striped"-like lock is applied to the page.
  3. The page is loaded either from the cache or from disk.
  4. The page is moved between the queues of the 2Q cache.
  5. The page is unlocked and returned to the requestor.

As an alternative, the following design is proposed (see the sketch after this list):

  1. All pages are stored not in several queues but in a single concurrent hash map, so we split the cache content from its state. The latter, as before, is stored in queues.
  2. When a page is requested, it is loaded from the hash map, but no lock is applied to the page. Instead, the operation is logged into a lock-free buffer.
  3. The state of the page is changed according to the state-machine rules, so the page is either returned to the requestor, which prevents removal of the page from the cache during an eviction, or it is reloaded from the cache again (a very rare case) and then returned to the user.
  4. When the threshold of the operations buffer is reached, the buffer is flushed by one of the threads, and the state of the 2Q eviction policy is updated accordingly.
  5. During phase 4, if the cache has overflowed, pages are removed from the cache by one of the threads, provided the state of a page allows removing it (the page is not in use).
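
A minimal sketch of the resulting read path, with hypothetical names (CachePage and StripedReadBuffer are fleshed out in the sections below):

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the lock-free read path described above.
final class AsyncReadCacheSketch {
  private final ConcurrentHashMap<Long, CachePage> pages = new ConcurrentHashMap<>();
  private final StripedReadBuffer<Long> readBuffer = new StripedReadBuffer<>();

  CachePage acquirePage(long pageKey) {
    for (;;) {
      // Item 2: plain hash-map lookup, no page lock.
      CachePage page = pages.computeIfAbsent(pageKey, this::loadFromDisk);
      // Item 3: pin the page via the state machine so eviction cannot take it.
      if (page.tryAcquire()) {
        readBuffer.record(pageKey); // item 4: log the access asynchronously
        return page;
      }
      // The page is in the "removed" state: help the evictor and retry, which
      // reloads a fresh copy in the "released" state (the rare case above).
      pages.remove(pageKey, page);
    }
  }

  private CachePage loadFromDisk(long pageKey) {
    return new CachePage(); // placeholder for the real disk read
  }
}
```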

Lock-free operations buffer.

To gather statistics, we will use a ring buffer implemented on top of a plain array. The buffer will be presented not as a single array but as an array of arrays; each array is used by a subset of threads to minimize contention between them. When the threshold on one of those arrays is reached, all of the arrays are emptied by one of the threads, which applies a tryLock operation to prevent contention between threads. The lock mentioned above is used only during buffer flush and is not used while logging statistics into the buffers. Pointers inside the buffer will be updated without CAS operations; as a result, a few operations will be lost, but this will not cause significant changes in the overall statistics. There is a limit on the number of records flushed at once by a single thread: each thread flushes no more than 2 * threshold elements from the buffer.
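
A minimal sketch of such a striped, intentionally lossy buffer, under the assumptions above (all names are hypothetical):

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the striped, lossy operations buffer.
final class StripedReadBuffer<E> {
  private static final int STRIPES = 16;         // ideally sized from the CPU count
  private static final int CAPACITY = 32;        // per-stripe ring size, power of two
  private static final int THRESHOLD = CAPACITY / 2;

  private final Object[][] rings = new Object[STRIPES][CAPACITY];
  private final int[] writeIndexes = new int[STRIPES]; // plain writes: races may lose events
  private final ReentrantLock drainLock = new ReentrantLock();

  void record(E event) {
    // Each thread maps to one stripe to minimize contention.
    int stripe = (int) (Thread.currentThread().getId() & (STRIPES - 1));
    int index = writeIndexes[stripe]++;          // racy on purpose: no CAS
    rings[stripe][index & (CAPACITY - 1)] = event;
    if ((index & (CAPACITY - 1)) == THRESHOLD) {
      tryDrain();
    }
  }

  private void tryDrain() {
    // tryLock: exactly one thread drains, the rest continue without waiting.
    if (drainLock.tryLock()) {
      try {
        int flushed = 0;
        outer:
        for (Object[] ring : rings) {
          for (int i = 0; i < CAPACITY; i++) {
            Object event = ring[i];
            if (event != null) {
              ring[i] = null;
              applyToEvictionPolicy(event);      // e.g. move the page between 2Q queues
              if (++flushed >= 2 * THRESHOLD) {
                break outer;                     // per-flush cap described above
              }
            }
          }
        }
      } finally {
        drainLock.unlock();
      }
    }
  }

  private void applyToEvictionPolicy(Object event) { /* update 2Q state */ }
}
```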

State machine.

To achieve thread-safety guarantees, a state machine with the following states will be introduced (see the sketch after this list):

  1. Acquired - while a page is in the "acquired" state, it is impossible to remove the page from the cache.
  2. Released - the page is free to be acquired by a cache user.
  3. Removed - this state is set by the eviction procedure before the page is withdrawn from the cache. If a requested page is in this state, it should be deleted from the cache (in this way we help the eviction process remove it), but the content of the page is copied and inserted back into the cache in the "released" state, unless the page has already been inserted by another thread. This trick is needed to keep the thread-safety guarantees during page eviction.
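
A minimal sketch of these transitions using compare-and-set, assuming a single reader at a time (a real implementation would need an acquire counter to support concurrent readers):

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical page state machine; a real page would also carry its data and key.
final class CachePage {
  enum State { RELEASED, ACQUIRED, REMOVED }

  private final AtomicReference<State> state = new AtomicReference<>(State.RELEASED);

  // A user pins the page; the eviction process cannot remove an acquired page.
  boolean tryAcquire() {
    return state.compareAndSet(State.RELEASED, State.ACQUIRED);
  }

  // Release the page back to the cache (dirty pages would be flushed here).
  void release() {
    state.compareAndSet(State.ACQUIRED, State.RELEASED);
  }

  // Only a page that is not in use may move to the "removed" state.
  boolean tryMarkRemoved() {
    return state.compareAndSet(State.RELEASED, State.REMOVED);
  }

  boolean isRemoved() {
    return state.get() == State.REMOVED;
  }
}
```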

Eviction process

The removal process is performed during the operations-buffer flush (when the state of the eviction policy is updated). During this process (see the sketch after this list):

  1. The state of a page which is going to be removed from the cache is changed to "removed".
  2. If the state cannot be changed, the next eviction candidate is taken.
  3. If the state is changed, the cache entry is removed from the concurrent hash map. To avoid removing a reloaded page (see item 3 of the "state machine" description), the item is removed using map.remove(key, value) (https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentHashMap.html#remove-java.lang.Object-java.lang.Object-).
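
A minimal sketch of this step, reusing the hypothetical CachePage from the previous section:

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical eviction step: the two-argument remove() only succeeds when the
// map still holds this exact instance, so a page that was re-inserted in the
// "released" state under the same key (state machine, item 3) survives.
final class EvictorSketch {
  private final ConcurrentHashMap<Long, CachePage> pages;

  EvictorSketch(ConcurrentHashMap<Long, CachePage> pages) {
    this.pages = pages;
  }

  boolean tryEvict(long pageKey, CachePage candidate) {
    // Step 1: flip the state first; this fails if the page is acquired.
    if (!candidate.tryMarkRemoved()) {
      return false; // step 2: the caller takes the next eviction candidate
    }
    // Step 3: remove only if the mapping is still this candidate, not a reloaded copy.
    return pages.remove(pageKey, candidate);
  }
}
```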

About the ETA: we have already implemented a similar algorithm (with a different state machine) in the file auto-close functionality, so the ETA for the worst case is 10 days, but probably less.

Alternatives:

There is no other proposal for cache lock models at the moment.

Risks and assumptions:

In the case of an incorrect or poorly tested implementation, there is a risk of data corruption.

Impact matrix

ben-manes commented 7 years ago

This is very nicely thought out. If you don't mind, I have a few minor questions.

1. Will the "operations buffer" be used for read+write or only reads? In Caffeine the read buffers (ring buffers) are lossy, so if they fill to capacity the additions are discarded instead of waiting for free space. This is okay because cache access patterns follow the 80/20 rule, so most accesses are to the same hot entries. Since we're capturing enough information to know which entries are hot, we can prefer to minimize latency instead of being strict in our recording phase.

However, writes cannot be lost, but they are a much rarer event. We use a dedicated buffer for writes instead of blocking on the exclusive lock. For efficiency we're using JCTools' MpscChunkedArrayQueue. When a write is recorded we prioritize draining the buffers, as an eviction might be required.
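
A minimal sketch of the pattern described above, assuming JCTools' MpscChunkedArrayQueue is on the classpath (the surrounding class and method names are hypothetical):

```java
import java.util.concurrent.locks.ReentrantLock;
import org.jctools.queues.MpscChunkedArrayQueue;

// Hypothetical lossless write buffer in the style described above.
final class WriteBufferSketch<E> {
  private final MpscChunkedArrayQueue<E> buffer = new MpscChunkedArrayQueue<>(64, 8_192);
  // The queue is multi-producer/single-consumer, so a lock serializes the consumers.
  private final ReentrantLock drainLock = new ReentrantLock();

  void recordWrite(E task) {
    while (!buffer.offer(task)) {
      // Writes must never be lost: if the queue is full, help drain and retry.
      drainLock.lock();
      try {
        drain();
      } finally {
        drainLock.unlock();
      }
    }
    // Prioritize draining after a write, since an eviction might now be required.
    if (drainLock.tryLock()) {
      try {
        drain();
      } finally {
        drainLock.unlock();
      }
    }
  }

  private void drain() {
    for (E task; (task = buffer.poll()) != null; ) {
      applyToPolicy(task);
    }
  }

  private void applyToPolicy(E task) { /* update the eviction policy, possibly evict */ }
}
```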

2. Do you plan on revisiting the 2Q policy and considering alternatives? TinyLFU is very simple to implement and provides a superior hit rate.

3. What are the pros/cons of a custom implementation vs. using Caffeine directly? I can imagine some lifecycle differences, a desire to own the code, etc. Hopefully you can take advantage of pulling code out of Caffeine where advantageous.

Let me know if I can be of help.

andrii0lomakin commented 7 years ago

Hi @ben-manes , thank you very much for the feedback.

About your questions.

Will the "operations buffer" be used for read+write or only reads?

We plan to use a separate queue to log additions of new entries. It would be cool to use MpscChunkedArrayQueue as you suggest, but we try to keep the number of dependencies to a minimum, so I am not sure it will be approved.

Do you plan on revisiting the 2Q policy and considering alternatives?

Not yet. We tried LIRS and TinyLFU: the LIRS results were even worse than 2Q, and TinyLFU's cache hit rate was the same as 2Q's. Maybe we missed something. We are going to create a tool to gather traces of cache accesses from production deployments and will then rerun our tests.

What are the pros/cons of a custom implementation vs. using Caffeine directly?

Maybe I am wrong, but Caffeine does not support acquire/release semantics. I mean that if I cache#get() data and it is in use, it may still be removed from the cache. But in our design the cache is the single source of pages for all DB components; it also tracks dirty pages and flushes them to disk when they are "released" back to the cache. So pages cannot be evicted while they are acquired from the cache and used by components.

ben-manes commented 7 years ago

We are going to create a tool to gather traces of cache accesses from production deployments and will then rerun our tests.

This would be great. I'd appreciate it if you can supply traces so that I can digest them with my simulator. It operates on long keys (no values) so it is easy to obfuscate using a hash.

Caffeine does not support acquire/release semantics

Yes, this is not a use-case it was designed for, as it is not a general-purpose scenario. It can be emulated in awkward ways, which is good enough for limited cases but perhaps not for yours.

The first way is to use a CacheWriter to capture the entry when incorrectly evicted. The writer would add it into a map and asynchronously reinsert it. A CacheLoader would load from the map prior to the system-of-record. This latter part avoids a race, but assumes you operate only by get and not getIfPresent.

The second way is to use zero weights to disqualify the entry from size eviction. However, that has to be done carefully as weighing occurs outside of any critical section, so you are prone to race conditions on rapid acquire/release. You'd have to lock externally around the write operation (e.g. Striped) to avoid a stale weight.
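
A rough sketch of that second workaround, assuming Caffeine's weigher API and a hypothetical Page type with an isAcquired() flag:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

final class ZeroWeightSketch {
  static Cache<Long, Page> build() {
    return Caffeine.newBuilder()
        .maximumWeight(100_000)
        // Acquired pages weigh 0, so size-based eviction cannot select them.
        .weigher((Long key, Page page) -> page.isAcquired() ? 0 : 1)
        .build();
  }

  // Weights are only re-read on writes, so acquire/release must re-put the entry
  // under an external lock (e.g. Striped) to republish the new weight safely.
  interface Page {
    boolean isAcquired();
  }
}
```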

But the fact that it might be hacked on may not be good enough, so a custom implementation is very reasonable.

andrii0lomakin commented 7 years ago

The only open question I see is when to free direct memory pointers after they are removed during cache entry eviction. Because of the step

If a requested page is in this state, it should be deleted from the cache (in this way we help the eviction process remove it), but the content of the page is copied and inserted back into the cache in the "released" state, unless the page has already been inserted by another thread. This trick is needed to keep the thread-safety guarantees during page eviction

we may access a page which has already been reclaimed to the pool of pages. Here we may use one of three strategies (we will check later which one is the best): quiescent-state-based reclamation, epoch-based reclamation, or safe memory reclamation.
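
As a rough illustration of one of these strategies, a minimal epoch-based reclamation sketch (all names are hypothetical; this is not a chosen design):

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical epoch-based reclamation: a pointer retired in epoch E may be
// freed once no reader is still inside an epoch <= E.
final class EpochReclaimerSketch {
  private static final long IDLE = Long.MAX_VALUE;

  private final AtomicLong globalEpoch = new AtomicLong();
  private final CopyOnWriteArrayList<AtomicLong> readerEpochs = new CopyOnWriteArrayList<>();
  private final ConcurrentLinkedQueue<long[]> retired = new ConcurrentLinkedQueue<>();

  // Each reader thread registers one slot; IDLE means "outside a critical section".
  AtomicLong registerReader() {
    AtomicLong slot = new AtomicLong(IDLE);
    readerEpochs.add(slot);
    return slot;
  }

  void enter(AtomicLong slot) { slot.set(globalEpoch.get()); }

  void exit(AtomicLong slot) { slot.set(IDLE); }

  // Called by the evictor after the entry has left the hash map.
  void retire(long pointer) {
    retired.add(new long[] {globalEpoch.incrementAndGet(), pointer});
  }

  // Called by a single thread at a time (e.g. under the flush tryLock).
  void reclaim() {
    long oldestInUse = IDLE;
    for (AtomicLong slot : readerEpochs) {
      oldestInUse = Math.min(oldestInUse, slot.get());
    }
    for (long[] entry; (entry = retired.peek()) != null && entry[0] < oldestInUse; ) {
      retired.poll();
      freeNative(entry[1]); // return the direct-memory page to the pool
    }
  }

  private void freeNative(long pointer) { /* free the off-heap allocation */ }
}
```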

ben-manes commented 7 years ago

You might also leverage phantom references, either to always defer or as a failsafe, to release native resources when the page is garbage collected.
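
For instance, a minimal failsafe sketch using java.lang.ref.Cleaner (Java 9+; on Java 8 a PhantomReference plus a ReferenceQueue achieves the same; freeNative is a hypothetical pool call):

```java
import java.lang.ref.Cleaner;

// Hypothetical wrapper: the native pointer is freed eagerly on close(), and the
// Cleaner acts as a failsafe if the Java object is garbage collected first.
final class NativePageHandle implements AutoCloseable {
  private static final Cleaner CLEANER = Cleaner.create();

  private final long pointer;
  private final Cleaner.Cleanable cleanable;

  NativePageHandle(long pointer) {
    this.pointer = pointer;
    // The cleanup action must not capture `this`, or the handle never becomes unreachable.
    long p = pointer;
    this.cleanable = CLEANER.register(this, () -> freeNative(p));
  }

  long pointer() { return pointer; }

  @Override
  public void close() {
    cleanable.clean(); // idempotent: runs the action at most once
  }

  private static void freeNative(long pointer) { /* return the page to the off-heap pool */ }
}
```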

andrii0lomakin commented 7 years ago

The latest log from the YCSB tests:

2016-12-23 09:35:25:273 2410 sec: 90633290 operations; 54936.6 current ops/sec; est completion in 4 minutes [READ: Count=549369, Max=287743, Min=22, Avg=144.16, 90=318, 99=491, 99.9=909, 99.99=2747]
2016-12-23 09:35:42:196 2426 sec: 90872329 operations; 14124.26 current ops/sec; est completion in 4 minutes [READ: Count=239034, Max=12075007, Min=23, Avg=312.36, 90=318, 99=487, 99.9=772, 99.99=2703]
2016-12-23 09:35:45:273 2430 sec: 91052631 operations; 58615.73 current ops/sec; est completion in 3 minutes [READ: Count=180305, Max=12075007, Min=20, Avg=469.78, 90=318, 99=497, 99.9=803, 99.99=2835]
2016-12-23 09:35:55:273 2440 sec: 91598434 operations; 54580.3 current ops/sec; est completion in 3 minutes [READ: Count=545810, Max=270335, Min=23, Avg=145.09, 90=318, 99=480, 99.9=674, 99.99=2493]
2016-12-23 09:36:05:273 2450 sec: 92137313 operations; 53887.9 current ops/sec; est completion in 3 minutes [READ: Count=538872, Max=210943, Min=23, Avg=146.95, 90=319, 99=489, 99.9=746, 99.99=10191]
2016-12-23 09:36:15:273 2460 sec: 92690012 operations; 55269.9 current ops/sec; est completion in 3 minutes [READ: Count=552697, Max=251007, Min=21, Avg=143.29, 90=316, 99=482, 99.9=686, 99.99=2443]
2016-12-23 09:36:25:273 2470 sec: 93236418 operations; 54640.6 current ops/sec; est completion in 3 minutes [READ: Count=546407, Max=329983, Min=23, Avg=144.91, 90=318, 99=484, 99.9=688, 99.99=2595]
2016-12-23 09:36:35:273 2480 sec: 93760678 operations; 52426 current ops/sec; est completion in 2 minutes [READ: Count=524263, Max=361983, Min=21, Avg=151.11, 90=317, 99=482, 99.9=683, 99.99=2645]
2016-12-23 09:36:45:273 2490 sec: 94291130 operations; 53045.2 current ops/sec; est completion in 2 minutes [READ: Count=530450, Max=348159, Min=23, Avg=149.32, 90=317, 99=484, 99.9=695, 99.99=2413]
2016-12-23 09:36:55:273 2500 sec: 94841547 operations; 55041.7 current ops/sec; est completion in 2 minutes [READ: Count=550417, Max=235135, Min=20, Avg=143.49, 90=317, 99=483, 99.9=691, 99.99=2609]
2016-12-23 09:37:05:435 2510 sec: 95398626 operations; 54814.42 current ops/sec; est completion in 2 minutes [READ: Count=557081, Max=244479, Min=23, Avg=144.45, 90=317, 99=482, 99.9=669, 99.99=2505]
2016-12-23 09:37:15:273 2520 sec: 95939999 operations; 55034.36 current ops/sec; est completion in 1 minutes [READ: Count=541371, Max=317695, Min=22, Avg=143.89, 90=318, 99=481, 99.9=663, 99.99=1798]
2016-12-23 09:37:25:273 2530 sec: 96468361 operations; 52836.2 current ops/sec; est completion in 1 minutes [READ: Count=528366, Max=327935, Min=23, Avg=149.92, 90=318, 99=484, 99.9=690, 99.99=2633]
2016-12-23 09:37:35:273 2540 sec: 97000192 operations; 53183.1 current ops/sec; est completion in 1 minutes [READ: Count=531829, Max=322815, Min=23, Avg=148.94, 90=317, 99=478, 99.9=678, 99.99=2557]
2016-12-23 09:37:45:503 2550 sec: 97556236 operations; 54348.94 current ops/sec; est completion in 1 minutes [READ: Count=556041, Max=251647, Min=23, Avg=143.44, 90=317, 99=485, 99.9=684, 99.99=2393]
2016-12-23 09:37:55:273 2560 sec: 98096151 operations; 55268.2 current ops/sec; est completion in 50 seconds [READ: Count=539916, Max=257919, Min=23, Avg=145.59, 90=318, 99=486, 99.9=691, 99.99=2551]
2016-12-23 09:38:05:273 2570 sec: 98636877 operations; 54072.6 current ops/sec; est completion in 36 seconds [READ: Count=540729, Max=263679, Min=23, Avg=146.46, 90=318, 99=482, 99.9=686, 99.99=2215]
2016-12-23 09:38:15:272 2580 sec: 99169976 operations; 53309.9 current ops/sec; est completion in 22 seconds [READ: Count=533094, Max=611839, Min=23, Avg=148.58, 90=318, 99=484, 99.9=700, 99.99=2567]
2016-12-23 09:38:32:321 2597 sec: 99388106 operations; 12794.3 current ops/sec; est completion in 16 seconds [READ: Count=218130, Max=12304383, Min=23, Avg=398.29, 90=316, 99=480, 99.9=694, 99.99=2527]
2016-12-23 09:38:35:273 2600 sec: 99562485 operations; 59091.49 current ops/sec; est completion in 12 seconds [READ: Count=174377, Max=12304383, Min=17, Avg=416.08, 90=317, 99=496, 99.9=723, 99.99=1660]
2016-12-23 09:38:45:273 2610 sec: 99986822 operations; 42433.7 current ops/sec; est completion in 1 seconds [READ: Count=424335, Max=285183, Min=20, Avg=139.32, 90=308, 99=460, 99.9=650, 99.99=1772][CLEANUP: Count=7, Max=3, Min=1, Avg=1.86, 90=2, 99=3, 99.9=3, 99.99=3]

So, as you can see, read performance is generally high, but because of the presence of a global lock on cache eviction we have periodic slowdowns. This problem should be fixed by the proposed change.

andrii0lomakin commented 5 years ago

Implemented in versions 3.0.15 and 3.1.

ben-manes commented 5 years ago

Oh cool. Can you point me to the code? Curious to see what you built.

andrii0lomakin commented 5 years ago

@ben-manes Yes, sure. https://github.com/orientechnologies/orientdb/blob/develop/core/src/main/java/com/orientechnologies/orient/core/storage/cache/chm/AsyncReadCache.java .

ben-manes commented 5 years ago

Thanks! Glancing through it, a few ideas that you might find interesting.

  1. You are using a 3-state drain status, like I did for a long time. I eventually moved to four states to improve scheduling (see https://github.com/ben-manes/caffeine/issues/57).
  2. You have a large window (eden) at 20%, probably because you observed performance issues at 1%, which is too LFU-biased. We have a new paper that solves this (free ACM link in Caffeine's readme), since no static configuration is ideal. We explored adaptive approaches, and in this branch I have hill climbing integrated. That very quickly optimizes towards the best configuration, making the policy robust to any workload we throw at it.
  3. There is a HashDoS attack vector in TinyLFU. By exploiting hash collisions, one could artificially raise the victim's frequency so that no items are admitted into the main space. We solve that in Caffeine by introducing a small amount of jitter.

Otherwise looks great :)

andrii0lomakin commented 5 years ago

@ben-manes . Super cool. Thank you very much for such feedback.

andrii0lomakin commented 5 years ago

Guys, I will reopen the issue to incorporate further improvements.

ben-manes commented 5 years ago

Here are some fun examples of the adaptivity correcting the configuration. Corda is recency-skewed and starts with a small window (LFU). Loop is frequency-skewed and starts with a large window (LRU). [two attached charts show the window size adapting over each trace]