tikv / raft-engine

A persistent storage engine for Multi-Raft log
Apache License 2.0

support memory swap out #211

Closed tabokie closed 2 years ago

tabokie commented 2 years ago

Signed-off-by: tabokie xy.tao@outlook.com

Ref #205

An allocator that switches to file-based mmap when memory usage rises above a certain watermark.
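
A minimal sketch of the watermark idea, assuming nightly Rust (`allocator_api`) plus the `memmap2` and `tempfile` crates; `SwappySketch` and its fields are made-up illustrations, not the PR's actual `SwappyAllocator`:

```rust
#![feature(allocator_api)]

use std::alloc::{AllocError, Allocator, Global, Layout};
use std::collections::HashMap;
use std::ptr::NonNull;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

use memmap2::MmapMut;

/// Watermark-switching allocator sketch: heap below the budget, file-backed mmap above it.
pub struct SwappySketch {
    budget: usize,                        // memory watermark in bytes
    in_memory: AtomicUsize,               // bytes currently served from the heap
    maps: Mutex<HashMap<usize, MmapMut>>, // pointer address -> backing mapping
    dir: tempfile::TempDir,               // directory holding the swap files
}

impl SwappySketch {
    pub fn new(budget: usize) -> std::io::Result<Self> {
        Ok(Self {
            budget,
            in_memory: AtomicUsize::new(0),
            maps: Mutex::new(HashMap::new()),
            dir: tempfile::tempdir()?,
        })
    }
}

unsafe impl Allocator for SwappySketch {
    fn allocate(&self, layout: Layout) -> Result<NonNull<[u8]>, AllocError> {
        if self.in_memory.load(Ordering::Relaxed) + layout.size() <= self.budget {
            // Fast path: plain heap allocation, tracked against the budget.
            self.in_memory.fetch_add(layout.size(), Ordering::Relaxed);
            return Global.allocate(layout);
        }
        // Slow path: back the allocation with a temp file and hand out a pointer into
        // its mapping. Assumes layout.align() <= page size, which holds for entry indexes.
        let file = tempfile::tempfile_in(self.dir.path()).map_err(|_| AllocError)?;
        file.set_len(layout.size().max(1) as u64).map_err(|_| AllocError)?;
        let mut map = unsafe { MmapMut::map_mut(&file) }.map_err(|_| AllocError)?;
        let ptr = NonNull::new(map.as_mut_ptr()).ok_or(AllocError)?;
        self.maps.lock().unwrap().insert(ptr.as_ptr() as usize, map);
        Ok(NonNull::slice_from_raw_parts(ptr, layout.size()))
    }

    unsafe fn deallocate(&self, ptr: NonNull<u8>, layout: Layout) {
        // Swapped allocations are freed by dropping their mapping.
        if self.maps.lock().unwrap().remove(&(ptr.as_ptr() as usize)).is_some() {
            return;
        }
        self.in_memory.fetch_sub(layout.size(), Ordering::Relaxed);
        unsafe { Global.deallocate(ptr, layout) };
    }
}
```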

Micro-benchmarks:

test swappy_allocator::tests::bench_allocator_fast_path     ... bench:         295 ns/iter (+/- 165)
test swappy_allocator::tests::bench_allocator_slow_path     ... bench:      92,075 ns/iter (+/- 12,995)
test swappy_allocator::tests::bench_allocator_std_global    ... bench:         252 ns/iter (+/- 41)

buffered write (./target/release/stress --path /data3/re_bench --time 120 --regions 1000 --write-threads 1 --write-without-sync):

master:
Throughput(QPS) = 17314.75
Latency(μs) min = 19, avg = 41.10, p50 = 31, p90 = 49, p95 = 57, p99 = 81, p99.9 = 2289, max = 5151
Fairness = 100.0%
Write Bandwidth = 181.3MiB/s

swap memory-limit=+inf:
Throughput(QPS) = 17458.37
Latency(μs) min = 18, avg = 40.93, p50 = 31, p90 = 48, p95 = 56, p99 = 79, p99.9 = 2241, max = 5663
Fairness = 100.0%
Write Bandwidth = 182.8MiB/s

swap memory-limit=0:
Throughput(QPS) = 11312.92
Latency(μs) min = 17, avg = 71.75, p50 = 40, p90 = 64, p95 = 77, p99 = 133, p99.9 = 7819, max = 50303
Fairness = 100.0%
Write Bandwidth = 118.4MiB/s

write (./target/release/stress --path /data3/re_bench --time 120 --regions 1000 --write-threads 1 --purge-interval 0 --compact-count 0):

master:
Throughput(QPS) = 7010.16
Latency(μs) min = 77, avg = 123.41, p50 = 117, p90 = 153, p95 = 166, p99 = 201, p99.9 = 991, max = 4057
Fairness = 100.0%
Write Bandwidth = 73.4MiB/s

swap memory-limit=0:
Throughput(QPS) = 4748.54
Latency(μs) min = 85, avg = 191.08, p50 = 169, p90 = 256, p95 = 291, p99 = 520, p99.9 = 2339, max = 29887
Fairness = 100.0%
Write Bandwidth = 49.7MiB/s

swap memory-limit=1GB:
Throughput(QPS) = 6270.32
Latency(μs) min = 80, avg = 140.46, p50 = 124, p90 = 178, p95 = 201, p99 = 266, p99.9 = 2743, max = 23407
Fairness = 100.0%
Write Bandwidth = 65.6MiB/s

Test results on an NVMe disk:

codecov[bot] commented 2 years ago

Codecov Report

Merging #211 (76616d3) into master (563a60d) will increase coverage by 0.29%. The diff coverage is 99.34%.

@@            Coverage Diff             @@
##           master     #211      +/-   ##
==========================================
+ Coverage   96.95%   97.25%   +0.29%     
==========================================
  Files          28       29       +1     
  Lines        7724     8741    +1017     
==========================================
+ Hits         7489     8501    +1012     
- Misses        235      240       +5     
| Impacted Files | Coverage Δ |
| --- | --- |
| src/lib.rs | 100.00% <ø> (ø) |
| tests/failpoints/mod.rs | 100.00% <ø> (ø) |
| src/swappy_allocator.rs | 99.23% <99.23%> (ø) |
| src/config.rs | 96.49% <100.00%> (+0.37%) ↑ |
| src/engine.rs | 97.21% <100.00%> (ø) |
| src/memtable.rs | 99.15% <100.00%> (+0.03%) ↑ |
| src/metrics.rs | 97.65% <100.00%> (+0.07%) ↑ |
| src/purge.rs | 97.43% <100.00%> (ø) |
| src/util.rs | 88.85% <100.00%> (+0.86%) ↑ |
| tests/failpoints/test_io_error.rs | 100.00% <100.00%> (ø) |


BusyJay commented 2 years ago

Security is not considered in this PR. All writes to disk should be encrypted if encryption is enabled.

tabokie commented 2 years ago

Security is not considered in this PR. All writes to disk should be encrypted if encryption is enabled.

I have thought about that, and I think it's not a problem because the entry index only contains file offsets, not user data.
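
For reference, the swapped-out index is roughly shaped like this (field names are illustrative, not the exact raft-engine definitions); it records where an entry lives, never the entry payload:

```rust
// Illustrative only: the swapped-out index records entry locations, not user data.
struct EntryIndexSketch {
    file_id: u64,         // which log file holds the entry
    entries_offset: u64,  // offset of the entry block within that file
    offset_in_block: u64, // offset of this entry inside the (possibly compressed) block
    len: u64,             // encoded length of the entry
}
```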

BusyJay commented 2 years ago

I built a POC that offloads only part of the entries to disk instead of all of them. Compared to this PR, it has the following advantages:

  1. Page management is very simple; it gets rid of all the complexity brought by the allocator.
  2. It's expected to offload applied entries only, so it can optimize under the assumption that offloaded entries will not be changed (rewritten). Not all of these optimizations are implemented yet, though.
  3. Accessing active entries, such as uncommitted or unapplied entries, won't trigger disk access, so it won't add extra IO in the raftstore thread in most cases, even when there is a large log lag.
  4. It takes advantage of the POSIX file system behavior that a process can still write to a file after it has been deleted, so it's slightly safer.
  5. It compiles with the stable compiler and is configurable.

It has one disadvantage, though: it can't give precise control over memory. However, because it offloads all applied entry indexes to disk, only a limited number of indexes should remain in memory. It could also choose to offload all committed entry indexes, which would keep at most 100MiB in memory.
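
For concreteness, here is one way the backlog-index idea described above could look. This is a hedged sketch under assumed names (`RegionIndex`, `IndexEntry`, `offload_applied` are not the actual POC code): only the active tail of a region's entry index stays in memory, while the applied prefix is appended to an on-disk backlog, relying on the assumption that applied entries won't change.

```rust
use std::collections::VecDeque;
use std::fs::File;
use std::io::{self, Seek, SeekFrom, Write};

struct IndexEntry {
    file_id: u64,
    offset: u64,
    len: u64,
}

struct RegionIndex {
    backlog: File,                // applied entry indexes, spilled to disk
    backlog_len: u64,             // number of records spilled so far
    active: VecDeque<IndexEntry>, // unapplied/uncommitted indexes stay in memory
    active_first: u64,            // raft index of the first in-memory entry
}

impl RegionIndex {
    /// Spill indexes of entries up to and including `applied` to the backlog file.
    fn offload_applied(&mut self, applied: u64) -> io::Result<()> {
        while self.active_first <= applied {
            let Some(e) = self.active.pop_front() else { break };
            self.backlog.seek(SeekFrom::End(0))?;
            // Fixed-width records, so a lookup can seek directly by raft index.
            self.backlog.write_all(&e.file_id.to_le_bytes())?;
            self.backlog.write_all(&e.offset.to_le_bytes())?;
            self.backlog.write_all(&e.len.to_le_bytes())?;
            self.backlog_len += 1;
            self.active_first += 1;
        }
        Ok(())
    }
}
```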

tabokie commented 2 years ago

I built a POC that offloads only part of the entries to disk instead of all of them. Compared to this PR, it has the following advantages:

There are three weeks before code freeze. I don't trust that the quality of your bloated data structure can be well tested in that time. My allocator, due to its inherent simplicity (both in design and in its interactions with other components), is already well tested (1000 LOC of tests plus all existing tests).

I agree with some of your points in general, but:

  1. Your code is not simple. I think anyone can see that.
  2. The assumption is wrong. As I pointed out in a previous comment, most read IOs are caused by the rewrite process and hit primarily cold entries (applied entries), and your code will be even more complex if it has to support that.

The points that I do agree with are also not that important. Performance (especially read performance) under memory throttling has never been a priority. Yes, better performance is good to have, but only if I'm comfortable with the cost. But looking at your code, I'm not comfortable.

BusyJay commented 2 years ago
1. Your code is **not** simple. I think anyone can see that.

I don't see why the code is not simple. Most of the complexity comes from making BacklogIndex and the in-memory index behave as a unified interface, and that part can be well tested by unit tests.

As I pointed out in a previous comment, most read IOs are caused by the rewrite process and hit primarily cold entries (applied entries), and your code will be even more complex if it has to support that.

So why can applied entries be rewritten at all? Applied means they are immutable; any rewrite is wrong.

But looking at your code, I'm not comfortable.

That's the same feeling I have about the allocator. I strongly object to using an allocator.

tabokie commented 2 years ago

I don't see why the code is not simple.

You said it yourself: "behave as a unified interface", which echoes my comment emphasizing that implementing our own VecDeque is complex. It's not the interface that is complex, of course; it's the internal states and branches, and that complexity cannot be hidden. Last time, I spent half a month of intensive work reasoning about and covering nearly every conditional branch in this codebase.

The allocator is simple because it doesn't tamper with any of the data structures or their interactions. Its methods adhere to a strict calling protocol, and the possible combinations of method calls are finite.
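
To illustrate the calling-protocol point: with nightly's unstable `allocator_api`, an allocator plugs into standard collections without changing them, so the surface to test is essentially `allocate`/`deallocate` (plus the default `grow`/`shrink`). A hypothetical usage sketch, reusing the made-up `SwappySketch` type from the earlier snippet:

```rust
#![feature(allocator_api)]

use std::collections::VecDeque;

// `SwappySketch` is the hypothetical allocator sketched earlier in this thread.
fn demo(alloc: &SwappySketch) {
    // The collection itself is unmodified; it only issues allocator calls.
    let mut q: VecDeque<u64, &SwappySketch> = VecDeque::new_in(alloc);
    q.push_back(42);
    assert_eq!(q.pop_front(), Some(42));
}
```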

So why can applied entries be rewritten at all? Applied means they are immutable; any rewrite is wrong.

The rewrite logic hasn't changed much since I took over this repo, and I suspect an alternative implementation could do away with compactions.

That's the same feeling I have about the allocator. I strongly object to using an allocator.

I guess we can't persuade anyone with more discussions. You can put it to a vote if necessary.

BusyJay commented 2 years ago

The rewrite logic hasn't changed much since I took over this repo, and I suspect an alternative implementation could do away with compactions.

Now I see your point. Instead of introducing a backlog index, indexes may need to be tracked per log file instead of per region. That is, a region maintains a list of the log files that contain its entries, and every log file keeps a queue of region indexes, so that a rewrite just updates the file list. It's certainly a big refactor, though.

tabokie commented 2 years ago

Now I see your point. Instead of introducing a backlog index, indexes may need to be tracked per log file instead of per region. That is, a region maintains a list of the log files that contain its entries, and every log file keeps a queue of region indexes, so that a rewrite just updates the file list. It's certainly a big refactor, though.

So, a more LSM-like approach. With a proper block cache it might work better than the current implementation, and it would cut a lot of complexity out of the memtable and make it easier to maintain. I'll file an issue later.
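
A rough sketch of how that file-oriented index might be laid out (purely illustrative; none of these types exist in raft-engine): each log file owns the index of the entries it contains, and a region only keeps the list of files that still hold its data, so rewrite/purge becomes an update of that file list.

```rust
use std::collections::{BTreeMap, HashMap};

struct EntryLocation {
    offset: u64,
    len: u64,
}

// Each log file tracks, per region, where that region's entries live inside it.
struct LogFileIndex {
    regions: HashMap<u64 /* region id */, BTreeMap<u64 /* raft index */, EntryLocation>>,
}

// A region only remembers which files still contain its entries; rewriting a
// region amounts to swapping file ids in this list, much like an LSM manifest edit.
struct RegionHandle {
    files: Vec<u64 /* file id */>,
}
```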