Testing fast-path overhead with the naive version (per-CPU counter and conditional branch).
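As a reference point, here is a minimal sketch of what such a naive gate could look like (illustrative only: the counter name, the re-arm value, and `kfence_guarded_alloc()` are made up for this sketch, not the actual patch):

```c
#include <linux/percpu.h>
#include <linux/slab.h>

/* Illustrative: skip this many allocations between guarded ones. */
#define KFENCE_SKIP_COUNT 100

void *kfence_guarded_alloc(struct kmem_cache *s, size_t size, gfp_t flags); /* assumed */

static DEFINE_PER_CPU(long, kfence_skip_counter) = KFENCE_SKIP_COUNT;

static __always_inline void *kfence_alloc(struct kmem_cache *s, size_t size,
					  gfp_t flags)
{
	/* Fast path: one per-CPU decrement and one conditional branch. */
	if (likely(this_cpu_dec_return(kfence_skip_counter) > 0))
		return NULL;	/* caller falls back to the regular slab path */

	this_cpu_write(kfence_skip_counter, KFENCE_SKIP_COUNT);	/* re-arm */
	return kfence_guarded_alloc(s, size, flags);	/* assumed slow path */
}
```

Even when no guarded allocation happens, every allocation pays for the decrement and branch, which is the overhead being measured below.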
Parameters:
sysbench --test=fileio --file-total-size=1G --file-num=1024 prepare
ulimit -n 100000
sysbench --num-threads=8 --test=fileio --file-total-size=1G --file-num=1024 --file-test-mode=rndrw --max-time=300 --max-requests=0 run
With that we get:
KFENCE:
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 8
Extra file open flags: 0
1024 files, 1Mb each
1Gb total file size
Block size 16Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Threads started!
Time limit exceeded, exiting...
(last message repeated 7 times)
Done.
Operations performed: 1411735 Read, 941160 Write, 23863572 Other = 26216467 Total
Read 21.541Gb Written 14.361Gb Total transferred 35.902Gb (122.55Mb/sec)
7842.98 Requests/sec executed
Test execution summary:
    total time:                          300.0002s
    total number of events:              2352895
    total time taken by event execution: 95.0336
    per-request statistics:
        min:                    0.00ms
        avg:                    0.04ms
        max:                    6.47ms
        approx. 95 percentile:  0.10ms

Threads fairness:
    events (avg/stddev):         294111.8750/1356.42
    execution time (avg/stddev): 11.8792/0.06
Baseline:
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 8
Extra file open flags: 0
1024 files, 1Mb each
1Gb total file size
Block size 16Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Threads started!
Time limit exceeded, exiting...
(last message repeated 7 times)
Done.
Operations performed: 1455711 Read, 970478 Write, 24606888 Other = 27033077 Total
Read 22.212Gb Written 14.808Gb Total transferred 37.021Gb (126.36Mb/sec)
8087.29 Requests/sec executed
Test execution summary:
    total time:                          300.0002s
    total number of events:              2426189
    total time taken by event execution: 98.9264
    per-request statistics:
        min:                    0.00ms
        avg:                    0.04ms
        max:                    6.66ms
        approx. 95 percentile:  0.10ms

Threads fairness:
    events (avg/stddev):         303273.6250/1345.12
    execution time (avg/stddev): 12.3658/0.03
Conclusion: the naive version as-is costs ~3% in throughput (7842.98 vs. 8087.29 requests/sec).
Results using static keys, with a sample rate of 100ms:
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 8
Extra file open flags: 0
1024 files, 1Mb each
1Gb total file size
Block size 16Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Threads started!
Time limit exceeded, exiting...
(last message repeated 7 times)
Done.
Operations performed: 1452793 Read, 968532 Write, 24523965 Other = 26945290 Total
Read 22.168Gb Written 14.779Gb Total transferred 36.946Gb (126.11Mb/sec)
8071.08 Requests/sec executed
Test execution summary:
    total time:                          300.0002s
    total number of events:              2421325
    total time taken by event execution: 99.5015
    per-request statistics:
        min:                    0.00ms
        avg:                    0.04ms
        max:                    7.43ms
        approx. 95 percentile:  0.10ms

Threads fairness:
    events (avg/stddev):         302665.6250/1652.29
    execution time (avg/stddev): 12.4377/0.05
Conclusion: the slowdown is negligible (8071.08 vs. 8087.29 requests/sec, ~0.2%).
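For context, the static-keys variant measured here works roughly like the following sketch (simplified, with assumed names such as `kfence_guarded_alloc()`; the real patch differs in details): the fast-path branch is patched out via a static key almost all of the time, and a delayed work item re-enables it once per sample period (100ms above).

```c
#include <linux/atomic.h>
#include <linux/jiffies.h>
#include <linux/jump_label.h>
#include <linux/workqueue.h>

void *kfence_guarded_alloc(struct kmem_cache *s, size_t size, gfp_t flags); /* assumed */

static DEFINE_STATIC_KEY_FALSE(kfence_allocation_key);
static atomic_t kfence_allocation_gate = ATOMIC_INIT(1);
static unsigned long kfence_sample_interval = 100;	/* ms; cf. kfence.sample_rate */

static __always_inline void *kfence_alloc(struct kmem_cache *s, size_t size,
					  gfp_t flags)
{
	/* While the key is disabled, this is a patched-out jump: ~zero cost. */
	if (!static_branch_unlikely(&kfence_allocation_key))
		return NULL;
	/* Gate is open: at most one allocation wins the race to use it. */
	if (atomic_cmpxchg(&kfence_allocation_gate, 0, 1) != 0)
		return NULL;
	return kfence_guarded_alloc(s, size, flags);	/* assumed slow path */
}

static void toggle_allocation_gate(struct work_struct *work);
static DECLARE_DELAYED_WORK(kfence_timer, toggle_allocation_gate);

static void toggle_allocation_gate(struct work_struct *work)
{
	atomic_set(&kfence_allocation_gate, 0);
	static_branch_enable(&kfence_allocation_key);	/* open the gate */
	/* ... wait until an allocation has claimed the gate ... */
	static_branch_disable(&kfence_allocation_key);	/* close it again */
	schedule_delayed_work(&kfence_timer,
			      msecs_to_jiffies(kfence_sample_interval));
}
```

The point of the design is that between samples the allocation fast path contains no loads or stores at all, only a no-op that the static-key machinery patches in and out.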
What I saw when benchmarking on Android was that the results of several benchmark runs were distributed around a single mean as long as I didn't reboot the device, but the mean shifted once I repeated the measurements after a reboot. So we'll probably need to perform the measurements multiple times, with reboots in between, before reporting performance numbers along with the upstream patches.
sysbench --test=fileio --file-total-size=1G --file-num=1024 prepare
ulimit -n 100000
sysbench --num-threads=8 --test=fileio --file-total-size=1G --file-num=1024 --file-test-mode=rndrw --max-time=60 --max-requests=0 run
Baseline (no KFENCE):
[ 1] 8151.29 Requests/sec
[ 2] 8410.38 Requests/sec
[ 3] 7906.54 Requests/sec
[ 4] 7964.18 Requests/sec
[ 5] 7944.54 Requests/sec
[med] 7964.18 Requests/sec
KFENCE (static keys, skip guarded allocations, kfence.sample_rate=100):
[ 1] 8001.17 Requests/sec
[ 2] 7744.11 Requests/sec
[ 3] 7885.21 Requests/sec
[ 4] 7856.12 Requests/sec
[ 5] 7872.87 Requests/sec
[med] 7872.87 Requests/sec
KFENCE (static keys, with guarded allocations, kfence.sample_rate=100):
[ 1] 7940.12 Requests/sec
[ 2] 7869.42 Requests/sec
[ 3] 7814.59 Requests/sec
[ 4] 7986.49 Requests/sec
[ 5] 7917.77 Requests/sec
[med] 7917.77 Requests/sec
Comparing medians, we observe a performance degradation of ~0.58% for the guarded-allocations configuration ((7964.18 − 7917.77) / 7964.18).
Why is "skip guarded allocations" slower than "with guarded allocations"? I would assume allocating protected pages should slow down things. What am I missing?
But it's good we can sense the overhead with the benchmark. Please estimate at what rate the overhead becomes negligible. I think in real life we can do 100-1000x lower sampling rate, I think it's important to be able to say that above X rate, there is no noticeable overhead.
Why is "skip guarded allocations" slower than "with guarded allocations"? I would assume allocating protected pages should slow down things. What am I missing?
I think there is a high variance, and 5 samples might not be enough.
> But it's good that we can sense the overhead with the benchmark. Please estimate at what rate the overhead becomes negligible. I think in real life we can use a 100-1000x lower sampling rate, so it's important to be able to say that above rate X there is no noticeable overhead.

I've noticed that this doesn't change much with e.g. 1000ms (instead of 100ms). I think that's because we still have some overhead elsewhere that isn't due to static-branch switching. We currently think it's the kfree() fast path.
By the way, will you have a chance to benchmark the stealing impl in its current state? I am curious whether this approach was viable or not.
> By the way, will you have a chance to benchmark the stealing impl in its current state?
KFENCE (stealing, kfence.sample_rate=100):
[ 1] 7815.89 Requests/sec
[ 2] 7776.86 Requests/sec
[ 3] 7708.77 Requests/sec
[ 4] 7702.80 Requests/sec
[ 5] 7787.02 Requests/sec
[med] 7776.86 Requests/sec
Comparing medians, stealing is ~1.8% slower than the static-keys version (7776.86 vs. 7917.77 requests/sec).
After the kfree() optimization in https://github.com/google/kasan/commit/18967a22c724fdf54489997ecc2e2d7a6aa4cd9e:
KFENCE (static keys, with guarded allocations, kfence.sample_rate=100):
[ 1] 7999.14 Requests/sec
[ 2] 7952.49 Requests/sec
[ 3] 8045.04 Requests/sec
[ 4] 7976.90 Requests/sec
[ 5] 8141.59 Requests/sec
[med] 7999.14 Requests/sec
=> This matches the baseline without KFENCE very closely; the small remaining difference can be attributed to variance.
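The kind of check that commit targets can be sketched as follows (an illustration under assumptions, not the commit itself; the pool symbol, pool size, and caller are placeholders): since all KFENCE objects come from one contiguous pool, deciding whether a free must take the KFENCE path can cost a single unsigned range comparison on the fast path.

```c
#include <linux/mm.h>

extern char *__kfence_pool;			/* assumed: base of the KFENCE pool */
#define KFENCE_POOL_SIZE (255 * 2 * PAGE_SIZE)	/* illustrative size */

void kfence_free(void *ptr);			/* assumed KFENCE slow-path free */

static __always_inline bool is_kfence_address(const void *addr)
{
	/*
	 * The unsigned subtraction folds the lower and upper bounds
	 * checks into a single compare-and-branch, which is all that
	 * frees of regular (non-KFENCE) objects have to pay.
	 */
	return unlikely((unsigned long)((char *)addr - __kfence_pool) <
			KFENCE_POOL_SIZE);
}

void kfree_sketch(void *ptr)	/* hypothetical caller, not kfree() itself */
{
	if (is_kfence_address(ptr)) {
		kfence_free(ptr);
		return;
	}
	/* ... regular slab free ... */
}
```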
We need to re-benchmark the final version anyway, so the results here won't matter much.
However, for the sake of this issue, we have benchmarked the implementations.
Before ruling out the naive implementation that modifies the fast path, let's benchmark it and make a decision based on data. We'll likely need this data one way or another.
Some benchmarks to run: