Testing fast-path overhead with the naive version (per-CPU counter and conditional branch).
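As a reference point, here is a minimal sketch of what such a naive gate could look like (illustrative only: the counter name, the re-arm value, and `kfence_guarded_alloc()` are made up for this sketch, not the actual patch):

```c
#include <linux/percpu.h>
#include <linux/slab.h>

/* Illustrative: skip this many allocations between guarded ones. */
#define KFENCE_SKIP_COUNT 100

void *kfence_guarded_alloc(struct kmem_cache *s, size_t size, gfp_t flags); /* assumed */

static DEFINE_PER_CPU(long, kfence_skip_counter) = KFENCE_SKIP_COUNT;

static __always_inline void *kfence_alloc(struct kmem_cache *s, size_t size,
					  gfp_t flags)
{
	/* Fast path: one per-CPU decrement and one conditional branch. */
	if (likely(this_cpu_dec_return(kfence_skip_counter) > 0))
		return NULL;	/* caller falls back to the regular slab path */

	this_cpu_write(kfence_skip_counter, KFENCE_SKIP_COUNT);	/* re-arm */
	return kfence_guarded_alloc(s, size, flags);	/* assumed slow path */
}
```

Even when no guarded allocation happens, every allocation pays for the decrement and branch, which is the overhead being measured below.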
Parameters:
sysbench --test=fileio --file-total-size=1G --file-num=1024 prepare
ulimit -n 100000
sysbench --num-threads=8 --test=fileio --file-total-size=1G --file-num=1024 --file-test-mode=rndrw --max-time=300 --max-requests=0 run
With that we get:
KFENCE:
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 8
Extra file open flags: 0
1024 files, 1Mb each
1Gb total file size
Block size 16Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Threads started!
Time limit exceeded, exiting...
(last message repeated 7 times)
Done.
Operations performed: 1411735 Read, 941160 Write, 23863572 Other = 26216467 Total
Read 21.541Gb Written 14.361Gb Total transferred 35.902Gb (122.55Mb/sec)
7842.98 Requests/sec executed
Test execution summary:
    total time:                          300.0002s
    total number of events:              2352895
    total time taken by event execution: 95.0336
    per-request statistics:
        min:                    0.00ms
        avg:                    0.04ms
        max:                    6.47ms
        approx. 95 percentile:  0.10ms

Threads fairness:
    events (avg/stddev):         294111.8750/1356.42
    execution time (avg/stddev): 11.8792/0.06
Baseline:
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 8
Extra file open flags: 0
1024 files, 1Mb each
1Gb total file size
Block size 16Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Threads started!
Time limit exceeded, exiting...
(last message repeated 7 times)
Done.
Operations performed: 1455711 Read, 970478 Write, 24606888 Other = 27033077 Total
Read 22.212Gb Written 14.808Gb Total transferred 37.021Gb (126.36Mb/sec)
8087.29 Requests/sec executed
Test execution summary:
    total time:                          300.0002s
    total number of events:              2426189
    total time taken by event execution: 98.9264
    per-request statistics:
        min:                    0.00ms
        avg:                    0.04ms
        max:                    6.66ms
        approx. 95 percentile:  0.10ms

Threads fairness:
    events (avg/stddev):         303273.6250/1345.12
    execution time (avg/stddev): 12.3658/0.03
Conclusion: the naive version as-is costs ~3% in throughput (7842.98 vs. 8087.29 requests/sec).
Results using static keys, with a sample rate of 100ms:
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 8
Extra file open flags: 0
1024 files, 1Mb each
1Gb total file size
Block size 16Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Threads started!
Time limit exceeded, exiting...
(last message repeated 7 times)
Done.
Operations performed: 1452793 Read, 968532 Write, 24523965 Other = 26945290 Total
Read 22.168Gb Written 14.779Gb Total transferred 36.946Gb (126.11Mb/sec)
8071.08 Requests/sec executed
Test execution summary:
    total time:                          300.0002s
    total number of events:              2421325
    total time taken by event execution: 99.5015
    per-request statistics:
        min:                    0.00ms
        avg:                    0.04ms
        max:                    7.43ms
        approx. 95 percentile:  0.10ms

Threads fairness:
    events (avg/stddev):         302665.6250/1652.29
    execution time (avg/stddev): 12.4377/0.05
Conclusion: the slowdown is negligible (8071.08 vs. 8087.29 requests/sec, ~0.2%).
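For context, the static-keys variant measured here works roughly like the following sketch (simplified, with assumed names such as `kfence_guarded_alloc()`; the real patch differs in details): the fast-path branch is patched out via a static key almost all of the time, and a delayed work item re-enables it once per sample period (100ms above).

```c
#include <linux/atomic.h>
#include <linux/jiffies.h>
#include <linux/jump_label.h>
#include <linux/workqueue.h>

void *kfence_guarded_alloc(struct kmem_cache *s, size_t size, gfp_t flags); /* assumed */

static DEFINE_STATIC_KEY_FALSE(kfence_allocation_key);
static atomic_t kfence_allocation_gate = ATOMIC_INIT(1);
static unsigned long kfence_sample_interval = 100;	/* ms; cf. kfence.sample_rate */

static __always_inline void *kfence_alloc(struct kmem_cache *s, size_t size,
					  gfp_t flags)
{
	/* While the key is disabled, this is a patched-out jump: ~zero cost. */
	if (!static_branch_unlikely(&kfence_allocation_key))
		return NULL;
	/* Gate is open: at most one allocation wins the race to use it. */
	if (atomic_cmpxchg(&kfence_allocation_gate, 0, 1) != 0)
		return NULL;
	return kfence_guarded_alloc(s, size, flags);	/* assumed slow path */
}

static void toggle_allocation_gate(struct work_struct *work);
static DECLARE_DELAYED_WORK(kfence_timer, toggle_allocation_gate);

static void toggle_allocation_gate(struct work_struct *work)
{
	atomic_set(&kfence_allocation_gate, 0);
	static_branch_enable(&kfence_allocation_key);	/* open the gate */
	/* ... wait until an allocation has claimed the gate ... */
	static_branch_disable(&kfence_allocation_key);	/* close it again */
	schedule_delayed_work(&kfence_timer,
			      msecs_to_jiffies(kfence_sample_interval));
}
```

The point of the design is that between samples the allocation fast path contains no loads or stores at all, only a no-op that the static-key machinery patches in and out.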
What I saw when benchmarking on Android was that the results of several benchmark runs were distributed around a single mean as long as I didn't reboot the device, but the mean shifted once I repeated the measurements after a reboot. So we'll probably need to perform the measurements multiple times, with reboots in between, before reporting performance numbers along with the upstream patches.
sysbench --test=fileio --file-total-size=1G --file-num=1024 prepare
ulimit -n 100000
sysbench --num-threads=8 --test=fileio --file-total-size=1G --file-num=1024 --file-test-mode=rndrw --max-time=60 --max-requests=0 run
Baseline (no KFENCE):
[ 1] 8151.29 Requests/sec
[ 2] 8410.38 Requests/sec
[ 3] 7906.54 Requests/sec
[ 4] 7964.18 Requests/sec
[ 5] 7944.54 Requests/sec
[med] 7964.18 Requests/sec
KFENCE (static keys, skip guarded allocations, kfence.sample_rate=100):
[ 1] 8001.17 Requests/sec
[ 2] 7744.11 Requests/sec
[ 3] 7885.21 Requests/sec
[ 4] 7856.12 Requests/sec
[ 5] 7872.87 Requests/sec
[med] 7872.87 Requests/sec
KFENCE (static keys, with guarded allocations, kfence.sample_rate=100):
[ 1] 7940.12 Requests/sec
[ 2] 7869.42 Requests/sec
[ 3] 7814.59 Requests/sec
[ 4] 7986.49 Requests/sec
[ 5] 7917.77 Requests/sec
[med] 7917.77 Requests/sec
Comparing medians, we observe a performance degradation of ~0.58% for the guarded-allocations configuration ((7964.18 − 7917.77) / 7964.18).
Why is "skip guarded allocations" slower than "with guarded allocations"? I would assume allocating protected pages should slow down things. What am I missing?
But it's good we can sense the overhead with the benchmark. Please estimate at what rate the overhead becomes negligible. I think in real life we can do 100-1000x lower sampling rate, I think it's important to be able to say that above X rate, there is no noticeable overhead.
Why is "skip guarded allocations" slower than "with guarded allocations"? I would assume allocating protected pages should slow down things. What am I missing?
I think there is a high variance, and 5 samples might not be enough.
> But it's good that we can sense the overhead with the benchmark. Please estimate at what rate the overhead becomes negligible. I think in real life we can use a 100-1000x lower sampling rate, so it's important to be able to say that above rate X there is no noticeable overhead.

I've noticed that this doesn't change much with e.g. 1000ms (instead of 100ms). I think that's because we still have some overhead elsewhere that isn't due to static-branch switching. We currently think it's the kfree() fast path.
By the way, will you have a chance to benchmark the stealing impl in its current state? I am curious whether this approach was viable or not.
> By the way, will you have a chance to benchmark the stealing impl in its current state?
KFENCE (stealing, kfence.sample_rate=100):
[ 1] 7815.89 Requests/sec
[ 2] 7776.86 Requests/sec
[ 3] 7708.77 Requests/sec
[ 4] 7702.80 Requests/sec
[ 5] 7787.02 Requests/sec
[med] 7776.86 Requests/sec
Comparing medians, stealing is ~1.8% slower than the static-keys version (7776.86 vs. 7917.77 requests/sec).
After the kfree() optimization in https://github.com/google/kasan/commit/18967a22c724fdf54489997ecc2e2d7a6aa4cd9e:
KFENCE (static keys, with guarded allocations, kfence.sample_rate=100):
[ 1] 7999.14 Requests/sec
[ 2] 7952.49 Requests/sec
[ 3] 8045.04 Requests/sec
[ 4] 7976.90 Requests/sec
[ 5] 8141.59 Requests/sec
[med] 7999.14 Requests/sec
=> This matches the baseline without KFENCE very closely; the small remaining difference can be attributed to variance.
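The kind of check that commit targets can be sketched as follows (an illustration under assumptions, not the commit itself; the pool symbol, pool size, and caller are placeholders): since all KFENCE objects come from one contiguous pool, deciding whether a free must take the KFENCE path can cost a single unsigned range comparison on the fast path.

```c
#include <linux/mm.h>

extern char *__kfence_pool;			/* assumed: base of the KFENCE pool */
#define KFENCE_POOL_SIZE (255 * 2 * PAGE_SIZE)	/* illustrative size */

void kfence_free(void *ptr);			/* assumed KFENCE slow-path free */

static __always_inline bool is_kfence_address(const void *addr)
{
	/*
	 * The unsigned subtraction folds the lower and upper bounds
	 * checks into a single compare-and-branch, which is all that
	 * frees of regular (non-KFENCE) objects have to pay.
	 */
	return unlikely((unsigned long)((char *)addr - __kfence_pool) <
			KFENCE_POOL_SIZE);
}

void kfree_sketch(void *ptr)	/* hypothetical caller, not kfree() itself */
{
	if (is_kfence_address(ptr)) {
		kfence_free(ptr);
		return;
	}
	/* ... regular slab free ... */
}
```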
We need to re-benchmark the final version anyway, so the results here won't matter much.
However, for the sake of this issue, we have benchmarked the implementations.
Before ruling out the naive implementation that modifies the fast path, let's benchmark it and make a decision based on data. We'll likely need this data one way or another.
Some benchmarks to run: