ruuda / claxon

A FLAC decoder in Rust
Apache License 2.0
287 stars 28 forks source link

Attempt safe and fast buffer allocation #14

Closed Shnatsel closed 6 years ago

Shnatsel commented 6 years ago

This should be the fastest way to safely initialize an intermediate buffer. vec![value; lenght] desugars into Vec::from_elem() which has a fast path for value 0 that requests zeroed memory from the OS, and the OS zeroes memory on demand. Addresses #13.

I have converted the benchmarking suite to Criterion to get more accurate benchmarks, specifically so that I would be able to reliably detect <2% changes in performance which is impossible with regular cargo-bench. It also has an added bonus of working on stable Rust. I can open a PR for that if you'd like to have that in master.

According to the testsamples.rs benchmark, as well as measuring execution time of the bench_decode binary with hyperfine, this change makes no difference in terms of performance. I have tested this both on rustc 1.28.0 (9634041f0 2018-07-30) and rustc 1.30.0-nightly (d767ee116 2018-08-15) with identical results.

However, the performance difference between Claxon master and libflac 1.3.1 is 1.44, not 1.13 as advertised in README.md; it is possible that on my machine the execution is bottlenecked elsewhere, and I would not notice performance degradation caused by this PR. So please run this through your benchmarking setup before merging.

ruuda commented 6 years ago

Sorry about the confusing benchmarks, benches/testsamples.rs is a remnant that I should remove. There are tools/benchmark.sh to run the benchmark and tools/compare_benches.r to compare two runs.

Usage is as follows:

# Put a few flac files in testsamples/extra.
# I use five files, all 44100 Hz, 16 bit, stereo (CD quality), together 193 MiB.
git checkout master
tools/benchmark.sh before
git checkout alternative
tools/benchmark.sh after
tools/compare_benches.r before_*_all.dat after_*_all.dat

That will print a table like this:

metric value after (mean and stddev) fraction of before
p10 (0.1 quantile time per sample) 18.1 ± 0.2 ns 1.000 ± 0.016
p50 (median time per sample) 18.9 ± 0.3 ns 1.000 ± 0.019
p90 (0.9 quantile time per sample) 20.0 ± 0.4 ns 0.996 ± 0.040
μ (mean time per sample) 19.2 ± 0.3 ns 0.999 ± 0.020
τ (input bytes throughput) 66.7 ± 3.4 MiB/s 1.001 ± 0.074

Depending on the kind of input file, this change is indeed not going to make a difference, because the buffer is recycled and allocated with the right size immediately. To test the worst case impact, you can replace this line to always pass in a clean Vec::new(), which will be too small, triggering ensure_buffer_len to allocate a bigger one on every block.

However, the performance difference between Claxon master and libflac 1.3.1 is 1.44, not 1.13 as advertised in README.md; it is possible that on my machine the execution is bottlenecked elsewhere, and I would not notice performance degradation caused by this PR. So please run this through your benchmarking setup before merging.

This is interesting ... when I run Claxon again now on the same system, I get about 1.4 times libflac too, even on Claxon 0.3.0. One thing that always stood out was that although Claxon was close to libflac in running time, it was executing many more instructions, and indeed on my Raspberry Pi it was not within a factor 2 of libflac if I recall correctly. So if Claxon was almost as fast as libflac, it was through the CPU being able to deal with those extra instructions, not because it did as little work as libflac. It might be that we are seeing the effects of Meltdown/Spectre/Foreshadow mitigations here.

Shnatsel commented 6 years ago

It might be that we are seeing the effects of Meltdown/Spectre/Foreshadow mitigations here.

I'd say that a rustc performance regression is a lot more realistic, e.g. some loops not being vectorized in latest versions even though they used to be vectorized in earlier ones.

I've tried to use tools/benchmark.sh but I couldn't get it to work, and didn't want to get into debugging it just yet.

ruuda commented 6 years ago

I'd say that a rustc performance regression is a lot more realistic

This is the cause; I just ran it on Rust 1.13.0 and it's consistent with the value I measured before (I got 1.16 ± 0.09 now). I’ll dig a bit deeper and file a bug upstream.

Shnatsel commented 6 years ago

On rustc 1.29.0 (aa3ca1994 2018-09-11) I get 1.51x time compared to system libflac (Ubuntu 16.04, package version 1.3.1-4) with -C codegen-units=1 and a staggering 1.77x without it. It appears that another rustc regression has hit Claxon.

Shnatsel commented 6 years ago

Actually, scratch that. I get 1.51x on 1.28 too, and even more on 1.27; it's back to 1.51x on 1.25. So probably not a regression.

Which files did you use for the comparison that gave 1.10x?

ruuda commented 6 years ago

Which files did you use for the comparison that gave 1.10x

I use five files from my personal library; unfortunately these cannot be shared, but the ones from archive.org should give similar results. It might depend a lot on your CPU though. I am using a Skylake i7-6700HQ.

The Rust issue here contains a bit more information: https://github.com/rust-lang/rust/issues/53833.

Shnatsel commented 6 years ago

I'm testing on an AMD Vishera CPU. That might be the reason why I'm getting different results.

Shnatsel commented 6 years ago

I've attempted to rewrite the buffer allocation using appends to Vec so that memory zeroing would not be needed at all. But apparently I've messed up some math somewhere and the output is no longer correct, and without unit tests on decode_* functions it's not trivial to isolate or debug. So I've fixed this branch instead.

For reference, the version with appending to Vec can be found here: https://github.com/Shnatsel/claxon/tree/safe-and-sliceless

ruuda commented 6 years ago

I did some benchmarking. The performance impact is slight but significant. In the typical case, it leads to a 0.03% increase in decode time, for an adversarial decoding setup it can be as much as 0.2%. I think that's small enough to be worth the effort.

Implemented myself in https://github.com/ruuda/claxon/commit/fec24678c7086aa4b2528bd7542f31d978b81b90, see also the commit message for full benchmark details.

Thanks for getting this discussion started!