Open landaire opened 5 months ago
I failed to recognize that the slow path exists for scenarios where buf
is not contiguous. The conditions in decode_varint(),
though, are what really kill perf. In fact, making it an if {} else {}
with the current unrolled loop is better than my rewrite.
That's really cool! I reproduced your result on the same machine as previous benchmarks.
```
$ cargo bench -- decode/pooled_write_request
    Finished bench [optimized] target(s) in 0.06s
     Running unittests src/lib.rs (target/release/deps/bench_prom-4775082acf9e7701)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 1 filtered out; finished in 0.00s

     Running benches/prom_decode.rs (target/release/deps/prom_decode-6ac74029ce805ccf)
Gnuplot not found, using plotters backend
Benchmarking decode/pooled_write_request: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.3s, enable flat sampling, or reduce sample count to 60.
decode/pooled_write_request
                        time:   [1.0400 ms 1.0411 ms 1.0423 ms]
                        change: [-0.3906% -0.1422% +0.0883%] (p = 0.25 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high severe
```
> I failed to recognize that the slow path exists for scenarios where buf is not contiguous.

Are these changes you made to prost still correct?
No, unfortunately not -- at least not for the general case. For the purposes of your benchmark, or if you are 100% certain you will never have a fragmented Buf,
it's correct. If for some reason your input buffer is not contiguous, it will fail when reading a varint
that crosses two discontiguous chunks.
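To make the failure mode concrete, here is a minimal sketch (illustrative code, not prost's actual implementation): a slice-based fast path only ever sees one contiguous chunk, so a varint whose continuation bytes live in the next chunk cannot be finished from that slice alone.

```rust
// Minimal LEB128 varint decoder over a single contiguous slice.
// Returns the decoded value and how many bytes were consumed, or
// None if the varint does not end within this slice.
fn decode_varint_slice(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &b) in bytes.iter().enumerate().take(10) {
        // Each byte contributes its low 7 bits.
        value |= u64::from(b & 0x7f) << (i * 7);
        if b < 0x80 {
            // High bit clear: this was the final byte.
            return Some((value, i + 1));
        }
    }
    // Either the varint continues past this slice or it is malformed.
    None
}

fn main() {
    // 300 encodes as [0xac, 0x02].
    assert_eq!(decode_varint_slice(&[0xac, 0x02]), Some((300, 2)));

    // A fragmented buffer: the varint's two bytes land in different chunks.
    let chunks: [&[u8]; 2] = [&[0xac], &[0x02]];
    // A fast path that only sees the first chunk cannot finish the varint,
    // which is why a slow path over the full (possibly chunked) buffer exists.
    assert_eq!(decode_varint_slice(chunks[0]), None);
    println!("ok");
}
```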
I'm trying to figure out how this if
condition could be rewritten to produce faster code (i.e., pleasing the branch predictor): https://github.com/tokio-rs/prost/blob/e3deaa200b3a5500bf0403325d02716973b7296a/src/encoding.rs#L54-L63
I found that rewriting it in a few different ways, including adding an extra condition, seems to speed up the prost
benchmarks but not the Greptime benchmark. For example:
```rust
if len >= 10 || len == rem {
    let (value, advance) = decode_varint_slice(bytes)?;
    buf.advance(advance);
    Ok(value)
} else {
    decode_varint_slow(buf)
}
```
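For context on the len >= 10 arm of that guard: a LEB128-encoded u64 never exceeds 10 bytes (each byte carries 7 payload bits, and ceil(64 / 7) = 10), so a contiguous chunk of at least 10 bytes is always enough for the slice fast path to read a complete varint. A quick standalone check of that bound (illustrative code, not prost's encoder):

```rust
// Length of a u64 encoded as a LEB128 varint.
fn varint_len(mut v: u64) -> usize {
    let mut n = 1;
    // Every iteration moves 7 more bits into an extra byte.
    while v >= 0x80 {
        v >>= 7;
        n += 1;
    }
    n
}

fn main() {
    assert_eq!(varint_len(0), 1);
    assert_eq!(varint_len(127), 1);   // largest 1-byte value
    assert_eq!(varint_len(128), 2);   // smallest 2-byte value
    // Worst case: u64::MAX needs ceil(64 / 7) = 10 bytes, which is why
    // a chunk of length >= 10 can always satisfy the fast path.
    assert_eq!(varint_len(u64::MAX), 10);
    println!("ok");
}
```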
This yields the following results in prost
benchmarks:
```
varint/small/decode     time:   [32.028 ns 32.147 ns 32.279 ns]
                        change: [-84.245% -84.141% -84.048%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
varint/medium/decode    time:   [154.64 ns 155.26 ns 155.99 ns]
                        change: [-29.224% -28.274% -27.316%] (p = 0.00 < 0.05)
                        Performance has improved.
varint/large/decode     time:   [307.91 ns 309.42 ns 311.17 ns]
                        change: [-9.2315% -8.4735% -7.5222%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
varint/mixed/decode     time:   [224.22 ns 228.21 ns 233.62 ns]
                        change: [-17.977% -16.333% -13.952%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe
```
But again, no change in Greptime benchmarks.
I think this certainly highlights, though, that additional tradeoffs were made, like accounting for discontiguous memory, that are not present in the Go version of protobuf.
This seems like the perfect case for specialization if it's ever stabilized...
My latest change (https://github.com/tokio-rs/prost/commit/7c6da11d342c766e1677ed9e48c3a5f6b5cc5bb5) on the same branch handles non-contiguous memory and for me saw a 19.318% performance improvement:
```
decode/pooled_write_request
                        time:   [1.3654 ms 1.3715 ms 1.3807 ms]
                        change: [-19.858% -19.318% -18.595%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
```
Curiously, though, prost's own benchmarks still show regressions overall:
```
varint/small/decode     time:   [97.933 ns 98.461 ns 99.137 ns]
                        change: [-53.961% -53.067% -52.354%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) high mild
  11 (11.00%) high severe
varint/medium/decode    time:   [273.87 ns 276.03 ns 278.71 ns]
                        change: [+22.922% +24.307% +25.636%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  9 (9.00%) high mild
  1 (1.00%) high severe
varint/large/decode     time:   [392.60 ns 400.02 ns 410.31 ns]
                        change: [+15.371% +20.303% +27.681%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) high mild
  13 (13.00%) high severe
varint/mixed/decode     time:   [284.02 ns 284.36 ns 284.81 ns]
                        change: [+3.9519% +4.4581% +4.9251%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe
```
This PR should not be merged.
I was reading the article yesterday and made some changes that allowed me to see a lot of time being spent in decode_varint.

This is with the following patch applied (prost is the latest master commit). It removes any noise from reallocating vectors, which is kind of cheating for this scenario, but should only serve to reduce noise.

The two biggest areas for improvement now are the RepeatedField&lt;T&gt;::push_default() and prost::encoding::decode_varint() functions. I decided to look at the latter and noted a couple of things (reference: https://github.com/tokio-rs/prost/blob/e3deaa200b3a5500bf0403325d02716973b7296a/src/encoding.rs#L57):

1. The fast path (decode_varint_slice()) is only taken when the buffer length is >= 10 or the last byte of the buffer is < 0x80. This is fine.
2. In decode_varint() we check the same conditions that decode_varint_slice() asserts on, and even though decode_varint_slice() is marked as #[inline], I don't think the optimizer will remove these duplicate checks, since the behavior of the conditions is different: failing the checks in decode_varint() results in decode_varint_slow() being called, while failing them in decode_varint_slice() results in an assertion failure. (I did not verify this by looking at the assembly.) I think this has some problems.
3. decode_varint() has two conditions in its main body: the byte value is < 0x80 or < 2 (checking for the presence of another byte and for overflow, respectively). I'm not sure if this makes a practical impact, but if we can eliminate both of these conditions or front-load them, we may be able to speed things up. I think this is fine.

I took a stab at rewriting this logic using safe Rust and iterators here: https://github.com/tokio-rs/prost/compare/master...landaire:prost:varint_slice_perf

Assembly comparison: original, modified.

My solution looked somewhat like the slow path and handles all cases of both the slow and fast paths, so I also removed the slow-path code. All prost tests pass.

For this benchmark it resulted in a 37% perf increase. For the prost benchmarks the numbers are a little iffy. Viewing the performance of the benchmark under a flamegraph shows that we no longer even see decode_varint in the sampling!

I would ask if you could try out my branch to see if you can reproduce these results. The PR here is just for the sake of showing the use of my branch of prost as a patched dependency.
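To make the two conditions mentioned above concrete (a byte < 0x80 means it is the final byte, and the tenth byte must be < 2 or the value would overflow a u64), here is a minimal per-byte decoder sketch. The structure and error messages are illustrative, not prost's actual implementation:

```rust
// Per-byte LEB128 varint decode with explicit continuation and
// overflow checks (a sketch of the general technique).
fn decode_varint(bytes: &[u8]) -> Result<(u64, usize), &'static str> {
    let mut value = 0u64;
    for (i, &b) in bytes.iter().enumerate() {
        if i == 9 {
            // The first 9 bytes supplied 9 * 7 = 63 bits, so the tenth
            // byte may only contribute the final bit: it must be 0 or 1.
            if b >= 2 {
                return Err("varint overflows u64");
            }
            return Ok((value | (u64::from(b) << 63), 10));
        }
        value |= u64::from(b & 0x7f) << (i * 7);
        if b < 0x80 {
            // High bit clear: no continuation byte follows.
            return Ok((value, i + 1));
        }
    }
    Err("truncated varint")
}

fn main() {
    assert_eq!(decode_varint(&[0x01]), Ok((1, 1)));
    assert_eq!(decode_varint(&[0xac, 0x02]), Ok((300, 2)));

    // u64::MAX encodes as nine 0xff bytes followed by 0x01.
    let mut max = [0xffu8; 10];
    max[9] = 0x01;
    assert_eq!(decode_varint(&max), Ok((u64::MAX, 10)));

    // Ten 0xff bytes would need a 65th bit: overflow.
    assert!(decode_varint(&[0xffu8; 10]).is_err());
    println!("ok");
}
```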