Use fork of prost with changes to decode_varint_slice

This PR should not be merged.

I was reading the article yesterday and made some changes that allowed me to see a lot of time being spent in decode_varint:

CleanShot 2024-04-16 at 10 04 43@2x

This is with using the following patch (prost is the latest master commit):

diff --git a/Cargo.lock b/Cargo.lock
index 91cbb1d..b5925bb 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -833,7 +833,6 @@ dependencies = [
 [[package]]
 name = "prost"
 version = "0.12.4"
-source = "git+https://github.com/landaire/prost?branch=varint_slice_perf#5b9cce67e7f46dfe8f9d64b97b21d6ba9d8883b1"
 dependencies = [
  "bytes",
  "prost-derive",
@@ -863,10 +862,9 @@ dependencies = [
 [[package]]
 name = "prost-derive"
 version = "0.12.4"
-source = "git+https://github.com/landaire/prost?branch=varint_slice_perf#5b9cce67e7f46dfe8f9d64b97b21d6ba9d8883b1"
 dependencies = [
  "anyhow",
- "itertools 0.12.1",
+ "itertools 0.10.5",
  "proc-macro2",
  "quote",
  "syn",
diff --git a/Cargo.toml b/Cargo.toml
index 297e370..4d79a8e 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -18,7 +18,7 @@ prost = "0.12"
 criterion = "0.5"

 [patch.crates-io]
-prost = { git = "https://github.com/landaire/prost", branch = "varint_slice_perf" }
+prost = { path = "../prost" }

 [[bench]]
 name = "prom_decode"
diff --git a/src/prom_write_request.rs b/src/prom_write_request.rs
index 55e1bf2..5db4ec8 100644
--- a/src/prom_write_request.rs
+++ b/src/prom_write_request.rs
@@ -163,9 +163,11 @@ impl TimeSeries {
 }
 impl ::core::default::Default for TimeSeries {
     fn default() -> Self {
+        let labels = RepeatedField::from_vec(Vec::with_capacity(8));
+        let samples = RepeatedField::from_vec(Vec::with_capacity(8));
         TimeSeries {
-            labels: ::core::default::Default::default(),
-            samples: ::core::default::Default::default(),
+            labels,
+            samples,
         }
     }
 }
@@ -236,8 +238,9 @@ impl WriteRequest {

 impl Default for WriteRequest {
     fn default() -> Self {
+        let timeseries = RepeatedField::from_vec(Vec::with_capacity(2048));
         WriteRequest {
-            timeseries: Default::default(),
+            timeseries,
         }
     }
 }

This removes any noise from reallocating vectors, but is kind of cheating for this scenario and should only serve to reduce noise.

The two biggest areas for improvement now are in the RepeatedField<T>::push_default() and prost::encoding::decode_varint() functions. I decided to look at the latter and noted a couple things (reference: https://github.com/tokio-rs/prost/blob/e3deaa200b3a5500bf0403325d02716973b7296a/src/encoding.rs#L57)

There's a fast path for 1-byte varints
There's an optimized path that has a pre-unrolled loop for undetermined-length varints when the total length of the buffer is > 10 or the last byte of the buffer is < 0x80.
There's a "slow" path that uses an iterative solution

Point 1 is fine

Point 3 I think is fine

Point 2 I think has some problems:

In decode_varint() we check the same conditions that decode_varint_slice() asserts on, and even though decode_varint_slice() is marked as #[inline] I don't think the optimizer will remove these duplicate checks since the behavior of the conditions is different. In decode_varint() failing the checks will result in decode_varint_slow() being called and failing in decode_varint_slice() results in an assertion failure. I did not verify this by looking at the assembly.
- Edit: It does indeed get optimized away (see LBB0_2, LBB0_3): https://rust.godbolt.org/z/36Tob4j6o
decode_varint() has two conditions in its main body: the byte value is < 0x80 or < 2 (checking for presence of another byte and overflow, respectively). I'm not sure if this makes a practical impact, but if we can eliminate both of these conditions or front-load them we may be able to speed things up.

I took a stab at rewriting this logic using safe Rust and iterators here: https://github.com/tokio-rs/prost/compare/master...landaire:prost:varint_slice_perf

Assembly comparison: original, modified.

My solution looked somewhat like the slow path and handles all cases of both slow/fast path, so I also removed the slow path code. All prost tests pass.

For this benchmark it resulted in a 37% perf increase:

decode/pooled_write_request
                        time:   [1.0516 ms 1.0549 ms 1.0589 ms]
                        change: [-37.862% -37.499% -37.163%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe

For the prost benchmarks the numbers are a little iffy:

❯ cargo bench -- decode
   Compiling prost v0.12.4 (/private/tmp/prost)
warning: function `decode_varint_slow` is never used
   --> src/encoding.rs:117:4
    |
117 | fn decode_varint_slow<B>(buf: &mut B) -> Result<u64, DecodeError>
    |    ^^^^^^^^^^^^^^^^^^
    |
    = note: `#[warn(dead_code)]` on by default

warning: `prost` (lib) generated 1 warning
    Finished bench [optimized + debuginfo] target(s) in 1.55s
     Running benches/varint.rs (target/release/deps/varint-5c567e9ac0b3a83e)
varint/small/decode     time:   [31.219 ns 31.280 ns 31.353 ns]
                        change: [-84.681% -84.537% -84.409%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  9 (9.00%) high severe

varint/medium/decode    time:   [245.08 ns 245.73 ns 246.46 ns]
                        change: [+11.111% +11.777% +12.524%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

varint/large/decode     time:   [395.33 ns 396.75 ns 398.51 ns]
                        change: [+17.478% +18.077% +18.702%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

varint/mixed/decode     time:   [258.95 ns 259.78 ns 260.79 ns]
                        change: [-7.8103% -7.1917% -6.5964%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

And viewing the performance of the benchmark under a flamegraph shows that we no longer even see decode_varint in the sampling!

CleanShot 2024-04-16 at 10 19 59@2x

I would ask if you could try out my branch to see if you can reproduce these results. The PR here is just for the sake of showing using my branch of prost as a patched dependency.

I failed to recognize that the slow path exists for scenarios where buf is not contiguous. The conditions though in decode_varint() are what really kill perf. In fact, making it an if {} else {} with the current unrolled loop is better than my rewrite.

That's really cool! I reproduced your result on the same machine as previous benchmarks.

$ cargo bench -- decode/pooled_write_request
    Finished bench [optimized] target(s) in 0.06s
     Running unittests src/lib.rs (target/release/deps/bench_prom-4775082acf9e7701)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 1 filtered out; finished in 0.00s

     Running benches/prom_decode.rs (target/release/deps/prom_decode-6ac74029ce805ccf)
Gnuplot not found, using plotters backend
Benchmarking decode/pooled_write_request: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.3s, enable flat sampling, or reduce sample count to 60.
decode/pooled_write_request
                        time:   [1.0400 ms 1.0411 ms 1.0423 ms]
                        change: [-0.3906% -0.1422% +0.0883%] (p = 0.25 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high severe

I failed to recognize that the slow path exists for scenarios where buf is not contiguous.

Is these changes you made to prost still correct?

Is these changes you made to prost still correct?

No, unfortunately not -- at least not for the general case. For the purposes of your benchmark, or if you are 100% certain you will never have a fragmented Buf it's correct. If for some reason your input buffer is not contiguous it will fail when reading a varint that crosses two chunks which are discontinous.

I'm trying to figure out how the this if condition could be rewritten to result in faster code (I.e. pleasing the branch predictor): https://github.com/tokio-rs/prost/blob/e3deaa200b3a5500bf0403325d02716973b7296a/src/encoding.rs#L54-L63

I found that rewriting it in a few different ways, including adding an extra condition, seems to speed up the prost benchmarks but not the Greptime benchmark. For example:

    if len >= 10 || len == rem {
        let (value, advance) = decode_varint_slice(bytes)?;
        buf.advance(advance);
        Ok(value)
    } else {
        decode_varint_slow(buf)
    }
}

This yields the following results in prost benchmarks:

varint/small/decode     time:   [32.028 ns 32.147 ns 32.279 ns]
                        change: [-84.245% -84.141% -84.048%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

varint/medium/decode    time:   [154.64 ns 155.26 ns 155.99 ns]
                        change: [-29.224% -28.274% -27.316%] (p = 0.00 < 0.05)
                        Performance has improved.

varint/large/decode     time:   [307.91 ns 309.42 ns 311.17 ns]
                        change: [-9.2315% -8.4735% -7.5222%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

varint/mixed/decode     time:   [224.22 ns 228.21 ns 233.62 ns]
                        change: [-17.977% -16.333% -13.952%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

But again, no change in Greptime benchmarks.

I think this certainly highlights though that there are additional tradeoffs that were made, like accounting for discontinuous memory, that are not present in the Go version of protobuf.

This seems like the perfect case for specialization if it's ever stabilized...

My latest change (https://github.com/tokio-rs/prost/commit/7c6da11d342c766e1677ed9e48c3a5f6b5cc5bb5) on the same branch handles non-contiguous memory and for me saw a 19.318% performance improvement:

decode/pooled_write_request
                        time:   [1.3654 ms 1.3715 ms 1.3807 ms]
                        change: [-19.858% -19.318% -18.595%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

Curiously though prost's benchmarks still show otherwise overall:

varint/small/decode     time:   [97.933 ns 98.461 ns 99.137 ns]
                        change: [-53.961% -53.067% -52.354%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) high mild
  11 (11.00%) high severe

varint/medium/decode    time:   [273.87 ns 276.03 ns 278.71 ns]
                        change: [+22.922% +24.307% +25.636%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  9 (9.00%) high mild
  1 (1.00%) high severe

varint/large/decode     time:   [392.60 ns 400.02 ns 410.31 ns]
                        change: [+15.371% +20.303% +27.681%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) high mild
  13 (13.00%) high severe

varint/mixed/decode     time:   [284.02 ns 284.36 ns 284.81 ns]
                        change: [+3.9519% +4.4581% +4.9251%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

v0y4g3r / prom-write-request-bench

Use fork of prost with changes to decode_varint_slice #4