obi1kenobi / cargo-semver-checks

Scan your Rust crate for semver violations.
Apache License 2.0
1.18k stars 75 forks source link

Check if using `simd-json` to parse rustdoc produces any speedup #885

Open obi1kenobi opened 2 months ago

obi1kenobi commented 2 months ago

Many of the rustdoc JSON files we parse are large — from a few MB to ~500MB in size. In the largest cases, we spend ~5s parsing JSON per cargo-semver-checks run.

Speeding up JSON parsing by switching from serde_json to simd-json might be an easy win. Let's check!

This is a good first issue for someone who is already familiar with Rust and wants to start contributing to this project. It might not be a good first issue for folks new to Rust.

Licenser commented 2 months ago

super quick first hacky test:

     Running benches/read_rustdoc.rs (target/release/deps/read_rustdoc-cecf84416ef1155c)
Benchmarking aws_sdk_ec2: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 324.3s, or reduce sample count to 10.
aws_sdk_ec2             time:   [2.7113 s 2.8143 s 2.9515 s]
                        change: [-10.535% -6.8687% -2.5764%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

Benchmarking associated_items_hidden_from_public_api: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.2s, enable flat sampling, or reduce sample count to 50.
associated_items_hidden_from_public_api
                        time:   [1.4220 ms 1.4288 ms 1.4368 ms]
                        change: [-20.113% -17.691% -16.064%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

Benchmarking async_impl_future_equivalence: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.4s, enable flat sampling, or reduce sample count to 50.
async_impl_future_equivalence
                        time:   [1.5397 ms 1.5909 ms 1.6484 ms]
                        change: [-20.642% -16.935% -13.273%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  8 (8.00%) high mild
  3 (3.00%) high severe

Benchmarking auto_trait_impl_removed: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.9s, enable flat sampling, or reduce sample count to 50.
auto_trait_impl_removed time:   [1.6121 ms 1.6854 ms 1.7910 ms]
                        change: [-14.406% -10.084% -5.5161%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) high mild
  11 (11.00%) high severe

Benchmarking broken_rustdoc: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.2s, enable flat sampling, or reduce sample count to 50.
broken_rustdoc          time:   [1.4206 ms 1.4333 ms 1.4533 ms]
                        change: [-20.782% -18.894% -16.928%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  5 (5.00%) high severe

constructible_struct_adds_field
                        time:   [1.9456 ms 1.9870 ms 2.0368 ms]
                        change: [-20.827% -17.721% -14.686%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe

constructible_struct_adds_private_field
                        time:   [2.0114 ms 2.0590 ms 2.1110 ms]
                        change: [-14.397% -12.036% -9.5048%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

Benchmarking counterexample_for_ignoring_doc_hidden_items: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.6s, enable flat sampling, or reduce sample count to 50.
counterexample_for_ignoring_doc_hidden_items
                        time:   [1.4559 ms 1.4859 ms 1.5221 ms]
                        change: [-20.654% -18.320% -15.948%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

Benchmarking derive_trait_impl_removed: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.1s, enable flat sampling, or reduce sample count to 50.
derive_trait_impl_removed
                        time:   [1.7363 ms 1.7814 ms 1.8291 ms]
                        change: [-9.5009% -6.8148% -3.5745%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

Benchmarking enum_missing: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.8s, enable flat sampling, or reduce sample count to 50.
enum_missing            time:   [1.5038 ms 1.5750 ms 1.6593 ms]
                        change: [-15.154% -11.801% -7.8915%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) high mild
  11 (11.00%) high severe
obi1kenobi commented 2 months ago

Assuming compile times don't get drastically worse, and that we wouldn't have some platform-related limitations by using simd-json, that seems worth shipping! I'd take a cheap 6-20% improvement anytime :)

dmatos2012 commented 2 months ago

I will like to tackle this issue with @Licenser, as to ensure its not being worked on by multiple people:)

dmatos2012 commented 2 months ago

Hi, so along with @Licenser , we created a repo that extracted the key functionalities to allow easy benchmarking for others. Just a cargo bench command away.

My results for last run (didnt include ouput of all of them since it would be too long)

Gnuplot not found, using plotters backend
async_impl_future_equivalence/SIMD/parse
                        time:   [1.2377 ms 1.2453 ms 1.2545 ms]
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe
Benchmarking async_impl_future_equivalence/SERDE/parse: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.9s, enable flat sampling, or reduce sample count to 50.
async_impl_future_equivalence/SERDE/parse
                        time:   [1.3678 ms 1.3811 ms 1.3938 ms]
Found 11 outliers among 100 measurements (11.00%)
  9 (9.00%) high mild
  2 (2.00%) high severe

auto_trait_impl_removed/SIMD/parse
                        time:   [1.4148 ms 1.4170 ms 1.4192 ms]
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
auto_trait_impl_removed/SERDE/parse
                        time:   [1.5048 ms 1.5095 ms 1.5150 ms]
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

Benchmarking broken_rustdoc/SIMD/parse: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.4s, enable flat sampling, or reduce sample count to 50.
broken_rustdoc/SIMD/parse
                        time:   [1.1311 ms 1.1615 ms 1.1874 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
broken_rustdoc/SERDE/parse
                        time:   [1.3656 ms 1.3718 ms 1.3805 ms]
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high severe

constructible_struct_adds_field/SIMD/parse
                        time:   [1.2849 ms 1.2906 ms 1.2973 ms]
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) high mild
  8 (8.00%) high severe
constructible_struct_adds_field/SERDE/parse
                        time:   [1.8033 ms 1.8195 ms 1.8392 ms]
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

Benchmarking constructible_struct_adds_private_field/SIMD/parse: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.2s, enable flat sampling, or reduce sample count to 50.
constructible_struct_adds_private_field/SIMD/parse
                        time:   [1.3073 ms 1.3102 ms 1.3140 ms]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe
constructible_struct_adds_private_field/SERDE/parse
                        time:   [1.8016 ms 1.8033 ms 1.8050 ms]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild

Sometimes results vary slightly more but maybe also @Licenser can share some of his results as well. I thought it would also be nice for others to try to share what they get, and then we can (depending on results) start following the next steps like maybe creating CI runs for it.

As far as the crate you wanted to runaws ec2 this is my result:

Benchmarking test_aws_ec2/SIMD/parse: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 445.6s, or reduce sample count to 10.
test_aws_ec2/SIMD/parse time:   [2.9595 s 2.9653 s 2.9725 s]
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe
Benchmarking test_aws_ec2/SERDE/parse: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 388.0s, or reduce sample count to 10.
test_aws_ec2/SERDE/parse
                        time:   [2.4378 s 2.4407 s 2.4437 s]
obi1kenobi commented 2 months ago

Am I reading the data correctly, that the EC2 crate parses half a second slower with SIMD than without? 👀

dmatos2012 commented 2 months ago

Am I reading the data correctly, that the EC2 crate parses half a second slower with SIMD than without? 👀

Well apparently so, but maybe something with my system as with Licenser's system it was not the case if i recall, so maybe he can chime in later. Thats why I wanted others to also test on that rustdoc to see how is the performance varied.

Licenser commented 2 months ago

Am I reading the data correctly, that the EC2 crate parses half a second slower with SIMD than without? 👀

Yes, the context here is important. Those tests are run on a laptop, where CPUs tend to thermal throttle massively when more power-hungry instructions (aka SIMD) are run.

That's part of why we set up a reproduction repository to get a better feeling about how the people running the tool will be affected; @dmatos2012 can share some numbers from a server too and I'll run them on my own machine later today as well

Licenser commented 2 months ago

Here the results from a Digital Oceans (2 core, 4GB memory) server results-server.txt

obi1kenobi commented 2 months ago

The test crates aren't really the best representative for this kind of performance work. It's good to have a few of them for "just in case" reasons, but they are generally a few orders of magnitude too small to be a realistic stand-in for most real-world Rust crates.

The crates where we do want to demonstrate consistent speedups are things like tokio, aws-sdk-ec2, bevy, clap, serde, etc. Adopting SIMD makes sense if medium-to-large crates like these benefit significantly, while tiny crates (like our test crates) are at worst ~unchanged or even a bit slower.

An additional axis I'd love to explore is how GitHub Actions runners fare with SIMD, on Linux / macOS / Windows. GitHub Actions is where a ton of our users will run cargo-semver-checks on their real-world crates, so it's important that we can present them with a speedup, or at least not-a-slowdown.

Licenser commented 2 months ago

ja copying the 500MB file to the server takes time 😂 , it's a WIP

Licenser commented 2 months ago

but FWIW results from my local machine are already in for the aws crate

aws_sdk_ec2/SIMD/parse  time:   [1.3610 s 1.3683 s 1.3761 s]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
aws_sdk_ec2/SERDE/parse time:   [1.5445 s 1.5497 s 1.5557 s]
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

I'll upload a full run of all tests when they're dune

obi1kenobi commented 2 months ago

Thanks for being super thorough on this, I appreciate it!

I'm a bit surprised the SIMD benefit is only about ~12%. Is there something we might be doing that is pessimizing for SIMD? Could we adapt our approach in some way to make it better for SIMD decoding?

dmatos2012 commented 2 months ago

I will get later the results from the bigger crates you mentioned later, to see what the results are.

It's good to know where the end goal is rather than extracting couple of ms from extra small test_crates json.

We intially thought to start with the smaller ones as those easier to run and debug.

Licenser commented 2 months ago

I'm a bit surprised the SIMD benefit is only about ~12%. Is there something we might be doing that is pessimizing for SIMD? Could we adapt our approach in some way to make it better for SIMD decoding?

A initial one would be using simd-json-derive instead of serde derive but I think that'd mean having a paralell crate for the rustdoc-types which could be challenging but there is a good bit of gains to be had usually - or even going untyped and using the tape representation but that feels like an anti pettern.

I could see some optimisations to be made to simd-json too but I'm not sure how feasable they are

a big thing is that simd-json copies the entire data in a seperate buffer and keeps an additional string buffer (that's lots of size) on a server the go-to way would be to re-use the buffers which armortizes that to a degree for a one of tool that's not an option, but it might be possible to avoid some of that allocation and copying (but there are benchmarks to be had to validate that as we'd trade branches for memory consumption) that'd allos allow us to use borrowd strings and for big crates that would avoid a ton of copying

then there is the option to in-line the utf8 checking better but we'd need to validate first if we catually copy the data for the utf8 checks or if the compiler is smart enough to re-use registers (it often is)

obi1kenobi commented 2 months ago

A initial one would be using simd-json-derive instead of serde derive but I think that'd mean having a paralell crate for the rustdoc-types which could be challenging but there is a good bit of gains to be had usually

If we were able to show a concrete experiment that demonstrates significant benefits across a range of crates, I'd be happy to bring this up with the rustdoc maintainers as an option for upstreaming into the rustdoc-types crate. I obviously can't promise any specific outcome, but they are generally very receptive about improvements and performance so it's certainly worth a shot!

Licenser commented 2 months ago

If there is a chance, it's worth doing the research :) if @dmatos2012 we can go through that experiment together.

dmatos2012 commented 2 months ago

Yes, for sure @Licenser, if we do like with this initial task and I have kinda some guidance looks like something interesting to experiment indeed :)