Open obi1kenobi opened 2 months ago
super quick first hacky test:
Running benches/read_rustdoc.rs (target/release/deps/read_rustdoc-cecf84416ef1155c)
Benchmarking aws_sdk_ec2: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 324.3s, or reduce sample count to 10.
aws_sdk_ec2 time: [2.7113 s 2.8143 s 2.9515 s]
change: [-10.535% -6.8687% -2.5764%] (p = 0.00 < 0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
2 (2.00%) high mild
6 (6.00%) high severe
Benchmarking associated_items_hidden_from_public_api: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.2s, enable flat sampling, or reduce sample count to 50.
associated_items_hidden_from_public_api
time: [1.4220 ms 1.4288 ms 1.4368 ms]
change: [-20.113% -17.691% -16.064%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
5 (5.00%) high mild
1 (1.00%) high severe
Benchmarking async_impl_future_equivalence: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.4s, enable flat sampling, or reduce sample count to 50.
async_impl_future_equivalence
time: [1.5397 ms 1.5909 ms 1.6484 ms]
change: [-20.642% -16.935% -13.273%] (p = 0.00 < 0.05)
Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
8 (8.00%) high mild
3 (3.00%) high severe
Benchmarking auto_trait_impl_removed: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.9s, enable flat sampling, or reduce sample count to 50.
auto_trait_impl_removed time: [1.6121 ms 1.6854 ms 1.7910 ms]
change: [-14.406% -10.084% -5.5161%] (p = 0.00 < 0.05)
Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
4 (4.00%) high mild
11 (11.00%) high severe
Benchmarking broken_rustdoc: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.2s, enable flat sampling, or reduce sample count to 50.
broken_rustdoc time: [1.4206 ms 1.4333 ms 1.4533 ms]
change: [-20.782% -18.894% -16.928%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
1 (1.00%) low mild
1 (1.00%) high mild
5 (5.00%) high severe
constructible_struct_adds_field
time: [1.9456 ms 1.9870 ms 2.0368 ms]
change: [-20.827% -17.721% -14.686%] (p = 0.00 < 0.05)
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
7 (7.00%) high mild
3 (3.00%) high severe
constructible_struct_adds_private_field
time: [2.0114 ms 2.0590 ms 2.1110 ms]
change: [-14.397% -12.036% -9.5048%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) high mild
3 (3.00%) high severe
Benchmarking counterexample_for_ignoring_doc_hidden_items: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.6s, enable flat sampling, or reduce sample count to 50.
counterexample_for_ignoring_doc_hidden_items
time: [1.4559 ms 1.4859 ms 1.5221 ms]
change: [-20.654% -18.320% -15.948%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
1 (1.00%) high mild
6 (6.00%) high severe
Benchmarking derive_trait_impl_removed: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.1s, enable flat sampling, or reduce sample count to 50.
derive_trait_impl_removed
time: [1.7363 ms 1.7814 ms 1.8291 ms]
change: [-9.5009% -6.8148% -3.5745%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) high mild
2 (2.00%) high severe
Benchmarking enum_missing: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.8s, enable flat sampling, or reduce sample count to 50.
enum_missing time: [1.5038 ms 1.5750 ms 1.6593 ms]
change: [-15.154% -11.801% -7.8915%] (p = 0.00 < 0.05)
Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
3 (3.00%) high mild
11 (11.00%) high severe
Assuming compile times don't get drastically worse, and that we wouldn't have some platform-related limitations by using simd-json
, that seems worth shipping! I'd take a cheap 6-20% improvement anytime :)
I will like to tackle this issue with @Licenser, as to ensure its not being worked on by multiple people:)
Hi, so along with @Licenser , we created a repo that extracted the key functionalities to allow easy benchmarking for others. Just a cargo bench
command away.
My results for last run (didnt include ouput of all of them since it would be too long)
Gnuplot not found, using plotters backend
async_impl_future_equivalence/SIMD/parse
time: [1.2377 ms 1.2453 ms 1.2545 ms]
Found 10 outliers among 100 measurements (10.00%)
3 (3.00%) high mild
7 (7.00%) high severe
Benchmarking async_impl_future_equivalence/SERDE/parse: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.9s, enable flat sampling, or reduce sample count to 50.
async_impl_future_equivalence/SERDE/parse
time: [1.3678 ms 1.3811 ms 1.3938 ms]
Found 11 outliers among 100 measurements (11.00%)
9 (9.00%) high mild
2 (2.00%) high severe
auto_trait_impl_removed/SIMD/parse
time: [1.4148 ms 1.4170 ms 1.4192 ms]
Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild
auto_trait_impl_removed/SERDE/parse
time: [1.5048 ms 1.5095 ms 1.5150 ms]
Found 7 outliers among 100 measurements (7.00%)
1 (1.00%) high mild
6 (6.00%) high severe
Benchmarking broken_rustdoc/SIMD/parse: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.4s, enable flat sampling, or reduce sample count to 50.
broken_rustdoc/SIMD/parse
time: [1.1311 ms 1.1615 ms 1.1874 ms]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
broken_rustdoc/SERDE/parse
time: [1.3656 ms 1.3718 ms 1.3805 ms]
Found 6 outliers among 100 measurements (6.00%)
6 (6.00%) high severe
constructible_struct_adds_field/SIMD/parse
time: [1.2849 ms 1.2906 ms 1.2973 ms]
Found 9 outliers among 100 measurements (9.00%)
1 (1.00%) high mild
8 (8.00%) high severe
constructible_struct_adds_field/SERDE/parse
time: [1.8033 ms 1.8195 ms 1.8392 ms]
Found 12 outliers among 100 measurements (12.00%)
4 (4.00%) high mild
8 (8.00%) high severe
Benchmarking constructible_struct_adds_private_field/SIMD/parse: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.2s, enable flat sampling, or reduce sample count to 50.
constructible_struct_adds_private_field/SIMD/parse
time: [1.3073 ms 1.3102 ms 1.3140 ms]
Found 10 outliers among 100 measurements (10.00%)
4 (4.00%) low mild
2 (2.00%) high mild
4 (4.00%) high severe
constructible_struct_adds_private_field/SERDE/parse
time: [1.8016 ms 1.8033 ms 1.8050 ms]
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) low mild
1 (1.00%) high mild
Sometimes results vary slightly more but maybe also @Licenser can share some of his results as well. I thought it would also be nice for others to try to share what they get, and then we can (depending on results) start following the next steps like maybe creating CI runs for it.
As far as the crate you wanted to runaws ec2
this is my result:
Benchmarking test_aws_ec2/SIMD/parse: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 445.6s, or reduce sample count to 10.
test_aws_ec2/SIMD/parse time: [2.9595 s 2.9653 s 2.9725 s]
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) low mild
2 (2.00%) high mild
3 (3.00%) high severe
Benchmarking test_aws_ec2/SERDE/parse: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 388.0s, or reduce sample count to 10.
test_aws_ec2/SERDE/parse
time: [2.4378 s 2.4407 s 2.4437 s]
Am I reading the data correctly, that the EC2 crate parses half a second slower with SIMD than without? 👀
Am I reading the data correctly, that the EC2 crate parses half a second slower with SIMD than without? 👀
Well apparently so, but maybe something with my system as with Licenser's system it was not the case if i recall, so maybe he can chime in later. Thats why I wanted others to also test on that rustdoc
to see how is the performance varied.
Am I reading the data correctly, that the EC2 crate parses half a second slower with SIMD than without? 👀
Yes, the context here is important. Those tests are run on a laptop, where CPUs tend to thermal throttle massively when more power-hungry instructions (aka SIMD) are run.
That's part of why we set up a reproduction repository to get a better feeling about how the people running the tool will be affected; @dmatos2012 can share some numbers from a server too and I'll run them on my own machine later today as well
Here the results from a Digital Oceans (2 core, 4GB memory) server results-server.txt
The test crates aren't really the best representative for this kind of performance work. It's good to have a few of them for "just in case" reasons, but they are generally a few orders of magnitude too small to be a realistic stand-in for most real-world Rust crates.
The crates where we do want to demonstrate consistent speedups are things like tokio
, aws-sdk-ec2
, bevy
, clap
, serde
, etc. Adopting SIMD makes sense if medium-to-large crates like these benefit significantly, while tiny crates (like our test crates) are at worst ~unchanged or even a bit slower.
An additional axis I'd love to explore is how GitHub Actions runners fare with SIMD, on Linux / macOS / Windows. GitHub Actions is where a ton of our users will run cargo-semver-checks
on their real-world crates, so it's important that we can present them with a speedup, or at least not-a-slowdown.
ja copying the 500MB file to the server takes time 😂 , it's a WIP
but FWIW results from my local machine are already in for the aws crate
aws_sdk_ec2/SIMD/parse time: [1.3610 s 1.3683 s 1.3761 s]
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
aws_sdk_ec2/SERDE/parse time: [1.5445 s 1.5497 s 1.5557 s]
Found 6 outliers among 100 measurements (6.00%)
4 (4.00%) high mild
2 (2.00%) high severe
I'll upload a full run of all tests when they're dune
Thanks for being super thorough on this, I appreciate it!
I'm a bit surprised the SIMD benefit is only about ~12%. Is there something we might be doing that is pessimizing for SIMD? Could we adapt our approach in some way to make it better for SIMD decoding?
I will get later the results from the bigger crates you mentioned later, to see what the results are.
It's good to know where the end goal is rather than extracting couple of ms
from extra small test_crates json
.
We intially thought to start with the smaller ones as those easier to run and debug.
I'm a bit surprised the SIMD benefit is only about ~12%. Is there something we might be doing that is pessimizing for SIMD? Could we adapt our approach in some way to make it better for SIMD decoding?
A initial one would be using simd-json-derive instead of serde derive but I think that'd mean having a paralell crate for the rustdoc-types which could be challenging but there is a good bit of gains to be had usually - or even going untyped and using the tape
representation but that feels like an anti pettern.
I could see some optimisations to be made to simd-json too but I'm not sure how feasable they are
a big thing is that simd-json copies the entire data in a seperate buffer and keeps an additional string buffer (that's lots of size) on a server the go-to way would be to re-use the buffers which armortizes that to a degree for a one of tool that's not an option, but it might be possible to avoid some of that allocation and copying (but there are benchmarks to be had to validate that as we'd trade branches for memory consumption) that'd allos allow us to use borrowd strings and for big crates that would avoid a ton of copying
then there is the option to in-line the utf8 checking better but we'd need to validate first if we catually copy the data for the utf8 checks or if the compiler is smart enough to re-use registers (it often is)
A initial one would be using simd-json-derive instead of serde derive but I think that'd mean having a paralell crate for the rustdoc-types which could be challenging but there is a good bit of gains to be had usually
If we were able to show a concrete experiment that demonstrates significant benefits across a range of crates, I'd be happy to bring this up with the rustdoc maintainers as an option for upstreaming into the rustdoc-types
crate. I obviously can't promise any specific outcome, but they are generally very receptive about improvements and performance so it's certainly worth a shot!
If there is a chance, it's worth doing the research :) if @dmatos2012 we can go through that experiment together.
Yes, for sure @Licenser, if we do like with this initial task and I have kinda some guidance looks like something interesting to experiment indeed :)
Many of the rustdoc JSON files we parse are large — from a few MB to ~500MB in size. In the largest cases, we spend ~5s parsing JSON per
cargo-semver-checks
run.Speeding up JSON parsing by switching from
serde_json
tosimd-json
might be an easy win. Let's check!simd-json
aws-sdk-ec2
cd
into the crate's directory, and use theRUSTDOCFLAGS="-Z unstable-options --document-private-items --document-hidden-items --output-format=json --cap-lints=allow" cargo doc --lib --no-deps
shell command which will putaws-sdk-ec2.json
into thetarget/doc
directory of the workspacedivan
orcriterion
to ensure the perf delta is real and not a measurement artifactcargo test
loads dozens of small JSON files, so you can check thoseThis is a good first issue for someone who is already familiar with Rust and wants to start contributing to this project. It might not be a good first issue for folks new to Rust.