Closed msvbg closed 6 months ago
There should be support for running on ARM via macOS 14 runners. Can you try changing the aarch64 job to run on them and to run the same checks as the x86 jobs?
I'm also pretty sure this is a breaking change, since the MSRV for aarch64 is 1.59, up from the current MSRV of 1.56.
Hold on, I added an additional job but perhaps that's a little excessive.
@james7132 You're right about the MSRV, but somehow it passes on 1.56.0 anyway 🤔
That's certainly odd. Might be a bug with those versions of Rust, or the APIs used here were soft stabilized early.
Could you run the benchmarks and do a comparison against master to ensure this is a performance gain?
This actually does not look like a performance improvement at all, especially for `insert`:
iter_ones/contains_all_zeros
time: [249.43 µs 249.48 µs 249.55 µs]
change: [-1.3684% -1.1830% -1.0116%] (p = 0.00 < 0.05)
Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
1 (1.00%) low mild
7 (7.00%) high mild
7 (7.00%) high severe
iter_ones/contains_all_ones
time: [318.65 µs 318.90 µs 319.31 µs]
change: [-0.6792% -0.4328% -0.1776%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
1 (1.00%) low severe
2 (2.00%) high mild
12 (12.00%) high severe
iter_ones/all_zeros time: [9.0124 µs 9.0205 µs 9.0350 µs]
change: [-1.6067% -1.1067% -0.6296%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
2 (2.00%) low mild
1 (1.00%) high mild
10 (10.00%) high severe
iter_ones/sparse time: [425.23 µs 425.75 µs 426.50 µs]
change: [-2.7981% -0.9138% +0.2436%] (p = 0.34 > 0.05)
No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
5 (5.00%) high mild
7 (7.00%) high severe
iter_ones/all_ones time: [2.0850 ms 2.0909 ms 2.0956 ms]
change: [-0.2407% +0.1276% +0.4604%] (p = 0.49 > 0.05)
No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
4 (4.00%) low severe
3 (3.00%) high mild
9 (9.00%) high severe
Benchmarking iter_ones/all_ones #2: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.8s, enable flat sampling, or reduce sample count to 60.
iter_ones/all_ones #2 time: [1.1386 ms 1.1409 ms 1.1440 ms]
change: [+1.3563% +1.7023% +2.0310%] (p = 0.00 < 0.05)
Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
3 (3.00%) low severe
6 (6.00%) low mild
1 (1.00%) high mild
2 (2.00%) high severe
insert_range/1m time: [1.8628 µs 1.8998 µs 1.9302 µs]
change: [+5.2290% +11.538% +18.620%] (p = 0.00 < 0.05)
Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
14 (14.00%) low severe
2 (2.00%) low mild
1 (1.00%) high severe
insert/1m time: [454.00 µs 497.73 µs 556.27 µs]
change: [+69.471% +96.240% +126.39%] (p = 0.00 < 0.05)
Performance has regressed.
intersect_with/1m time: [2.9109 µs 2.9166 µs 2.9242 µs]
change: [+5.8231% +5.9922% +6.1686%] (p = 0.00 < 0.05)
Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
1 (1.00%) low severe
10 (10.00%) high mild
5 (5.00%) high severe
difference_with/1m time: [2.9193 µs 2.9292 µs 2.9399 µs]
change: [+5.2192% +5.5294% +5.8663%] (p = 0.00 < 0.05)
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
4 (4.00%) high mild
2 (2.00%) high severe
union_with/1m time: [2.9099 µs 2.9205 µs 2.9325 µs]
change: [+1.5792% +2.3179% +3.0397%] (p = 0.00 < 0.05)
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
4 (4.00%) high mild
2 (2.00%) high severe
symmetric_difference_with/1m
time: [2.9253 µs 2.9343 µs 2.9442 µs]
change: [+5.4782% +5.9696% +6.3993%] (p = 0.00 < 0.05)
Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
11 (11.00%) high mild
count_ones/1m time: [2.0864 µs 2.0897 µs 2.0935 µs]
change: [+0.0646% +0.2295% +0.4112%] (p = 0.01 < 0.05)
Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
5 (5.00%) high mild
4 (4.00%) high severe
clear/1m time: [565.88 ns 575.07 ns 584.42 ns]
change: [-5.6735% -4.0915% -2.5815%] (p = 0.00 < 0.05)
Performance has improved.
Benchmarking grow_and_insert: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.6s, enable flat sampling, or reduce sample count to 60.
grow_and_insert time: [1.0669 ms 1.0685 ms 1.0707 ms]
change: [-31.067% -27.660% -23.895%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
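For anyone reading the verdict lines above, here is a minimal sketch of how they appear to follow from the reported numbers. It assumes Criterion's default significance level (0.05) and noise threshold (1%); the `Verdict` enum and `verdict` function are hypothetical names for illustration, not Criterion's API.

```rust
/// Hypothetical labels mirroring the four verdict lines in the output above.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    NoChange,    // "No change in performance detected."
    WithinNoise, // "Change within noise threshold."
    Improved,    // "Performance has improved."
    Regressed,   // "Performance has regressed."
}

/// Classify a change from the bounds of its confidence interval
/// (as fractions, so -1.3684% is -0.013684) and the p-value.
/// Assumption: default significance level 0.05 and noise threshold 0.01.
pub fn verdict(lower: f64, upper: f64, p_value: f64) -> Verdict {
    const SIGNIFICANCE_LEVEL: f64 = 0.05;
    const NOISE_THRESHOLD: f64 = 0.01;

    // Not statistically significant: report no change at all.
    if p_value >= SIGNIFICANCE_LEVEL {
        return Verdict::NoChange;
    }
    // Significant, but only counted if the whole interval clears the noise band.
    if upper < -NOISE_THRESHOLD {
        Verdict::Improved
    } else if lower > NOISE_THRESHOLD {
        Verdict::Regressed
    } else {
        Verdict::WithinNoise
    }
}
```

Under these assumptions, `verdict(-0.016067, -0.006296, 0.0)` gives `WithinNoise`, matching iter_ones/all_zeros, because the upper bound sits inside the 1% band even though the p-value is significant.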
Will have to take a closer look at what's going on.
Looking at the assembly briefly, it doesn't look like rustc is doing any clever auto-vectorization on master. For `BitOrAssign`, which would be the significant operation for insertion, I'm seeing fewer instructions on my branch than on master. But perhaps the ordinary registers are just faster here, for whatever reason. Closing this PR for now.
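To make the comparison concrete, here is a rough sketch of the two shapes of OR-assign being discussed: a scalar version on general-purpose registers and a NEON version. It is not the code from this branch; `Block`, `or_assign_scalar`, and `or_assign_simd` are hypothetical names, and the 128-bit block layout is an assumption. The intrinsics (`vld1q_u64`, `vorrq_u64`, `vst1q_u64`) are the ones in `std::arch::aarch64`.

```rust
/// Hypothetical 128-bit block of bits; an assumption for illustration,
/// not the crate's actual block type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Block(pub [u64; 2]);

/// Scalar version: two 64-bit ORs on general-purpose registers.
pub fn or_assign_scalar(dst: &mut Block, src: &Block) {
    dst.0[0] |= src.0[0];
    dst.0[1] |= src.0[1];
}

/// NEON version: load both blocks into 128-bit vector registers,
/// OR them with a single instruction, and store the result back.
#[cfg(target_arch = "aarch64")]
pub fn or_assign_simd(dst: &mut Block, src: &Block) {
    use std::arch::aarch64::{vld1q_u64, vorrq_u64, vst1q_u64};
    // SAFETY: NEON is part of the baseline for Rust's standard aarch64
    // targets, and both pointers refer to two valid, initialized u64 values.
    unsafe {
        let a = vld1q_u64(dst.0.as_ptr());
        let b = vld1q_u64(src.0.as_ptr());
        vst1q_u64(dst.0.as_mut_ptr(), vorrq_u64(a, b));
    }
}

/// Fallback so the sketch also compiles on non-ARM targets.
#[cfg(not(target_arch = "aarch64"))]
pub fn or_assign_simd(dst: &mut Block, src: &Block) {
    or_assign_scalar(dst, src);
}
```

The vector version issues one OR instead of two, but it also has to move data in and out of the vector registers, which is one plausible reason a lower instruction count may not translate into a faster benchmark.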
@msvbg what were your power settings on your MacBook Pro when running these benchmarks? These numbers don't look right to me given what was changed.
Re-ran the benches with `RUSTFLAGS="-C target-feature=+neon" cargo bench -- --baseline master`, ensuring that my Mac is in "high power" mode. Still not a very convincing perf improvement, but there's no longer a 96% regression in the insert benchmark.
iter_ones/contains_all_zeros
time: [249.47 µs 249.67 µs 250.10 µs]
change: [-2.7610% -2.2502% -1.7630%] (p = 0.00 < 0.05)
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
2 (2.00%) high mild
8 (8.00%) high severe
iter_ones/contains_all_ones
time: [318.56 µs 318.73 µs 319.08 µs]
change: [-2.1921% -1.6749% -1.1876%] (p = 0.00 < 0.05)
Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
4 (4.00%) high mild
9 (9.00%) high severe
iter_ones/all_zeros time: [9.1568 µs 9.1673 µs 9.1823 µs]
change: [-0.5514% -0.1511% +0.2790%] (p = 0.48 > 0.05)
No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
1 (1.00%) low severe
2 (2.00%) low mild
4 (4.00%) high mild
3 (3.00%) high severe
iter_ones/sparse time: [432.22 µs 433.60 µs 435.04 µs]
change: [+1.9273% +2.3159% +2.6537%] (p = 0.00 < 0.05)
Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
8 (8.00%) low severe
7 (7.00%) high mild
3 (3.00%) high severe
iter_ones/all_ones time: [2.0885 ms 2.0905 ms 2.0926 ms]
change: [-0.4081% -0.1119% +0.1445%] (p = 0.46 > 0.05)
No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
3 (3.00%) low severe
1 (1.00%) low mild
9 (9.00%) high mild
3 (3.00%) high severe
Benchmarking iter_ones/all_ones #2: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.7s, enable flat sampling, or reduce sample count to 60.
iter_ones/all_ones #2 time: [1.1291 ms 1.1310 ms 1.1326 ms]
change: [-0.8895% -0.5608% -0.2625%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) low mild
2 (2.00%) high mild
insert_range/1m time: [2.5106 µs 2.5169 µs 2.5222 µs]
change: [+12.591% +19.891% +26.866%] (p = 0.00 < 0.05)
Performance has regressed.
Found 19 outliers among 100 measurements (19.00%)
15 (15.00%) low severe
2 (2.00%) high mild
2 (2.00%) high severe
insert/1m time: [382.28 µs 382.83 µs 383.78 µs]
change: [-2.1876% -2.0004% -1.7855%] (p = 0.00 < 0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
3 (3.00%) high mild
5 (5.00%) high severe
intersect_with/1m time: [2.7413 µs 2.7423 µs 2.7437 µs]
change: [-0.4427% -0.2778% -0.1140%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) high mild
2 (2.00%) high severe
difference_with/1m time: [2.7610 µs 2.7674 µs 2.7770 µs]
change: [-0.3222% +0.1216% +0.5720%] (p = 0.59 > 0.05)
No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) high mild
3 (3.00%) high severe
union_with/1m time: [2.7567 µs 2.7603 µs 2.7637 µs]
change: [-0.2862% -0.1305% +0.0254%] (p = 0.11 > 0.05)
No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
7 (7.00%) high mild
4 (4.00%) high severe
symmetric_difference_with/1m
time: [2.7729 µs 2.7744 µs 2.7763 µs]
change: [+0.5935% +0.7573% +0.9161%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
count_ones/1m time: [2.0761 µs 2.0781 µs 2.0806 µs]
change: [+0.0409% +0.2089% +0.3783%] (p = 0.01 < 0.05)
Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) low mild
1 (1.00%) high mild
1 (1.00%) high severe
clear/1m time: [595.17 ns 607.78 ns 619.59 ns]
change: [-3.0057% -1.0915% +0.8554%] (p = 0.31 > 0.05)
No change in performance detected.
Benchmarking grow_and_insert: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.2s, enable flat sampling, or reduce sample count to 60.
grow_and_insert time: [1.0632 ms 1.0644 ms 1.0658 ms]
change: [-0.3564% -0.1129% +0.0930%] (p = 0.34 > 0.05)
No change in performance detected.
Found 15 outliers among 100 measurements (15.00%)
6 (6.00%) high mild
9 (9.00%) high severe
As a continuation of #86, I thought it would be interesting to try to add NEON support to this, given that I haven't written much SIMD code before. It doesn't build continuously on CI yet due to lack of runners, as @james7132 has pointed out, but the tests do pass locally for me on my MBP. Up to you if it's worth merging anyway!