Add NEON support

msvbg commented 6 months ago

As a continuation of #86, I thought it would be interesting to try to add NEON support to this, given that I haven't written much SIMD code before. It doesn't build continuously on CI yet due to lack of runners, as @james7132 has pointed out, but the tests do pass locally for me on my MBP. Up to you if it's worth merging anyway!

james7132 commented 6 months ago

There should be support for running on ARM via macOS 14 runners. Can you try changing the aarch64 job to run on them and to run the same checks as the x86 jobs?

I'm also pretty sure this is a breaking change, since the MSRV for aarch64 is 1.59, up from the current MSRV of 1.56.

msvbg commented 6 months ago

Hold on, I added an additional job but perhaps that's a little excessive.

msvbg commented 6 months ago

@james7132 You're right about the MSRV, but somehow it passes on 1.56.0 anyway 🤔

james7132 commented 6 months ago

That's certainly odd. Might be a bug with those versions of Rust, or the APIs used here were soft stabilized early.

Could you run the benchmarks and do a comparison against master to ensure this is a performance gain?

msvbg commented 6 months ago

> Could you run the benchmarks and do a comparison against master to ensure this is a performance gain?

This actually does not look like a performance improvement at all, especially for insert:

iter_ones/contains_all_zeros
                        time:   [249.43 µs 249.48 µs 249.55 µs]
                        change: [-1.3684% -1.1830% -1.0116%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
  7 (7.00%) high severe

iter_ones/contains_all_ones
                        time:   [318.65 µs 318.90 µs 319.31 µs]
                        change: [-0.6792% -0.4328% -0.1776%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
  1 (1.00%) low severe
  2 (2.00%) high mild
  12 (12.00%) high severe

iter_ones/all_zeros     time:   [9.0124 µs 9.0205 µs 9.0350 µs]
                        change: [-1.6067% -1.1067% -0.6296%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  10 (10.00%) high severe

iter_ones/sparse        time:   [425.23 µs 425.75 µs 426.50 µs]
                        change: [-2.7981% -0.9138% +0.2436%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe

iter_ones/all_ones      time:   [2.0850 ms 2.0909 ms 2.0956 ms]
                        change: [-0.2407% +0.1276% +0.4604%] (p = 0.49 > 0.05)
                        No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
  4 (4.00%) low severe
  3 (3.00%) high mild
  9 (9.00%) high severe

Benchmarking iter_ones/all_ones #2: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.8s, enable flat sampling, or reduce sample count to 60.
iter_ones/all_ones #2   time:   [1.1386 ms 1.1409 ms 1.1440 ms]
                        change: [+1.3563% +1.7023% +2.0310%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) low severe
  6 (6.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

insert_range/1m         time:   [1.8628 µs 1.8998 µs 1.9302 µs]
                        change: [+5.2290% +11.538% +18.620%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
  14 (14.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high severe

insert/1m               time:   [454.00 µs 497.73 µs 556.27 µs]
                        change: [+69.471% +96.240% +126.39%] (p = 0.00 < 0.05)
                        Performance has regressed.

intersect_with/1m       time:   [2.9109 µs 2.9166 µs 2.9242 µs]
                        change: [+5.8231% +5.9922% +6.1686%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  1 (1.00%) low severe
  10 (10.00%) high mild
  5 (5.00%) high severe

difference_with/1m      time:   [2.9193 µs 2.9292 µs 2.9399 µs]
                        change: [+5.2192% +5.5294% +5.8663%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

union_with/1m           time:   [2.9099 µs 2.9205 µs 2.9325 µs]
                        change: [+1.5792% +2.3179% +3.0397%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

symmetric_difference_with/1m
                        time:   [2.9253 µs 2.9343 µs 2.9442 µs]
                        change: [+5.4782% +5.9696% +6.3993%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  11 (11.00%) high mild

count_ones/1m           time:   [2.0864 µs 2.0897 µs 2.0935 µs]
                        change: [+0.0646% +0.2295% +0.4112%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

clear/1m                time:   [565.88 ns 575.07 ns 584.42 ns]
                        change: [-5.6735% -4.0915% -2.5815%] (p = 0.00 < 0.05)
                        Performance has improved.

Benchmarking grow_and_insert: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.6s, enable flat sampling, or reduce sample count to 60.
grow_and_insert         time:   [1.0669 ms 1.0685 ms 1.0707 ms]
                        change: [-31.067% -27.660% -23.895%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

Will have to take a closer look at what's going on.
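
For anyone reading the verdict lines in the output above, here is a minimal sketch of how Criterion appears to arrive at "improved", "regressed", "within noise threshold" and "no change". It assumes the default settings (significance level 0.05, noise threshold 1%). The `Verdict` enum and `classify` function are hypothetical names for illustration, not part of Criterion's API.

```rust
/// Verdicts matching the four messages in the benchmark output.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Verdict {
    Improved,
    Regressed,
    WithinNoise,
    NoChange,
}

/// Classifies a change from the bounds of its confidence interval (in percent)
/// and its p-value. Thresholds are assumed defaults, not read from any config.
pub fn classify(lower_pct: f64, upper_pct: f64, p_value: f64) -> Verdict {
    const SIGNIFICANCE: f64 = 0.05; // assumed default significance level
    const NOISE_PCT: f64 = 1.0; // assumed default noise threshold (1%)

    // A p-value at or above the significance level means the two runs
    // cannot be told apart statistically.
    if p_value >= SIGNIFICANCE {
        return Verdict::NoChange;
    }
    // The whole interval has to sit outside the noise band to count.
    if lower_pct < -NOISE_PCT && upper_pct < -NOISE_PCT {
        Verdict::Improved
    } else if lower_pct > NOISE_PCT && upper_pct > NOISE_PCT {
        Verdict::Regressed
    } else {
        Verdict::WithinNoise
    }
}
```

Under this rule, the `iter_ones/all_zeros` interval of [-1.6067%, -0.6296%] is reported as within noise even though p = 0.00, because its upper bound does not clear the 1% band.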

msvbg commented 6 months ago

Looking at the assembly briefly, it doesn't look like rustc is doing any clever auto-vectorization on master. For `BitOrAssign`, which would be the significant operation for insertion, I'm seeing fewer instructions on my branch than on master. But perhaps the general-purpose registers are just faster here, for whatever reason. Closing this PR for now.
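
To make the comparison concrete, here is a rough sketch of the two code paths being discussed for `BitOrAssign`: a NEON version on aarch64 and a plain scalar version elsewhere. `Block128` is a hypothetical type for illustration, not the block type from this PR or from fixedbitset. The intrinsics used (`vld1q_u64`, `vorrq_u64`, `vst1q_u64`) come from `std::arch::aarch64`.

```rust
use std::ops::BitOrAssign;

/// A 128-bit block stored as two 64-bit words.
/// Hypothetical type for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Block128(pub [u64; 2]);

impl BitOrAssign for Block128 {
    /// NEON path: load both operands into 128-bit vector registers,
    /// OR them with one instruction, and store the result back.
    #[cfg(target_arch = "aarch64")]
    #[inline]
    fn bitor_assign(&mut self, rhs: Self) {
        use std::arch::aarch64::{vld1q_u64, vorrq_u64, vst1q_u64};
        // SAFETY: both pointers refer to two valid, initialized u64 values.
        // This sketch assumes NEON is available, as it is on common aarch64
        // targets such as aarch64-apple-darwin.
        unsafe {
            let a = vld1q_u64(self.0.as_ptr());
            let b = vld1q_u64(rhs.0.as_ptr());
            vst1q_u64(self.0.as_mut_ptr(), vorrq_u64(a, b));
        }
    }

    /// Scalar path: two 64-bit ORs in general-purpose registers.
    #[cfg(not(target_arch = "aarch64"))]
    #[inline]
    fn bitor_assign(&mut self, rhs: Self) {
        self.0[0] |= rhs.0[0];
        self.0[1] |= rhs.0[1];
    }
}

/// ORs every block of `src` into the matching block of `dst`.
pub fn union_with(dst: &mut [Block128], src: &[Block128]) {
    for (d, s) in dst.iter_mut().zip(src) {
        *d |= *s;
    }
}
```

A single-bit insert only touches one word, so the vector path has to load and store a full 128-bit block where the scalar path needs one 64-bit OR. That might be part of why fewer instructions did not translate into a speedup, though this is a guess rather than something measured here.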

james7132 commented 6 months ago

@msvbg what were your power settings on your MacBook Pro when running these benchmarks? These numbers don't look right to me given what was changed.

msvbg commented 6 months ago

Re-ran the benches with `RUSTFLAGS="-C target-feature=+neon" cargo bench -- --baseline master`, ensuring that my Mac is in "high power" mode. Still not a very convincing perf improvement, but there's no longer a 96% regression in the insert benchmark.

iter_ones/contains_all_zeros
                        time:   [249.47 µs 249.67 µs 250.10 µs]
                        change: [-2.7610% -2.2502% -1.7630%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) high mild
  8 (8.00%) high severe

iter_ones/contains_all_ones
                        time:   [318.56 µs 318.73 µs 319.08 µs]
                        change: [-2.1921% -1.6749% -1.1876%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe

iter_ones/all_zeros     time:   [9.1568 µs 9.1673 µs 9.1823 µs]
                        change: [-0.5514% -0.1511% +0.2790%] (p = 0.48 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

iter_ones/sparse        time:   [432.22 µs 433.60 µs 435.04 µs]
                        change: [+1.9273% +2.3159% +2.6537%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
  8 (8.00%) low severe
  7 (7.00%) high mild
  3 (3.00%) high severe

iter_ones/all_ones      time:   [2.0885 ms 2.0905 ms 2.0926 ms]
                        change: [-0.4081% -0.1119% +0.1445%] (p = 0.46 > 0.05)
                        No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) low severe
  1 (1.00%) low mild
  9 (9.00%) high mild
  3 (3.00%) high severe

Benchmarking iter_ones/all_ones #2: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.7s, enable flat sampling, or reduce sample count to 60.
iter_ones/all_ones #2   time:   [1.1291 ms 1.1310 ms 1.1326 ms]
                        change: [-0.8895% -0.5608% -0.2625%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild

insert_range/1m         time:   [2.5106 µs 2.5169 µs 2.5222 µs]
                        change: [+12.591% +19.891% +26.866%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 19 outliers among 100 measurements (19.00%)
  15 (15.00%) low severe
  2 (2.00%) high mild
  2 (2.00%) high severe

insert/1m               time:   [382.28 µs 382.83 µs 383.78 µs]
                        change: [-2.1876% -2.0004% -1.7855%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

intersect_with/1m       time:   [2.7413 µs 2.7423 µs 2.7437 µs]
                        change: [-0.4427% -0.2778% -0.1140%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

difference_with/1m      time:   [2.7610 µs 2.7674 µs 2.7770 µs]
                        change: [-0.3222% +0.1216% +0.5720%] (p = 0.59 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

union_with/1m           time:   [2.7567 µs 2.7603 µs 2.7637 µs]
                        change: [-0.2862% -0.1305% +0.0254%] (p = 0.11 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  7 (7.00%) high mild
  4 (4.00%) high severe

symmetric_difference_with/1m
                        time:   [2.7729 µs 2.7744 µs 2.7763 µs]
                        change: [+0.5935% +0.7573% +0.9161%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

count_ones/1m           time:   [2.0761 µs 2.0781 µs 2.0806 µs]
                        change: [+0.0409% +0.2089% +0.3783%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

clear/1m                time:   [595.17 ns 607.78 ns 619.59 ns]
                        change: [-3.0057% -1.0915% +0.8554%] (p = 0.31 > 0.05)
                        No change in performance detected.

Benchmarking grow_and_insert: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.2s, enable flat sampling, or reduce sample count to 60.
grow_and_insert         time:   [1.0632 ms 1.0644 ms 1.0658 ms]
                        change: [-0.3564% -0.1129% +0.0930%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 15 outliers among 100 measurements (15.00%)
  6 (6.00%) high mild
  9 (9.00%) high severe
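
A side note on the `-C target-feature=+neon` flag used for this second run: as far as I know, NEON is already enabled by default on targets like aarch64-apple-darwin, so the flag may not have changed the generated code, and the power mode is the more likely explanation for the difference. A small sketch for checking what a given build was compiled with follows. Both function names are hypothetical helpers for illustration.

```rust
/// True when NEON is statically enabled for this build, either through
/// `-C target-feature=+neon` or because the target enables it by default.
pub const fn neon_statically_enabled() -> bool {
    cfg!(all(target_arch = "aarch64", target_feature = "neon"))
}

/// Hypothetical helper: names the code path a cfg-gated crate would compile in.
pub fn selected_backend() -> &'static str {
    if neon_statically_enabled() {
        "neon"
    } else {
        "scalar"
    }
}
```

Printing `selected_backend()` once from a bench binary built with and without the flag would show whether the two runs actually compared different code paths.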

petgraph / fixedbitset

Add NEON support #115