The f32x16 performance varies from run to run. For example, in the following output forward_f32x16 is reported as regressed, while forward_any (which is essentially the same thing) shows no change and is still faster:
forward_f32x8     time:   [35.786 ms 35.958 ms 36.167 ms]
                  change: [+1.0369% +1.4752% +2.1143%] (p = 0.00 < 0.05)
                  Performance has regressed.
forward_f32x16    time:   [23.496 ms 23.512 ms 23.529 ms]
                  change: [+26.183% +26.334% +26.469%] (p = 0.00 < 0.05)
                  Performance has regressed.
forward_f64x4     time:   [76.080 ms 76.117 ms 76.158 ms]
                  change: [+0.0682% +0.1455% +0.2177%] (p = 0.00 < 0.05)
                  Change within noise threshold.
forward_f64x8     time:   [32.219 ms 32.387 ms 32.571 ms]
                  change: [-0.4876% +0.3174% +1.1074%] (p = 0.43 > 0.05)
                  No change in performance detected.
forward_any       time:   [18.615 ms 18.638 ms 18.667 ms]
                  change: [-0.1260% +0.1421% +0.3847%] (p = 0.28 > 0.05)
                  No change in performance detected.
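To put the numbers above side by side, here is a rough back-of-the-envelope sketch in plain Rust. It assumes the middle value of each bracket is the point estimate and that the reported change is simply relative to the previous estimate, which is only an approximation of how the change is computed. The helper names `baseline_ms` and `slowdown_pct` are made up for illustration and are not part of the benchmark code.

```rust
/// Approximate the previous timing from the current timing and the
/// reported relative change (in percent). Hypothetical helper.
fn baseline_ms(current_ms: f64, change_pct: f64) -> f64 {
    current_ms / (1.0 + change_pct / 100.0)
}

/// How much slower `slow_ms` is than `fast_ms`, in percent. Hypothetical helper.
fn slowdown_pct(slow_ms: f64, fast_ms: f64) -> f64 {
    (slow_ms / fast_ms - 1.0) * 100.0
}

fn main() {
    // Middle values copied from the output above.
    let f32x16_ms = 23.512;
    let f32x16_change_pct = 26.334;
    let any_ms = 18.638;

    let previous = baseline_ms(f32x16_ms, f32x16_change_pct);
    println!("implied previous forward_f32x16 time: {previous:.3} ms");
    println!(
        "forward_f32x16 vs forward_any in this run: {:+.1}%",
        slowdown_pct(f32x16_ms, any_ms)
    );
}
```

Under those assumptions, the implied previous forward_f32x16 time comes out at roughly 18.6 ms, about the same as forward_any in this run, and the current gap between the two is about 26%, close to the size of the reported regression. That fits the reading that the two benchmarks usually perform alike and that forward_f32x16 was slower in this particular run.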