rust-lang / portable-simd

The testing ground for the future of portable SIMD in Rust
Apache License 2.0
903 stars 81 forks source link

Not using WASM bitmask / all_true instructions #351

Closed stepantubanov closed 1 year ago

stepantubanov commented 1 year ago

Hi,

Not sure if this belongs in here or in compiler repo. Example code (https://godbolt.org/z/4fqcdGqq6):

#![feature(portable_simd)]

use std::simd::{self, SimdPartialEq, ToBitMask};

pub fn to_bitmask(v: &[u8; 16]) -> usize {
    let data = simd::u8x16::from_array(*v);
    let zero = simd::u8x16::splat(0);
    let mask = data.simd_eq(zero);
    mask.to_bitmask() as usize
}

pub fn all_zeros(v: &[u8; 16]) -> bool {
    let data = simd::u8x16::from_array(*v);
    let zero = simd::u8x16::splat(0);
    let mask = data.simd_eq(zero);
    mask.all()
}

Ideally it should've been compiled using i8x16.bitmask and i8x16.all_true instructions.

https://github.com/WebAssembly/simd/blob/main/proposals/simd/SIMD.md#all-lanes-true

Instead, it generated some really slow code extracting individual lanes. Full wasm output:

to_bitmask wasm ``` example::to_bitmask: global.get __stack_pointer i32.const 16 i32.sub drop local.get 0 v128.load 0:p2align=0 v128.const 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 i8x16.eq local.tee 1 i8x16.extract_lane_u 0 i32.const 1 i32.and local.get 1 i8x16.extract_lane_u 1 i32.const 1 i32.and i32.const 1 i32.shl i32.or local.get 1 i8x16.extract_lane_u 2 i32.const 1 i32.and i32.const 2 i32.shl i32.or local.get 1 i8x16.extract_lane_u 3 i32.const 1 i32.and i32.const 3 i32.shl i32.or local.get 1 i8x16.extract_lane_u 4 i32.const 1 i32.and i32.const 4 i32.shl i32.or local.get 1 i8x16.extract_lane_u 5 i32.const 1 i32.and i32.const 5 i32.shl i32.or local.get 1 i8x16.extract_lane_u 6 i32.const 1 i32.and i32.const 6 i32.shl i32.or local.get 1 i8x16.extract_lane_u 7 i32.const 1 i32.and i32.const 7 i32.shl i32.or local.get 1 i8x16.extract_lane_u 8 i32.const 1 i32.and i32.const 8 i32.shl i32.or local.get 1 i8x16.extract_lane_u 9 i32.const 1 i32.and i32.const 9 i32.shl i32.or local.get 1 i8x16.extract_lane_u 10 i32.const 1 i32.and i32.const 10 i32.shl i32.or local.get 1 i8x16.extract_lane_u 11 i32.const 1 i32.and i32.const 11 i32.shl i32.or local.get 1 i8x16.extract_lane_u 12 i32.const 1 i32.and i32.const 12 i32.shl i32.or local.get 1 i8x16.extract_lane_u 13 i32.const 1 i32.and i32.const 13 i32.shl i32.or local.get 1 i8x16.extract_lane_u 14 i32.const 1 i32.and i32.const 14 i32.shl i32.or local.get 1 i8x16.extract_lane_u 15 i32.const 15 i32.shl i32.or i32.const 65535 i32.and end_function ```
all_zeros wasm ``` example::all_zeros: global.get __stack_pointer i32.const 16 i32.sub drop local.get 0 v128.load 0:p2align=0 v128.const 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 i8x16.ne local.tee 1 i8x16.extract_lane_u 0 i32.const 1 i32.and local.get 1 i8x16.extract_lane_u 1 i32.const 1 i32.and i32.const 1 i32.shl i32.or local.get 1 i8x16.extract_lane_u 2 i32.const 1 i32.and i32.const 2 i32.shl i32.or local.get 1 i8x16.extract_lane_u 3 i32.const 1 i32.and i32.const 3 i32.shl i32.or local.get 1 i8x16.extract_lane_u 4 i32.const 1 i32.and i32.const 4 i32.shl i32.or local.get 1 i8x16.extract_lane_u 5 i32.const 1 i32.and i32.const 5 i32.shl i32.or local.get 1 i8x16.extract_lane_u 6 i32.const 1 i32.and i32.const 6 i32.shl i32.or local.get 1 i8x16.extract_lane_u 7 i32.const 1 i32.and i32.const 7 i32.shl i32.or local.get 1 i8x16.extract_lane_u 8 i32.const 1 i32.and i32.const 8 i32.shl i32.or local.get 1 i8x16.extract_lane_u 9 i32.const 1 i32.and i32.const 9 i32.shl i32.or local.get 1 i8x16.extract_lane_u 10 i32.const 1 i32.and i32.const 10 i32.shl i32.or local.get 1 i8x16.extract_lane_u 11 i32.const 1 i32.and i32.const 11 i32.shl i32.or local.get 1 i8x16.extract_lane_u 12 i32.const 1 i32.and i32.const 12 i32.shl i32.or local.get 1 i8x16.extract_lane_u 13 i32.const 1 i32.and i32.const 13 i32.shl i32.or local.get 1 i8x16.extract_lane_u 14 i32.const 1 i32.and i32.const 14 i32.shl i32.or local.get 1 i8x16.extract_lane_u 15 i32.const 15 i32.shl i32.or i32.const 65535 i32.and i32.eqz end_function ```

Meta

rustc --version --verbose:

rustc 1.71.0-nightly (521f4dae1 2023-05-19)
calebzulawski commented 1 year ago

This is the correct place to file this issue, thanks!

In this case, LLVM is lowering to wasm poorly. I'm working on a change that should fix all()/any() first, then I'll investigate to_bitmask.

calebzulawski commented 1 year ago

This is fixed in https://github.com/llvm/llvm-project/commit/8392bf6000ad039bd0e55383d40a05ddf7b4af13, not sure how long until this makes it to nightly

calebzulawski commented 1 year ago

Resolved via https://github.com/llvm/llvm-project/commit/8392bf6000ad039bd0e55383d40a05ddf7b4af13 and rust-lang/rust#114048

stepantubanov commented 1 year ago

@calebzulawski Awesome, thank you! 🚀