rust-lang / portable-simd

The testing ground for the future of portable SIMD in Rust

Add support for masked loads & stores #374

Closed · farnoy closed this 8 months ago

farnoy commented 1 year ago

Resolves the masked loads/stores checklist item from https://github.com/rust-lang/portable-simd/issues/16

Depends on rustc exposing new intrinsics - https://github.com/rust-lang/rust/pull/117953

Added a basic test; do we need to auto-generate tests with macros to cover all types & lane sizes?

farnoy commented 1 year ago

@calebzulawski I pushed out a new version with the changes we discussed.

I've also added temporary toggles for easier benchmarking (USE_BRANCH and USE_BITMASK).

Here's some data from a simple benchmark that reads from one u8 vector, zero-extends it to u16, and writes it back to another vector. The write buffer is small (1300 B) and fully cached; the benchmark is DRAM-bound. This is on Zen 4 (7950X3D) with -C target-cpu=znver4.

  1. USE_BRANCH=true, USE_BITMASK=true: ~45 GB/s (hot loop)
  2. USE_BRANCH=false, USE_BITMASK=true: ~41 GB/s (hot loop)
  3. USE_BRANCH=true, USE_BITMASK=false: 32 GB/s (hot loop)
  4. USE_BRANCH=false, USE_BITMASK=false: 2.4 GB/s (hot loop)

That last one is an interesting outlier: uiCA disagrees that it should be >10x slower. I will investigate these on Sapphire Rapids and Graviton 3. I'm also curious whether there's any way to do i8::try_from(slice.len()).unwrap_or(i8::MAX) more efficiently. There also seem to be too many branches per iteration; perhaps it has something to do with how I defined my input & output vectors?

    let len = bytestring.len();
    output.clear();
    output.reserve(len);

    unsafe {
        // contents are uninitialized until the loop below fills them
        output.set_len(len);
    }

    for i in (0..len).step_by(N) {
        // load_or_default zero-fills the lanes past the end of the slice
        let data = Simd::<u8, N>::load_or_default(&bytestring[i..]);
        let data2: Simd<u16, N> = data.cast();
        // the store is clamped to the slice bounds, so the tail chunk is safe
        data2.masked_store(&mut output[i..], Mask::splat(true));
    }

I've set N=32 for this run, expecting a ymm load that widens into a zmm store.

EDIT: It's also curious that the masks get calculated twice for USE_BITMASK=false, once for the load and separately for the store. It seems to be because of the differing lane widths (u8 vs u16).

farnoy commented 1 year ago

I optimized USE_BITMASK=false to always do the comparison on i8 masks, which avoids the redundant computation. It's a substantial perf win, but it still misbehaves at specific vector sizes.
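
As a rough illustration of that single-comparison idea, here is a minimal sketch (not this PR's internals; paths follow current nightly std::simd): the tail mask is built once at i8 width and cast for the wider u16 store, and using min for the length clamp also sidesteps the i8::try_from question above.

    use std::array;
    use std::simd::cmp::SimdPartialOrd;
    use std::simd::{LaneCount, Mask, Simd, SupportedLaneCount};

    // Requires nightly with #![feature(portable_simd)] at the crate root.
    // Hypothetical sketch: compute the tail mask once at i8 width from the
    // remaining length, then cast it for the wider u16 store, so the lane
    // comparison happens only once. `min` keeps the clamp branch-free,
    // unlike i8::try_from(..).unwrap_or(i8::MAX).
    fn tail_masks<const N: usize>(remaining: usize) -> (Mask<i8, N>, Mask<i16, N>)
    where
        LaneCount<N>: SupportedLaneCount,
    {
        let len = remaining.min(N) as i8; // N is at most 64, so this fits in i8
        let idx = Simd::<i8, N>::from_array(array::from_fn(|i| i as i8));
        let load_mask = idx.simd_lt(Simd::splat(len));
        (load_mask, load_mask.cast::<i16>())
    }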

farnoy commented 11 months ago

I've updated it for the final version of these intrinsics that shipped in nightly, but I have no idea why it fails on CI; it seems to be something about the ToBitmask imports I added.

workingjubilee commented 11 months ago

I've also added temporary toggles for easier benchmarking (USE_BRANCH and USE_BITMASK).

These may be most interesting to keep around via cfg, so that people can rebuild libcore with these using -Zbuild-std, but I have no strong opinion here.
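
A minimal sketch of what that could look like; the cfg name here is made up:

    // Hypothetical cfg-based toggle (the cfg name is invented), flipped when
    // rebuilding core with -Zbuild-std, e.g.
    //   RUSTFLAGS="--cfg simd_masked_use_branch" cargo +nightly build -Zbuild-std
    const USE_BRANCH: bool = cfg!(simd_masked_use_branch);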

I'll take a look at the import problem.

calebzulawski commented 11 months ago

I would like fuzz tests especially; I know it's not enabled, but I'm particularly suspicious of the bitmask implementation.

farnoy commented 11 months ago

I would like fuzz tests especially; I know it's not enabled, but I'm particularly suspicious of the bitmask implementation.

@calebzulawski What do you mean by that, exactly? Do we have facilities in this library already for fuzz testing?

These may be most interesting to keep around via cfg, so that people can rebuild libcore with these using -Zbuild-std, but I have no strong opinion here.

@workingjubilee Is that the approach we want to take, given that it will have an impact on the trait bounds we publish for these new functions? If we settled on USE_BRANCH=true and, especially, on either USE_BITMASK=true or false, we could simplify the trait bounds. Unless it's not a problem if we use a Sealed trait; I don't usually write library code, so I wouldn't know.

I'm not sure if these toggles are needed, or if this API is the final one we should use.

In my testing, even AVX-512 can benefit from USE_BRANCH=true, which you'd think is strange because it generates more code for both cases, but the optimizer is able to unroll loops better when it can promote the operation to an unmasked load. I don't have a good variety of benchmarks though, and I'm not sure of the best way to explore our options here.
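
For context, here's a minimal sketch of that branchy shape, with a scalar tail fallback; the function is illustrative and not this PR's actual implementation:

    use std::simd::{LaneCount, Simd, SupportedLaneCount};

    // Requires nightly with #![feature(portable_simd)] at the crate root.
    // Hypothetical branchy load for u8: full chunks are promoted to a plain
    // unmasked load, which the optimizer can unroll freely; only the tail
    // takes the slower path.
    fn load_branchy<const N: usize>(slice: &[u8]) -> Simd<u8, N>
    where
        LaneCount<N>: SupportedLaneCount,
    {
        if slice.len() >= N {
            // Full chunk: ordinary unmasked load.
            Simd::from_slice(slice)
        } else {
            // Tail: copy the in-bounds prefix and default the rest to zero.
            let mut buf = [0u8; N];
            buf[..slice.len()].copy_from_slice(slice);
            Simd::from_array(buf)
        }
    }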

As for the interface potentially not being the final one: I find that in real code using this, you often end up recomputing the effective Mask that was used for the load/store, since the functions only return the vector, if they return anything at all. This is probably inefficient, because user code might compute the mask slightly differently and the final code ends up with two computations that are effectively the same. To improve on this, we might want to return the mask that was actually used by the library, i.e. the user-provided mask with the effects of bounds-checking applied.
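
One possible shape for that, as a sketch only; the function name and the scalar body are hypothetical, not this PR's API:

    use std::simd::{LaneCount, Mask, Simd, SupportedLaneCount};

    // Requires nightly with #![feature(portable_simd)] at the crate root.
    // Hypothetical signature: a load that also returns the mask it actually
    // used after clamping to the slice bounds, so the caller can reuse it for
    // the matching store instead of recomputing it. Shown for u8 to keep the
    // bounds minimal.
    fn load_or_default_with_mask<const N: usize>(slice: &[u8]) -> (Simd<u8, N>, Mask<i8, N>)
    where
        LaneCount<N>: SupportedLaneCount,
    {
        let len = slice.len().min(N);
        // Effective mask after bounds-checking: lanes [0, len) enabled.
        let bits = if len >= 64 { u64::MAX } else { (1u64 << len) - 1 };
        let mask = Mask::from_bitmask(bits);
        let mut buf = [0u8; N];
        buf[..len].copy_from_slice(&slice[..len]);
        (Simd::from_array(buf), mask)
    }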

calebzulawski commented 11 months ago

Regarding fuzz testing: we have it implemented via wrappers around proptest; you'll see the helpers in the test_helpers crate. This case will be a little more difficult, since it involves pointers.
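
For illustration, a property check along those lines could look roughly like this when written directly against proptest rather than the test_helpers wrappers, assuming the masked_store call shape used earlier in this thread:

    use proptest::prelude::*;
    use std::simd::{Mask, Simd};

    proptest! {
        // Requires nightly with #![feature(portable_simd)] and this PR's
        // masked_store (hypothetical API, slice-then-mask as used above):
        // compare it against a plain scalar loop.
        #[test]
        fn masked_store_matches_scalar(data: [u8; 8], mask_bits: u8) {
            let mask = Mask::<i8, 8>::from_bitmask(mask_bits as u64);
            let v = Simd::from_array(data);

            let mut got = [0u8; 8];
            v.masked_store(&mut got[..], mask);

            let mut expected = [0u8; 8];
            for i in 0..8 {
                if mask.test(i) {
                    expected[i] = data[i];
                }
            }
            prop_assert_eq!(got, expected);
        }
    }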

workingjubilee commented 9 months ago

These may be most interesting to keep around via cfg, so that people can rebuild the libcore with these using -Zbuild-std, but I have no strong opinion here.

@workingjubilee Is that the approach we want to take, given it will have an impact on the trait bounds we publish for these new functions? If we focused on USE_BRANCH=true and especially USE_BITMASK=true/false, we could simplify the trait bounds. Unless it's not a problem if we use a Sealed trait, but I don't usually write library code so I wouldn't know.

Oh, I hadn't thought about that. The Sealed trait pattern still leaves those bounds as visible and thus something a user can depend on, even if we can relax bounds later, so we should eschew anything that affects the bounds.

calebzulawski commented 9 months ago

I think the direction I would like to see these changes go (in this PR or in a new one, your preference) is for the initial implementation to have no x86-specific code, no bitmask tricks, and no extra bounds on the functions, and then to develop it from there.

I don't disagree that there may be better codegen possibilities (though I'd like to see a little more evidence of a particular implementation being faster than another, since the supposedly slower implementation actually has a lower instruction count), but I am getting the feeling that this could be better implemented as improvements to the LLVM intrinsics, or at least to the rustc intrinsics, rather than in library code.

farnoy commented 9 months ago

That's fair. Should we worry about backwards compatibility if we introduce a function now but then add extra bounds later? I know it's an unstable feature for now; I just want to confirm that's fine.

Sorry about the delay too. My priorities at work changed and this is not blocking me for now. I still want to ship this as I'll have other uses for it.

calebzulawski commented 9 months ago

No worries; I've been too busy to focus much on std::simd lately as well. My goal is that there should be effectively no bounds on the functions (which is true of all of the other functions for the most part). This is one of the benefits of fixing the codegen on the LLVM or rustc intrinsic side: there are no additional Rust bounds to expose. However, if that's not possible for some reason, it may still be possible to avoid the bounds but still implement this within std::simd by using the intrinsics directly rather than Simd's functions.

jhorstmann commented 9 months ago

Of note is that LLVM has separate intrinsics for masked loads with an additional len parameter: https://llvm.org/docs/LangRef.html#vector-predication-intrinsics

In my experiments these generate nice code for AVX-512, and, with an all-ones mask and only a len, also good code for AVX2 with 32/64-bit values: https://llvm.godbolt.org/z/Yx6fKM853

But I would agree to start with a simpler version, and maybe initially just offer functions taking a mask parameter.

A pattern I often use in my code is to have a helper function that processes a chunk of data and has a const MASKED: bool generic parameter. Based on that parameter I then use either normal or masked loads, and I call the masked version only for the remainder chunk that is smaller than the vector size. Calculating the load mask is then relatively simple, for example u8::MAX >> (8 - len) for an 8-bit bitmask. Since this is the remainder chunk, len is guaranteed to be less than 8.
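
A minimal sketch of that pattern for a u8 copy kernel; load_or_default and masked_store are the calls proposed in this PR, used as in the benchmark earlier in the thread, and everything else is illustrative:

    use std::simd::{Mask, Simd};

    const N: usize = 8;

    // Requires nightly with #![feature(portable_simd)] at the crate root.
    // One helper, monomorphized twice: unmasked for full chunks and masked
    // only for the remainder chunk, whose length is guaranteed to be < N.
    fn process_chunk<const MASKED: bool>(src: &[u8], dst: &mut [u8]) {
        if MASKED {
            // Remainder chunk: src.len() < 8, so the shift cannot overflow.
            let bits = u8::MAX >> (8 - src.len());
            let mask = Mask::<i8, N>::from_bitmask(bits as u64);
            let v = Simd::<u8, N>::load_or_default(src); // PR API
            v.masked_store(dst, mask); // PR API
        } else {
            let v = Simd::<u8, N>::from_slice(src);
            dst[..N].copy_from_slice(v.as_array());
        }
    }

    fn copy_all(src: &[u8], dst: &mut [u8]) {
        let full = src.len() / N * N;
        for i in (0..full).step_by(N) {
            process_chunk::<false>(&src[i..i + N], &mut dst[i..i + N]);
        }
        if full < src.len() {
            process_chunk::<true>(&src[full..], &mut dst[full..]);
        }
    }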

workingjubilee commented 9 months ago

aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAA

hi just wanted to let you know I sympathize with

Sorry about the delay too.

No worries; I've been too busy to focus much on std::simd lately as well.

anyways, re:

My goal is that there should be effectively no bounds on the functions (which is true of all of the other functions for the most part).

yeah, ideally. I'd rather any bounds be the kind that type inference can easily take care of 99.9% of the time.

farnoy commented 8 months ago

I'll close this in favor of the simplified PR I just opened.

I agree with letting LLVM optimize it. If someone needs performant code today, they probably need to use the unsafe functions and calculate bitmasks themselves.