rust-lang / stdarch

Rust's standard library vendor-specific APIs and run-time feature detection
https://doc.rust-lang.org/stable/core/arch/
Apache License 2.0

Implement boolean vectors #185

Closed gnzlbg closed 6 years ago

gnzlbg commented 7 years ago

I've come around (thanks @hsivonen for being persuasive) and I think it makes sense to implement boolean vectors as part of stdsimd.

I think we could:

I think that none of this must block stabilization (although @hsivonen might disagree). First, IIUC the timeline is as follows: someday somebody will integrate this into core, and stdsimd will be shipped to nightly users behind a feature flag. We should try hard not to break things from then on, but we can technically still break some things (like changing the return type of an unstable intrinsic to use boolean vectors). One release cycle later we will want to start stabilizing things. For that we will probably have to whitelist intrinsics as stable on a case-by-case basis. Once we have done that, we need to write an RFC listing all those intrinsics and submit it to the RFC repo.

If we feel that boolean vectors aren't ready by then we don't stabilize them nor any intrinsic that we think might want to use them in the future. But I'd like to defer that discussion until then. For what we know boolean vectors might turn out to be ready, and we might want to just ship them.

Thoughts?

BurntSushi commented 7 years ago

I'm not convinced that we should be adding boolean vectors to the vendor intrinsics. Vendor intrinsics are supposed to be a low-level interface to vendor-specific APIs, and none of the vendor-specific APIs (at least for Intel) have a concept of a boolean vector. One may be implied by the stated contracts of certain vendor intrinsics, but that means we would need to go through all of them and vet them ourselves. We're already trying to improve the state of the world a little bit with more granular integer vector types, but only because others have (for the most part) already done that work for us.

I grant that boolean vectors may be a nice option for the portable API.

hsivonen commented 7 years ago

A simd PR is probably of interest when implementing this.

I think that none of this must block stabilization (although @hsivonen might disagree).

It seems like a problem if the return type of the comparison methods or vendor operations changes on stable.

In addition to what the initial comment says, I think it would be good to have API surface for safely bitcasting bNxM to uNxM and iNxM (possibly even fNxM). The reverse should be unsafe, obviously.

hsivonen commented 7 years ago

none of the vendor specific APIs (at least for Intel) have a concept of a boolean vector

Every operation (including in the Intel case) whose output is defined to set all bits of a given lane to either zero or one conceptually returns a boolean vector even if the concept isn't a stated part of the Intel C API.

As for inputs, safe and convenient bitcasting from bNxM to uNxM and iNxM is needed so that boolean vectors don't hinder their usage as inputs to operations that don't care about the precondition that each lane is either all-zero or all-one.
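The all-bits-set convention described above can be modeled in scalar Rust. This is a minimal sketch of one lane's semantics (the helper names are made up for illustration): a comparison yields either all-zero or all-one bits, and bitcasting that mask back to an integer lane lets it be used directly with bitwise operations.

```rust
// Scalar model of one lane of a "boolean vector": a comparison sets
// all bits of the lane (here, a u16) to either 0 or 1.
fn lane_eq(a: u16, b: u16) -> u16 {
    if a == b { 0xFFFF } else { 0x0000 }
}

// Viewing the boolean lane as an integer bitmask is free: the mask
// can be combined with bitwise AND/OR to select between values.
fn lane_select(mask: u16, on_true: u16, on_false: u16) -> u16 {
    (mask & on_true) | (!mask & on_false)
}

fn main() {
    assert_eq!(lane_eq(3, 3), 0xFFFF);
    assert_eq!(lane_eq(3, 4), 0x0000);
    assert_eq!(lane_select(lane_eq(3, 3), 7, 9), 7);
    assert_eq!(lane_select(lane_eq(3, 4), 7, 9), 9);
    println!("ok");
}
```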

gnzlbg commented 6 years ago

@hsivonen

It seems like a problem if the return type of the comparison methods or vendor operations changes on stable.

What I was thinking is not stabilizing those intrinsics where we are unsure if we might want to make them return boolean vectors. That would mean that one must use rust nightly to use those, and the rust nightly code might break if we switch them to boolean vectors.

@BurntSushi

but that means we need to go through all of them and vet them ourselves.

Yes, that is what it would mean.

So I wanted to do this, but before doing so, I wanted to cheat a bit and look at the simd crate for examples of intrinsics where boolean vector types are used. The simd crate has the following boolean types:

But... I cannot find a single intrinsic in the whole simd crate that uses them.

The simd crate defines them as something extra, and implements some portable operations for them, but doesn't really use them. So... I don't really see the point of adding them if they are not used as the return type of some intrinsics.

@hsivonen is this correct? If so, I don't understand why they can't be provided by a different crate.

If what you wanted is for them to be used as the return type of some intrinsics, then I think you need to make a better case and tell us that a sufficient number of intrinsics require them (e.g. by reviewing the x86 SIMD intrinsics and letting us know what you find).

hsivonen commented 6 years ago

The simd crate defines them as something extra, and implements some portable operations for them, but doesn't really use them. So... I don't really see the point of adding them if they are not used as the return type of some intrinsics.

@hsivonen is this correct?

No. The simd crate uses them as return types for comparisons. The corresponding intrinsics are simd_le, simd_eq, etc.

hsivonen commented 6 years ago

As for examples of vendor intrinsics that platform-intrinsics (simd_le, simd_eq, etc.) don't abstract over but that logically return boolean vectors, a quick look at the SSE family yields NaN checks: _mm_cmpord_ps.

gnzlbg commented 6 years ago

As for examples of vendor intrinsics that platform-intrinsics (simd_le, simd_eq, etc.) don't abstract over but that logically return boolean vectors, a quick look at the SSE family yields NaN checks: _mm_cmpord_ps.

It would be nice to have a comprehensive list so that we can make an informed decision.

gnzlbg commented 6 years ago

Boolean vectors have been implemented in #338. Only b8xN types are provided (e.g. wider types like b32x4 are not provided).

I've taken a look at the assembly generated for the SSE2 _mm_cmpeq_epi16 intrinsic

// Compares packed 16-bit integers in a and b for equality
__m128i _mm_cmpeq_epi16 (__m128i a, __m128i b)

which returns an __m128i that, when transmuted into an i16x8, has lanes that each contain either 0x0000 (false) or 0xFFFF (true).

First, I considered these two cases:

pub unsafe fn foo(x: i16) -> __m128i {
    let a = _mm_setr_epi16(x, 1, 2, 3, 4, 5, 6, 7);
    let b = _mm_setr_epi16(7, x, 2, 4, 3, 2, 1, 0);
    _mm_cmpeq_epi16(a, b)
}

pub unsafe fn bar(x: i16) -> i16x8 {
    let a = i16x8::new(x, 1, 2, 3, 4, 5, 6, 7);
    let b = i16x8::new(7, x, 2, 4, 3, 2, 1, 0);
    _mm_cmpeq_epi16(a.into_bits(), b.into_bits()).into_bits()
}

which produce identical assembly - i16x8 has no overhead, as expected, over __m128i. Then I switched bar to output only whether the third element is true or false, and added another function that does the same using boolean vectors:

pub unsafe fn bar2(x: i16) -> bool {
    let a = i16x8::new(x, 1, 2, 3, 4, 5, 6, 7);
    let b = i16x8::new(7, x, 2, 4, 3, 2, 1, 0);
    let b: i16x8 = _mm_cmpeq_epi16(a.into_bits(), b.into_bits()).into_bits();
    b.extract(3) == 0xffff_i16
}

pub unsafe fn baz(x: i16) -> bool {
    let a = i16x8::new(x, 1, 2, 3, 4, 5, 6, 7);
    let b = i16x8::new(7, x, 2, 4, 3, 2, 1, 0);
    let b = b8x8::from_bits(i8x8::from(i16x8::from_bits(_mm_cmpeq_epi16(a.into_bits(), b.into_bits()))));
    b.extract(3)
}

The functions bar2 and baz produce identical assembly as well. If we switch these functions to return the vectors, then the assembly differs slightly, because the types being returned are different, but they both look equally expensive.

@rkruppe pointed out that LLVM comparisons actually return <N x i1>, but the best we can do in Rust is probably to store them as <N x i8>. AFAICT the b8xN vectors should be enough, but if we ever need wider boolean vectors we can always add them later.
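The representation trade-off under discussion can be made concrete with two hypothetical struct layouts (the type names are illustrative, not stdsimd API): a b8x8-style type with one-byte lanes occupies 64 bits in memory, while a b16x8-style type matching the width of a u16x8 comparison result occupies a full 128-bit register's worth.

```rust
// Hypothetical in-memory layouts illustrating the width mismatch:
// LLVM's <8 x i1> has no stable memory form, so a Rust boolean vector
// must pick a concrete lane width.
#[repr(C)]
struct B8x8([u8; 8]); // 8-bit lanes: 64 bits total

#[repr(C)]
struct B16x8([u16; 8]); // 16-bit lanes: 128 bits, matches u16x8

fn main() {
    assert_eq!(std::mem::size_of::<B8x8>(), 8);
    assert_eq!(std::mem::size_of::<B16x8>(), 16);
    println!("ok");
}
```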

hsivonen commented 6 years ago

Boolean vectors have been implemented in #338.

Thank you!

Only b8xN types are provided (e.g. wider types like b32x4 are not provided).

Why? I ported encoding_rs to stdsimd. The performance of the parts that use boolean vectors regressed significantly.

$ cargo benchcmp --threshold 25 --regressions trunk_2018-03-08.txt stdsimd_2018-03-08.txt 
 name                                               trunk_2018-03-08.txt ns/iter  stdsimd_2018-03-08.txt ns/iter  diff ns/iter  diff % 
 bench_encode_from_utf16_en                         5,730 (11294 MB/s)            7,836 (8259 MB/s)                      2,106  36.75% 
 bench_encode_from_utf16_en_windows_1252            6,527 (9925 MB/s)             8,434 (7681 MB/s)                      1,907  29.22% 
 bench_encode_from_utf16_jquery                     6,347 (13661 MB/s)            8,866 (9779 MB/s)                      2,519  39.69% 
 bench_encode_from_utf16_jquery_windows_1252        5,853 (14814 MB/s)            7,923 (10943 MB/s)                     2,070  35.37% 
 bench_mem_check_utf16_for_latin1_and_bidi_de_1000  44,481 (22481 MB/s)           73,276 (13647 MB/s)                   28,795  64.74% 
 bench_mem_check_utf16_for_latin1_and_bidi_ja_1000  125,719 (7954 MB/s)           183,051 (5462 MB/s)                   57,332  45.60% 
 bench_mem_check_utf16_for_latin1_and_bidi_th_1000  89,725 (11145 MB/s)           162,244 (6163 MB/s)                   72,519  80.82% 
 bench_mem_convert_utf16_to_str_ascii_1000          43,173 (23162 MB/s)           59,020 (16943 MB/s)                   15,847  36.71% 
 bench_mem_convert_utf16_to_utf8_ascii_1000         40,040 (24975 MB/s)           55,908 (17886 MB/s)                   15,868  39.63% 
 bench_mem_copy_basic_latin_to_ascii_1000           34,685 (28830 MB/s)           51,144 (19552 MB/s)                   16,459  47.45% 
 bench_mem_copy_basic_latin_to_ascii_16             1,308 (12232 MB/s)            1,669 (9586 MB/s)                        361  27.60% 
 bench_mem_ensure_utf16_validity_ascii_1000         49,748 (20101 MB/s)           71,781 (13931 MB/s)                   22,033  44.29% 
 bench_mem_ensure_utf16_validity_bmp_1000           49,752 (20099 MB/s)           71,720 (13943 MB/s)                   21,968  44.16% 
 bench_mem_ensure_utf16_validity_latin1_1000        49,748 (20101 MB/s)           72,688 (13757 MB/s)                   22,940  46.11% 
 bench_mem_is_basic_latin_false_15                  1,333 (11252 MB/s)            1,935 (7751 MB/s)                        602  45.16% 
 bench_mem_is_basic_latin_true_1000                 14,957 (66858 MB/s)           23,125 (43243 MB/s)                    8,168  54.61% 
 bench_mem_is_str_latin1_true_3                     1,024 (1464 MB/s)             1,936 (774 MB/s)                         912  89.06% 
 bench_mem_is_str_latin1_true_30                    3,558 (4215 MB/s)             4,563 (3287 MB/s)                      1,005  28.25% 
 bench_mem_is_utf16_bidi_de_1000                    38,694 (25843 MB/s)           71,216 (14041 MB/s)                   32,522  84.05% 
 bench_mem_is_utf16_bidi_ja_1000                    125,718 (7954 MB/s)           183,042 (5463 MB/s)                   57,324  45.60% 
 bench_mem_is_utf16_bidi_ru_1000                    61,462 (16270 MB/s)           96,280 (10386 MB/s)                   34,818  56.65% 
 bench_mem_is_utf16_bidi_th_1000                    89,466 (11177 MB/s)           162,250 (6163 MB/s)                   72,784  81.35% 
 bench_mem_is_utf16_latin1_true_1000                14,883 (67190 MB/s)           25,335 (39471 MB/s)                   10,452  70.23% 
 bench_mem_is_utf8_latin1_true_15                   4,023 (1864 MB/s)             5,660 (1325 MB/s)                      1,637  40.69% 
 bench_mem_utf16_valid_up_to_ascii_1000             49,362 (20258 MB/s)           72,247 (13841 MB/s)                   22,885  46.36% 
 bench_mem_utf16_valid_up_to_bmp_1000               49,362 (20258 MB/s)           72,216 (13847 MB/s)                   22,854  46.30% 
 bench_mem_utf16_valid_up_to_latin1_1000            49,363 (20258 MB/s)           72,247 (13841 MB/s)                   22,884  46.36% 

(The results being compared are merges of the fastest results of four runs with the simd crate and four runs with the stdsimd crate. The numbers are from Intel Core i7-4770 @ 3.40 GHz.)

To get the kind of performance one would expect, it seems to me that an operation that returns a boolean vector should return a vector of the same bit width as its operands, and the any() and all() reductions should be as simple as the ISA allows, as seen in the simd crate. (I'll file a separate issue about the latter.)

hsivonen commented 6 years ago

I'll file a different issue about the latter.

Filed #362.

hsivonen commented 6 years ago

The performance of the parts that use boolean vectors regressed significantly.

Note that these operate on UTF-16, so they compare u16x8 with lt. With simd, the result was bool16ix8. With stdsimd it's b8x8. I believe it needs to be b16x8 in order not to make portable SIMD too slow to be competitive with architecture-specific code.

gnzlbg commented 6 years ago

@hsivonen

Only b8xN types are provided (e.g. wider types like b32x4 are not provided).

Why?

Because converting from wider types to b8xN should either be very cheap or free when done correctly.

In particular, boolean operations always return b8xN right now, so there is nothing to convert if you are doing a.lt(b). When you have a wider type, you would need to do a conversion, and then a boolean reduction. If LLVM doesn't do the reduction directly in the wider type, then that's IMO either an LLVM bug or a bug on our side.

hsivonen commented 6 years ago

Because converting from wider types to b8xN should be either be very cheap, or free when done correctly.

What does "correctly" mean in this context?

In particular, boolean operations always return b8xN right now, so there is nothing to convert if you are doing a.lt(b).

I don't understand. If a and b are u16x8, lt should compile as _mm_cmplt_epi16, whose output is 128 bits wide and matches the semantics of bool16ix8 from the simd crate. In stdsimd, at the moment, the output is declared as 64 bits wide, so any operation that actually forces LLVM to materialize the 64-bit representation somewhere involves a conversion.

If LLVM doesn't do the reduction directly in the wider type, then that's IMO either an LLVM bug or a bug on our side.

The pattern I'm using doesn't logically involve LLVM having to materialize the 64-bit type in RAM. Logically, everything should happen in 128-bit registers:

        #[inline(always)]
        pub fn simd_is_basic_latin(s: u16x8) -> bool {
            let above_ascii = u16x8::splat(0x80);
            s.lt(above_ascii).all()
        }

Yet, performance suffers when going from simd to stdsimd.

It seems very brittle to me to define a narrower type (64 bits) and hope that LLVM never needs to actually materialize it as a 64-bit representation and operations actually get done in 128-bit registers.

In some sense, part of the problem here is that all types need to have a concrete memory representation. After all, logically in this case there are just 8 bits we care about, not even 64, but it's crucial for performance that the 8 bits are stretched across a 128-bit register. It seems to me that design used in the simd crate where the memory representation is declared as 128 bits wide when we want it to have a 128-bit register representation is the more robust way to go in practice. At least at present, the benchmarks are firmly in support of the design used in the simd crate.
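For reference, the SSE2 lowering being described, a single PMOVMSKB producing a scalar mask in an ALU register, followed by an ALU compare, can be sketched with the stable core::arch intrinsics (the function name is made up; a scalar fallback is included so the sketch runs anywhere):

```rust
// Sketch (SSE2 assumed on x86_64) of the pattern simd's all()
// compiles to: _mm_cmplt_epi16 produces all-ones lanes for true,
// _mm_movemask_epi8 collapses the 128-bit mask to a 16-bit scalar,
// and one ALU compare against 0xFFFF finishes the reduction.
#[cfg(target_arch = "x86_64")]
fn all_lanes_lt_0x80(s: [u16; 8]) -> bool {
    use std::arch::x86_64::*;
    unsafe {
        let v = _mm_loadu_si128(s.as_ptr() as *const __m128i);
        let limit = _mm_set1_epi16(0x80);
        // Signed compare is fine here: all values involved are < 0x8000.
        let mask = _mm_cmplt_epi16(v, limit);
        _mm_movemask_epi8(mask) == 0xFFFF
    }
}

// Scalar fallback so the sketch is portable.
#[cfg(not(target_arch = "x86_64"))]
fn all_lanes_lt_0x80(s: [u16; 8]) -> bool {
    s.iter().all(|&x| x < 0x80)
}

fn main() {
    assert!(all_lanes_lt_0x80([1, 2, 3, 4, 5, 6, 7, 0x7F]));
    assert!(!all_lanes_lt_0x80([1, 2, 3, 4, 5, 6, 7, 0x80]));
    println!("ok");
}
```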

hsivonen commented 6 years ago

To improve my understanding from a different angle:

What concrete problem (apart from the abstract problem of the number of types proliferating) is solved by having a single boolean vector type for a given lane count (as in stdsimd today) compared to having a boolean vector type for each lane count and lane width combination (as in simd)?

AFAICT, in realistic use cases, one wouldn't typically need to have boolean vectors of the same lane count but different lane widths interact without conversion ceremony.

gnzlbg commented 6 years ago

What does "correctly" mean in this context?

In a way that conveys to LLVM that these vectors are boolean vectors.

If a and b are u16x8, lt should compile as _mm_cmplt_epi16, whose output is 128 bits wide and matches the semantics of bool16ix8 from the simd crate.

It should compile to whatever instruction sequence makes your code run faster. If you don't use the result of that comparison, it should compile to nothing.

Yet, performance suffers when going from simd to stdsimd.

As mentioned in #362 this is currently an expected performance bug that needs to be fixed.

What concrete problem (apart from the abstract problem of the number of types proliferating) is solved by having a single boolean vector type for a given lane count (as in stdsimd today) compared to having a boolean vector type for each lane count and lane width combination (is in simd)?

It makes it easier for LLVM to optimize sequences of SIMD instructions across all ISAs.

hsivonen commented 6 years ago

What does "correctly" mean in this context?

In a way that conveys to LLVM that these vectors are boolean vectors.

Can we not convey to LLVM that b16x8 is a boolean vector?

Having a type whose declared bit width differs from the bit width that is performant on the machine still scares me. Also, it worries me not to be able to be sure that viewing a boolean vector as a bitmask is a zero-cost transmute.

encoding_rs doesn't have this code yet, but consider this SIMD implementation of converting x-user-defined to UTF-16 using the simd crate:

#[inline(always)]
fn shift_upper(s: u16x8) -> u16x8 {
  let highest_ascii = u16x8::splat(0x7F);
  let offset = u16x8::splat(0xF700);
  let mask = s.gt(highest_ascii).to_repr().to_u16();
  s + (offset & mask)
}

#[inline(always)]
fn decode_x_user_defined_to_utf16(b: u8x16) -> (u16x8, u16x8) {
  let (a, b) = unpack(b); // zero extends one u8x16 into two u16x8s
  (shift_upper(a), shift_upper(b))
}

With the simd crate it's obvious that the .to_repr().to_u16() piece is purely a compile-time, type-system-level thing that's a no-op for ISA purposes. I don't need to convince myself that the compiler is Sufficiently Smart to make it a no-op. (It would be nicer, though, if I could get the u16x8 view with one method call instead of a chain of two.)

With stdsimd, it's not even obvious at the API level how I could turn a b8x8 into a u16x8 each of whose lanes is 0 or 0xFFFF, so I can't even get to the point of pondering whether the compiler is Sufficiently Smart to optimize it all away so that, at the ISA level, all that happens is the lane-wise greater-than followed by the bitwise AND on the register contents it produced.

It may look like I'm flipflopping by first complaining that comparisons don't return boolean vectors and then turning around and complaining that I don't get the result as integer lanes, but I have previously emphasized that it's both important that comparisons signal on the type system level that all bits on each lane are either set or unset and that it is important that this type layer can be waived at zero cost to get at the actual bits.
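The per-lane semantics of shift_upper above can be modeled in scalar Rust. This is a sketch under the assumption that x-user-defined maps bytes >= 0x80 to U+F780..U+F7FF, i.e. adds 0xF700 to the zero-extended byte; the helper name is made up:

```rust
// Scalar model of one u16 lane of shift_upper: the gt mask is
// all-ones or all-zero, so ANDing it with the offset applies the
// shift only to lanes above the ASCII range, with no branches.
fn shift_upper_lane(s: u16) -> u16 {
    let mask: u16 = if s > 0x7F { 0xFFFF } else { 0x0000 };
    // wrapping_add mirrors SIMD lane-wise addition semantics.
    s.wrapping_add(0xF700 & mask)
}

fn main() {
    assert_eq!(shift_upper_lane(0x41), 0x41); // ASCII passes through
    assert_eq!(shift_upper_lane(0x80), 0xF780); // upper half is shifted
    assert_eq!(shift_upper_lane(0xFF), 0xF7FF);
    println!("ok");
}
```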

It should compile to whatever instruction sequence makes your code run faster.

Is there a concrete example of this within the bounds of the target having the same register width as the code uses? (I.e. it's a priority for me that u16x8 compiled to SSE2 or NEON is fast. It's less important whether u16x8 auto-unrolls to u16x16 when compiling for an AVX-enabled target.)

Superficially, it seems unbelievable that any() and all() could compile to anything faster than the manual implementations that the simd crate provides for SSE2 or aarch64 when targeting SSE2 or aarch64, respectively. (In both cases, a single vector instruction that produces a value in an ALU register followed by an ALU comparison of that value with a constant.)

It makes it easier for LLVM to optimize sequences of SIMD instructions across all ISAs.

Considering that sequences of operations on boolean vectors are bitwise ops on the whole register and the simd crate hand-rolled any() and all() for SSE2 and aarch64 as simple as described above, it's hard to see the optimization opportunity (excluding unrolling to AVX). Is there a concrete example?

As mentioned in #362 this is currently an expected performance bug that needs to be fixed.

It's unclear to me if you mean a performance bug in the sense that you have something simple in mind that makes stdsimd match the speed of simd with LLVM 6 or if you mean in the sense of LLVM needing to become Sufficiently Smart.

gnzlbg commented 6 years ago

Can we not convey to LLVM that b16x8 is a boolean vector?

Yes, we can. As stated previously, we are currently not doing it, which is why this bug happens.

Also, it worries me not to be able to be sure that viewing a boolean vector as a bitmask is a zero-cost transmute.

If it isn't zero-cost, then it's a bug.

Is there a concrete example of this within the bounds of the target having the same register width as the code uses?

If you do operations on floating-point vectors and then move them to integer vectors to do some more operations, LLVM often performs the floating-point operations on integer vectors directly to avoid the latency of switching domains on the CPU. This can happen implicitly if you return a floating-point vector from a function, for example, because such vectors must be passed around as integer vectors due to the ABI.

It's unclear to me if you mean a performance bug in the sense that you have something simple in mind that makes stdsimd match the speed of simd with LLVM 6 or if you mean in the sense of LLVM needing to become Sufficiently Smart.

LLVM has a function that does and-reductions on vectors, called llvm.experimental.vector.reduce.and. This function works on any vector.

When you produce a boolean vector using a comparison, like a.gt(b), Rust calls llvm.experimental.vector.reduce.and(<N x i1>) and LLVM generates good code.

When you don't produce a boolean vector using a comparison, Rust calls llvm.experimental.vector.reduce.and(<N x iM>) for M > 1, and LLVM generates code for integer vectors, which is bad code if your vector is actually a boolean vector.

This is the bug.

Also, this bug is completely independent of the width of a boolean vector. Even if we added b32x8, we would still get bad code-gen if we called llvm.experimental.vector.reduce.and(<8 x i32>) instead of llvm.experimental.vector.reduce.and(<8 x i1>).

So "Do we need wider boolean vector types?" and "boolean reductions produce bad codegen" are two completely unrelated issues.

gnzlbg commented 6 years ago

If you want to argue for wider boolean vectors here using examples of bad codegen to motivate their need, you should be using examples that do not trigger the codegen bug. Otherwise we can't tell whether the bad codegen is due to the bug, or due to the type not being wide enough.

hsivonen commented 6 years ago

If you want to argue for wider boolean vectors here using examples of bad codegen to motivate their need, you should be using examples that do not trigger the codegen bug.

I intentionally filed the reduction perf bug separately and showed a bitmask example when discussing width.

Regardless of reductions, and regardless of the bitmask case, which I haven't benchmarked (due to not knowing the answer to the question below), I'm worried that having a type whose memory representation differs from what can be loaded into the efficient register representation (using movdqa in the SSE2 case) will be bad whenever the type needs to be materialized in memory. The Rust ABI passing SIMD types on the stack instead of in registers seems like a notable potential source of having to materialize the type in memory.

Also, it worries me not to be able to be sure that viewing a boolean vector as a bitmask is a zero-cost transmute.

If it isn't zero-cost, then it's a bug.

With the stdsimd API, what's the correct way to go from a b8x8 to a u16x8 bitmask?

gnzlbg commented 6 years ago

With the stdsimd API, what's the correct way to go from a b8x8 to a u16x8 bitmask?

From/Into is the right way to do that.

I'm worried that having a type whose memory representation differs from what can be loaded into the efficient register representation using movdqa in the SSE2 case will be bad in cases where the type needs to be materialized in memory.

I understand the worry, but right now it is just that: a worry.

hsivonen commented 6 years ago

With the stdsimd API, what's the correct way to go from a b8x8 to a u16x8 bitmask?

From/Into is the right way to do that.

I don't see this case implemented. What am I missing?

error[E0277]: the trait bound `stdsimd::simd::u16x8: std::convert::From<stdsimd::simd::b8x8>` is not satisfied
  --> src/x_user_defined.rs:23:58
   |
23 |             let mask: u16x8 = unpacked.gt(highest_ascii).into();
   |                                                          ^^^^ the trait `std::convert::From<stdsimd::simd::b8x8>` is not implemented for `stdsimd::simd::u16x8`
   |
   = help: the following implementations were found:
             <stdsimd::simd::u16x8 as std::convert::From<stdsimd::simd::f64x8>>
             <stdsimd::simd::u16x8 as std::convert::From<stdsimd::simd::u32x8>>
             <stdsimd::simd::u16x8 as std::convert::From<stdsimd::simd::i32x8>>
             <stdsimd::simd::u16x8 as std::convert::From<stdsimd::simd::i16x8>>
           and 5 others
   = note: required because of the requirements on the impl of `std::convert::Into<stdsimd::simd::u16x8>` for `stdsimd::simd::b8x8`

gnzlbg commented 6 years ago

I don't see this case implemented. What am I missing?

Nothing, the error is correct; the implementation is currently missing. Looks like an oversight. EDIT: filed #370 to track this.

gnzlbg commented 6 years ago

This has been implemented already.