I'm not convinced that we should be adding boolean vectors to the vendor intrinsics. Vendor intrinsics are supposed to be a low-level interface to vendor-specific APIs, and none of the vendor-specific APIs (at least for Intel) have a concept of a boolean vector. One may be implied by the stated contracts of certain vendor intrinsics, but that means we need to go through all of them and vet them ourselves. We're already trying to improve the state of the world a little bit with more granular integer vector types, but only because others have (for the most part) already done that work for us.
I grant that boolean vectors may be a nice option for the portable API.
A `simd` PR is probably of interest when implementing this.
I think that none of this must block stabilization (although @hsivonen might disagree).
It seems like a problem if the return type of the comparison methods or vendor operations changes on stable.
In addition to what the initial comment says, I think it would be good to have API surface for safely bitcasting `bNxM` to `uNxM` and `iNxM` (possibly even `fNxM`). The reverse should be `unsafe`, obviously.
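For concreteness, a minimal sketch of the API shape I mean, with plain arrays standing in for the real vector types (all names here are illustrative, not actual `stdsimd` API):

```rust
#[derive(Clone, Copy)]
pub struct U8x8([u8; 8]);

#[derive(Clone, Copy)]
pub struct B8x8([u8; 8]); // invariant: every lane is 0x00 or 0xFF

// Safe direction: any valid boolean vector is a valid integer vector.
impl From<B8x8> for U8x8 {
    fn from(b: B8x8) -> U8x8 {
        U8x8(b.0)
    }
}

impl B8x8 {
    // Unsafe direction: the caller must guarantee that every lane of
    // `v` is all-zeros or all-ones, otherwise the boolean invariant
    // is broken.
    pub unsafe fn from_bits(v: U8x8) -> B8x8 {
        B8x8(v.0)
    }
}
```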
> none of the vendor-specific APIs (at least for Intel) have a concept of a boolean vector
Every operation (including in the Intel case) whose output is defined to set all bits of a given lane to either zero or one conceptually returns a boolean vector even if the concept isn't a stated part of the Intel C API.
As for inputs, safe and convenient bitcasting from `bNxM` to `uNxM` and `iNxM` is needed so that boolean vectors don't hinder their usage as inputs to operations that don't care about the precondition that each lane is either all-zero or all-one.
@hsivonen
> It seems like a problem if the return type of the comparison methods or vendor operations changes on stable.
What I was thinking is that we not stabilize those intrinsics where we are unsure whether we might want to make them return boolean vectors. That would mean one must use Rust nightly to use those, and the nightly code might break if we switch them to boolean vectors.
@BurntSushi
> but that means we need to go through all of them and vet them ourselves.
Yes, that is what it would mean.
So I wanted to do this, but before doing so, I wanted to cheat a bit and look at the `simd` crate for examples of intrinsics in which boolean vector types are used. The `simd` crate has the following boolean types (declared roughly as sketched below):

- `bool8i(i8)`, `bool16i(i16)`, `bool32i(i32)`, `bool32f(i32)`
- `bool32ix2(i32, i32)`, `bool32fx2(i32, i32)`
- `bool32fx4(i32, i32, i32, i32)`, `bool32ix4(i32, i32, i32, i32)`
- `bool16ix8(i16, ..., i16)`, `bool8ix16(i8, ..., i8)`
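For reference, the `simd` crate declares these as tuple structs with the unstable `#[repr(simd)]` attribute, along these lines (a sketch; nightly-only):

```rust
#![feature(repr_simd)]
#![allow(non_camel_case_types)]

// Each lane of a "true" holds all-ones and of a "false" all-zeros;
// the attribute gives the tuple structs vector layout.
#[repr(simd)]
#[derive(Debug, Copy, Clone)]
pub struct bool32ix4(i32, i32, i32, i32);

#[repr(simd)]
#[derive(Debug, Copy, Clone)]
pub struct bool16ix8(i16, i16, i16, i16, i16, i16, i16, i16);
```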
But... I cannot find a single intrinsic in the whole `simd` crate that uses them. The `simd` crate defines them as something extra, and implements some portable operations for them, but doesn't really use them. So... I don't really see the point of adding them if they are not used as the return type of some intrinsics.
@hsivonen is this correct? If so, I don't understand why they can't be provided by a different crate.
If what you wanted is for them to be used as the return type of some intrinsics, then I think you need to make a better case and tell us that a sufficient number of intrinsics require them (e.g. by reviewing the x86 SIMD intrinsics and letting us know what you find).
> The simd crate defines them as something extra, and implements some portable operations for them, but doesn't really use them. So... I don't really see the point of adding them if they are not used as the return type of some intrinsics.
>
> @hsivonen is this correct?
No. The `simd` crate uses them as return types for comparisons. The corresponding intrinsics are `simd_le`, `simd_eq`, etc.
As for examples of vendor intrinsics that `platform-intrinsics` (`simd_le`, `simd_eq`, etc.) don't abstract over but that logically return boolean vectors, a quick look at the SSE family yields NaN checks: `_mm_cmpord_ps`.
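To illustrate the contract (my example, not from the thread): each lane of the `_mm_cmpord_ps` result is all-ones when neither input lane is NaN and all-zeros otherwise, i.e. a boolean vector in all but name:

```rust
// Sketch using today's std::arch paths; treat the exact module paths
// as illustrative for the stdsimd-era code discussed here.
#[cfg(target_arch = "x86_64")]
fn ordered_lanes(a: [f32; 4], b: [f32; 4]) -> [bool; 4] {
    use std::arch::x86_64::*;
    unsafe {
        let va = _mm_loadu_ps(a.as_ptr());
        let vb = _mm_loadu_ps(b.as_ptr());
        // 0xFFFFFFFF per lane if both lanes are ordered (not NaN),
        // 0x00000000 otherwise.
        let mask = _mm_cmpord_ps(va, vb);
        // Collapse each lane's sign bit into the low 4 bits.
        let bits = _mm_movemask_ps(mask);
        [bits & 1 != 0, bits & 2 != 0, bits & 4 != 0, bits & 8 != 0]
    }
}
```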
> As for examples of vendor intrinsics that platform-intrinsics (simd_le, simd_eq, etc.) don't abstract over but that logically return boolean vectors, a quick look at the SSE family yields NaN checks: _mm_cmpord_ps.
It would be nice to have a comprehensive list so that we can make an informed decision.
Boolean vectors have been implemented in #338. Only `b8xN` types are provided (e.g. wider types like `b32x4` are not provided).
I've taken a look at the assembly generated for the SSE2 `_mm_cmpeq_epi16` intrinsic

```c
// Compares two vectors of packed 16-bit integers
__m128i _mm_cmpeq_epi16 (__m128i a, __m128i b)
```

which returns an `__m128i` that, when transmuted into an `i16x8`, has lanes that only contain either `0x0000` (false) or `0xFFFF` (true).
First, I considered these two cases:
```rust
pub unsafe fn foo(x: i16) -> __m128i {
    let a = _mm_setr_epi16(x, 1, 2, 3, 4, 5, 6, 7);
    let b = _mm_setr_epi16(7, x, 2, 4, 3, 2, 1, 0);
    _mm_cmpeq_epi16(a, b)
}

pub unsafe fn bar(x: i16) -> i16x8 {
    let a = i16x8::new(x, 1, 2, 3, 4, 5, 6, 7);
    let b = i16x8::new(7, x, 2, 4, 3, 2, 1, 0);
    _mm_cmpeq_epi16(a.into_bits(), b.into_bits()).into_bits()
}
```
which produce identical assembly; `i16x8` has no overhead over `__m128i`, as expected. Then I switched `bar` to output only whether the third element is `true` or `false`, and added another function that does the same using boolean vectors:
```rust
pub unsafe fn bar2(x: i16) -> bool {
    let a = i16x8::new(x, 1, 2, 3, 4, 5, 6, 7);
    let b = i16x8::new(7, x, 2, 4, 3, 2, 1, 0);
    let b: i16x8 = _mm_cmpeq_epi16(a.into_bits(), b.into_bits()).into_bits();
    b.extract(3) == !0 // all bits of the lane set
}

pub unsafe fn baz(x: i16) -> bool {
    let a = i16x8::new(x, 1, 2, 3, 4, 5, 6, 7);
    let b = i16x8::new(7, x, 2, 4, 3, 2, 1, 0);
    let b = b8x8::from_bits(i8x8::from(i16x8::from_bits(
        _mm_cmpeq_epi16(a.into_bits(), b.into_bits()),
    )));
    b.extract(3)
}
```
The functions `bar2` and `baz` produce identical assembly as well. If we switch these functions to return the vectors, the assembly differs slightly, because the types being returned are different, but both look equally expensive.
@rkruppe pointed out that LLVM comparisons actually return `<N x i1>`, but the best we can do in Rust is probably store them as `<N x i8>`. AFAICT the `b8xN` vectors should be enough, but if we ever need wider boolean vectors we can always add them later.
> Boolean vectors have been implemented in #338.
Thank you!
> Only b8xN types are provided (e.g. wider types like b32x4 are not provided).
Why? I ported encoding_rs to `stdsimd`. The performance of the parts that use boolean vectors regressed significantly.
```
$ cargo benchcmp --threshold 25 --regressions trunk_2018-03-08.txt stdsimd_2018-03-08.txt
name  trunk_2018-03-08.txt ns/iter  stdsimd_2018-03-08.txt ns/iter  diff ns/iter  diff %
bench_encode_from_utf16_en  5,730 (11294 MB/s)  7,836 (8259 MB/s)  2,106  36.75%
bench_encode_from_utf16_en_windows_1252  6,527 (9925 MB/s)  8,434 (7681 MB/s)  1,907  29.22%
bench_encode_from_utf16_jquery  6,347 (13661 MB/s)  8,866 (9779 MB/s)  2,519  39.69%
bench_encode_from_utf16_jquery_windows_1252  5,853 (14814 MB/s)  7,923 (10943 MB/s)  2,070  35.37%
bench_mem_check_utf16_for_latin1_and_bidi_de_1000  44,481 (22481 MB/s)  73,276 (13647 MB/s)  28,795  64.74%
bench_mem_check_utf16_for_latin1_and_bidi_ja_1000  125,719 (7954 MB/s)  183,051 (5462 MB/s)  57,332  45.60%
bench_mem_check_utf16_for_latin1_and_bidi_th_1000  89,725 (11145 MB/s)  162,244 (6163 MB/s)  72,519  80.82%
bench_mem_convert_utf16_to_str_ascii_1000  43,173 (23162 MB/s)  59,020 (16943 MB/s)  15,847  36.71%
bench_mem_convert_utf16_to_utf8_ascii_1000  40,040 (24975 MB/s)  55,908 (17886 MB/s)  15,868  39.63%
bench_mem_copy_basic_latin_to_ascii_1000  34,685 (28830 MB/s)  51,144 (19552 MB/s)  16,459  47.45%
bench_mem_copy_basic_latin_to_ascii_16  1,308 (12232 MB/s)  1,669 (9586 MB/s)  361  27.60%
bench_mem_ensure_utf16_validity_ascii_1000  49,748 (20101 MB/s)  71,781 (13931 MB/s)  22,033  44.29%
bench_mem_ensure_utf16_validity_bmp_1000  49,752 (20099 MB/s)  71,720 (13943 MB/s)  21,968  44.16%
bench_mem_ensure_utf16_validity_latin1_1000  49,748 (20101 MB/s)  72,688 (13757 MB/s)  22,940  46.11%
bench_mem_is_basic_latin_false_15  1,333 (11252 MB/s)  1,935 (7751 MB/s)  602  45.16%
bench_mem_is_basic_latin_true_1000  14,957 (66858 MB/s)  23,125 (43243 MB/s)  8,168  54.61%
bench_mem_is_str_latin1_true_3  1,024 (1464 MB/s)  1,936 (774 MB/s)  912  89.06%
bench_mem_is_str_latin1_true_30  3,558 (4215 MB/s)  4,563 (3287 MB/s)  1,005  28.25%
bench_mem_is_utf16_bidi_de_1000  38,694 (25843 MB/s)  71,216 (14041 MB/s)  32,522  84.05%
bench_mem_is_utf16_bidi_ja_1000  125,718 (7954 MB/s)  183,042 (5463 MB/s)  57,324  45.60%
bench_mem_is_utf16_bidi_ru_1000  61,462 (16270 MB/s)  96,280 (10386 MB/s)  34,818  56.65%
bench_mem_is_utf16_bidi_th_1000  89,466 (11177 MB/s)  162,250 (6163 MB/s)  72,784  81.35%
bench_mem_is_utf16_latin1_true_1000  14,883 (67190 MB/s)  25,335 (39471 MB/s)  10,452  70.23%
bench_mem_is_utf8_latin1_true_15  4,023 (1864 MB/s)  5,660 (1325 MB/s)  1,637  40.69%
bench_mem_utf16_valid_up_to_ascii_1000  49,362 (20258 MB/s)  72,247 (13841 MB/s)  22,885  46.36%
bench_mem_utf16_valid_up_to_bmp_1000  49,362 (20258 MB/s)  72,216 (13847 MB/s)  22,854  46.30%
bench_mem_utf16_valid_up_to_latin1_1000  49,363 (20258 MB/s)  72,247 (13841 MB/s)  22,884  46.36%
```
(The results being compared are merges of the fastest results of four runs with the `simd` crate and four runs with the `stdsimd` crate. The numbers are from an Intel Core i7-4770 @ 3.40 GHz.)
To have the kind of performance that one would expect, it seems to me that an operation that returns a boolean vector should return a vector of the same bit width as the operands, and the `any()` and `all()` reductions should be as simple as the ISA allows, as seen in the `simd` crate. (I'll file a different issue about the latter.)
> I'll file a different issue about the latter.
Filed #362.
> The performance of the parts that use boolean vectors regressed significantly.
Note that these operate on UTF-16, so they compare `u16x8` with `lt`. With `simd`, the result was `bool16ix8`. With `stdsimd` it's `b8x8`. I believe it needs to be `b16x8` in order not to make portable SIMD too slow to be competitive with architecture-specific code.
@hsivonen
> Only b8xN types are provided (e.g. wider types like b32x4 are not provided).
>
> Why?
Because converting from wider types to `b8xN` should either be very cheap, or free when done correctly. In particular, boolean operations always return `b8xN` right now, so there is nothing to convert if you are doing `a.lt(b)`. When you have a wider type, you would need to do a conversion, and then a boolean reduction. If LLVM doesn't do the reduction directly in the wider type, then that's IMO either an LLVM bug or a bug on our side.
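To make "a conversion, and then a boolean reduction" concrete, here is the conversion step in scalar terms (plain arrays stand in for a `b16x8`-style mask and `b8xN`; illustrative only):

```rust
// Each input lane is 0x0000 or 0xFFFF, so truncating to 8 bits is
// lossless for boolean vectors; this is the step LLVM should fold
// into the reduction.
fn narrow_mask(wide: [u16; 8]) -> [u8; 8] {
    let mut out = [0u8; 8];
    for i in 0..8 {
        out[i] = wide[i] as u8; // 0xFFFF -> 0xFF, 0x0000 -> 0x00
    }
    out
}

fn all(mask: [u8; 8]) -> bool {
    mask.iter().all(|&lane| lane == 0xFF)
}
```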
> Because converting from wider types to b8xN should either be very cheap, or free when done correctly.
What does "correctly" mean in this context?
> In particular, boolean operations always return b8xN right now, so there is nothing to convert if you are doing a.lt(b).
I don't understand. If `a` and `b` are `u16x8`, `lt` should compile as `_mm_cmplt_epi16`, whose output is 128 bits wide and matches the semantics of `bool16ix8` from the `simd` crate. In `stdsimd`, at the moment, the output is declared as 64 bits wide, so any operation that actually forces LLVM to materialize the 64-bit representation somewhere involves a conversion.
> If LLVM doesn't do the reduction directly in the wider type, then that's IMO either an LLVM bug or a bug on our side.
The pattern I'm using doesn't logically involve LLVM having to materialize the 64-bit type in RAM. Logically, everything should happen in 128-bit registers:
```rust
#[inline(always)]
pub fn simd_is_basic_latin(s: u16x8) -> bool {
    let above_ascii = u16x8::splat(0x80);
    s.lt(above_ascii).all()
}
```
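For comparison, this is roughly the by-hand SSE2 lowering I expect (my sketch, written with today's `std::arch` names; SSE2 has no unsigned 16-bit compare, hence the saturating-subtract trick):

```rust
#[cfg(target_arch = "x86_64")]
unsafe fn simd_is_basic_latin_sse2(s: std::arch::x86_64::__m128i) -> bool {
    use std::arch::x86_64::*;
    // A u16 lane is < 0x80 exactly when subtracting 0x7F with unsigned
    // saturation yields zero.
    let diff = _mm_subs_epu16(s, _mm_set1_epi16(0x7F));
    let mask = _mm_cmpeq_epi16(diff, _mm_setzero_si128());
    // all(): one movemask into an ALU register, one compare.
    _mm_movemask_epi8(mask) == 0xFFFF
}
```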
Yet, performance suffers when going from `simd` to `stdsimd`.
It seems very brittle to me to define a narrower type (64 bits) and hope that LLVM never needs to actually materialize it as a 64-bit representation, and that operations actually get done in 128-bit registers.
In some sense, part of the problem here is that all types need to have a concrete memory representation. After all, logically in this case there are just 8 bits we care about, not even 64, but it's crucial for performance that the 8 bits are stretched across a 128-bit register. It seems to me that the design used in the `simd` crate, where the memory representation is declared as 128 bits wide when we want it to have a 128-bit register representation, is the more robust way to go in practice. At least at present, the benchmarks are firmly in support of the design used in the `simd` crate.
To improve my understanding from a different angle: what concrete problem (apart from the abstract problem of the number of types proliferating) is solved by having a single boolean vector type for a given lane count (as in `stdsimd` today) compared to having a boolean vector type for each lane count and lane width combination (as in `simd`)?
AFAICT, in realistic use cases, one wouldn't typically need to have boolean vectors of the same lane count but different lane widths interact without conversion ceremony.
> What does "correctly" mean in this context?
In a way that conveys to LLVM that these vectors are boolean vectors.
> If a and b are u16x8, lt should compile as _mm_cmplt_epi16, whose output is 128 bits wide and matches the semantics of bool16ix8 from the simd crate.
It should compile to whatever instruction sequence makes your code run faster. If you don't use the result of that comparison, it should compile to nothing.
> Yet, performance suffers when going from simd to stdsimd.
As mentioned in #362, this is currently an expected performance bug that needs to be fixed.
> What concrete problem (apart from the abstract problem of the number of types proliferating) is solved by having a single boolean vector type for a given lane count (as in stdsimd today) compared to having a boolean vector type for each lane count and lane width combination (as in simd)?
It makes it easier for LLVM to optimize sequences of SIMD instructions across all ISAs.
> > What does "correctly" mean in this context?
>
> In a way that conveys to LLVM that these vectors are boolean vectors.
Can we not convey to LLVM that `b16x8` is a boolean vector?
Having a type whose declared bit width differs from the bit width that is performant on the machine still scares me. Also, it worries me not to be able to be sure that viewing a boolean vector as a bitmask is a zero-cost transmute.
encoding_rs doesn't have this code yet, but consider this SIMD implementation of converting x-user-defined to UTF-16 using the `simd` crate:
```rust
#[inline(always)]
fn shift_upper(s: u16x8) -> u16x8 {
    let highest_ascii = u16x8::splat(0x7F);
    let offset = u16x8::splat(0xF700);
    let mask = s.gt(highest_ascii).to_repr().to_u16();
    s + (offset & mask)
}

#[inline(always)]
fn decode_x_user_defined_to_utf16(b: u8x16) -> (u16x8, u16x8) {
    let (a, b) = unpack(b); // zero-extends one u8x16 into two u16x8s
    (shift_upper(a), shift_upper(b))
}
```
With the `simd` crate it's obvious that the `.to_repr().to_u16()` piece is purely a compile-time, type-system-level thing that's a no-op for ISA purposes. I don't need to convince myself that the compiler is Sufficiently Smart to make it a no-op. (It would be nicer, though, if I could get the `u16x8` view with one method call instead of a chain of two.)
With `stdsimd`, it's not even obvious at the API level how I could turn a `b8x8` into a `u16x8` each of whose lanes is 0 or 0xFFFF, so I can't even get to the point of pondering whether the compiler is Sufficiently Smart to optimize it all away so that everything that happens at the ISA level is the lane-wise greater-than and the next operation is the bitwise AND on the register contents obtained from the lane-wise greater-than.
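The operation I'm after amounts to this, in scalar terms (plain arrays standing in for `b8x8` and `u16x8`; illustrative only):

```rust
// Widen each all-zeros/all-ones 8-bit lane into an all-zeros/all-ones
// 16-bit lane; sign extension does exactly that.
fn widen_mask(narrow: [u8; 8]) -> [u16; 8] {
    let mut out = [0u16; 8];
    for i in 0..8 {
        // 0xFF as i8 is -1; sign-extending to 16 bits gives 0xFFFF.
        out[i] = (narrow[i] as i8) as i16 as u16;
    }
    out
}
```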
It may look like I'm flip-flopping by first complaining that comparisons don't return boolean vectors and then turning around and complaining that I don't get the result as integer lanes, but I have previously emphasized that it's both important that comparisons signal on the type-system level that all bits of each lane are either set or unset, and important that this type layer can be waived at zero cost to get at the actual bits.
> It should compile to whatever instruction sequence makes your code run faster.
Is there a concrete example of this within the bounds of the target having the same register width as the code uses? (I.e. it's a priority for me that `u16x8` compiled to SSE2 or NEON is fast. It's less important whether `u16x8` auto-unrolls to `u16x16` when compiling for an AVX-enabled target.)
Superficially, it seems unbelievable that `any()` and `all()` could compile to anything faster than the manual implementations that the `simd` crate provides for SSE2 or aarch64 when targeting SSE2 or aarch64, respectively. (In both cases, a single vector instruction that produces a value in an ALU register, followed by an ALU comparison of that value with a constant.)
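Concretely, the SSE2 pattern I mean looks like this (a sketch with vendor intrinsics; the function names are mine, not the `simd` crate's internals):

```rust
#[cfg(target_arch = "x86_64")]
unsafe fn all(mask: std::arch::x86_64::__m128i) -> bool {
    use std::arch::x86_64::_mm_movemask_epi8;
    // One vector instruction into an ALU register, then one ALU
    // comparison against a constant.
    _mm_movemask_epi8(mask) == 0xFFFF
}

#[cfg(target_arch = "x86_64")]
unsafe fn any(mask: std::arch::x86_64::__m128i) -> bool {
    use std::arch::x86_64::_mm_movemask_epi8;
    _mm_movemask_epi8(mask) != 0
}
```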
> It makes it easier for LLVM to optimize sequences of SIMD instructions across all ISAs.
Considering that sequences of operations on boolean vectors are bitwise ops on the whole register, and that the `simd` crate hand-rolled `any()` and `all()` for SSE2 and aarch64 to be as simple as described above, it's hard to see the optimization opportunity (excluding unrolling to AVX). Is there a concrete example?
> As mentioned in #362, this is currently an expected performance bug that needs to be fixed.
It's unclear to me if you mean a performance bug in the sense that you have something simple in mind that makes `stdsimd` match the speed of `simd` with LLVM 6, or if you mean in the sense of LLVM needing to become Sufficiently Smart.
> Can we not convey to LLVM that b16x8 is a boolean vector?
Yes, we can. As stated previously, we are currently not doing it; that is why this bug happens.
> Also, it worries me not to be able to be sure that viewing a boolean vector as a bitmask is a zero-cost transmute.
If it isn't zero-cost, then it's a bug.
> Is there a concrete example of this within the bounds of the target having the same register width as the code uses?
If you do operations on floating-point vectors and then move them to integer vectors to do some more operations, LLVM often performs the floating-point operations on integer vectors directly to avoid the latency of switching domains on the CPU. This can happen implicitly if you return a floating-point vector from a function, for example, because they must be passed around as integer vectors due to the ABI.
> It's unclear to me if you mean a performance bug in the sense that you have something simple in mind that makes stdsimd match the speed of simd with LLVM 6, or if you mean in the sense of LLVM needing to become Sufficiently Smart.
LLVM has a function that does `and` reductions on vectors, called `llvm.experimental.vector.reduce.and`. This function works on any vector.

When you produce a boolean vector using a comparison, like `a.gt(b)`, Rust calls `llvm.experimental.vector.reduce.and(<N x i1>)` and LLVM generates good code.

When you don't produce a boolean vector using a comparison, Rust calls `llvm.experimental.vector.reduce.and(<N x iM>)` for M > 1, and LLVM generates code for integer vectors, which is bad code if your vector is a boolean vector.

This is the bug.
Also, this bug is completely independent of the width of a boolean vector. Even if we added `b32x8` we would still get bad code-gen if we called `llvm.experimental.vector.reduce.and(<8 x i32>)` instead of `llvm.experimental.vector.reduce.and(<8 x i1>)`.
So "Do we need wider boolean vector types?" and "boolean reductions produce bad codegen" are two completely unrelated issues.
If you want to argue for wider boolean vectors here using examples of bad codegen to motivate their need, you should be using examples that do not trigger the codegen bug. Otherwise we can't tell whether the bad codegen is due to the bug, or due to the type not being wide enough.
> If you want to argue for wider boolean vectors here using examples of bad codegen to motivate their need, you should be using examples that do not trigger the codegen bug.
I intentionally filed the reduction perf bug separately and showed a bitmask example when discussing width.
Regardless of reductions, and regardless of the bitmask case, which I haven't benchmarked (due to not knowing the answer to the question below), I'm worried that having a type whose memory representation differs from what can be loaded into the efficient register representation using `movdqa` in the SSE2 case will be bad in cases where the type needs to be materialized in memory. Making the Rust ABI pass SIMD types on the stack instead of in registers seems like a notable potential source of having to materialize the type in memory.
> > Also, it worries me not to be able to be sure that viewing a boolean vector as a bitmask is a zero-cost transmute.
>
> If it isn't zero-cost, then it's a bug.
With the `stdsimd` API, what's the correct way to go from a `b8x8` to a `u16x8` bitmask?
> With the stdsimd API, what's the correct way to go from a b8x8 to a u16x8 bitmask?
`From`/`Into` is the right way to do that.
> I'm worried that having a type whose memory representation differs from what can be loaded into the efficient register representation using movdqa in the SSE2 case will be bad in cases where the type needs to be materialized in memory.
I understand the worry, but right now it is just that: a worry.
> > With the stdsimd API, what's the correct way to go from a b8x8 to a u16x8 bitmask?
>
> `From`/`Into` is the right way to do that.
I don't see this case implemented. What am I missing?
```
error[E0277]: the trait bound `stdsimd::simd::u16x8: std::convert::From<stdsimd::simd::b8x8>` is not satisfied
  --> src/x_user_defined.rs:23:58
   |
23 |     let mask: u16x8 = unpacked.gt(highest_ascii).into();
   |                                                  ^^^^ the trait `std::convert::From<stdsimd::simd::b8x8>` is not implemented for `stdsimd::simd::u16x8`
   |
   = help: the following implementations were found:
             <stdsimd::simd::u16x8 as std::convert::From<stdsimd::simd::f64x8>>
             <stdsimd::simd::u16x8 as std::convert::From<stdsimd::simd::u32x8>>
             <stdsimd::simd::u16x8 as std::convert::From<stdsimd::simd::i32x8>>
             <stdsimd::simd::u16x8 as std::convert::From<stdsimd::simd::i16x8>>
           and 5 others
   = note: required because of the requirements on the impl of `std::convert::Into<stdsimd::simd::u16x8>` for `stdsimd::simd::b8x8`
```
> I don't see this case implemented. What am I missing?
Nothing, the error is correct; the implementation is currently missing. Looks like an oversight. EDIT: filed #370 to track this.
This has been implemented already.
I've come around (thanks @hsivonen for being persuasive) and I think it makes sense to implement boolean vectors as part of `stdsimd`. I think we could:

- follow the design of the `simd` crate (we don't need to be 1:1 on par in functionality in the initial implementation; we can add more features to boolean vectors later in a backwards-compatible way)

I think that none of this must block stabilization (although @hsivonen might disagree). First, IIUC the timeline is as follows: someday somebody will integrate this as part of `core`, and `stdsimd` will be shipped to nightly users behind some feature flag. We should try hard not to break stuff from then on, but we can technically still break some things (like changing the return type of an unstable intrinsic to use boolean vectors). One release cycle later we will want to start stabilizing stuff. For that we will probably have to whitelist intrinsics as stable on a case-by-case basis. Once we have done that, we need to write an RFC with all those intrinsics inside and submit it to the RFC repo.

If we feel that boolean vectors aren't ready by then, we don't stabilize them nor any intrinsic that we think might want to use them in the future. But I'd like to defer that discussion until then. For all we know, boolean vectors might turn out to be ready, and we might want to just ship them.
Thoughts?