This PR adds a simd_extract_msbs intrinsic, along with an extract_msbs re-export to core:simd. This intrinsic extracts the most significant bit of each element of a #simd vector and packs them into a bit_set. This behavior is similar to the SSE/AVX movemask intrinsic/movmsk instruction. This intrinsic is defined similarly to:
simd_extract_msbs :: proc(a: #simd[N]T) -> bit_set[0..<N] where type_is_integer(T) || type_is_boolean(T)
since elements 0, 1, 4, and 7 have their most significant bits set (due to being negative).
This intrinsic is particularly useful in conjunction with lane-wise comparison masks, in a few particular use cases that I've found so far.
Counting matches. This can be useful on its own or in conjunction with masked_compress_store, for instance, where card(extract_msbs(v)) can be used to determine how many elements will be written. Examples:
count_range :: proc (src: []#simd[16]f32, lower, upper: f32) -> (result: int) {
// Counts the number of values in `src` that are in the range `[lower, upper].`
for v in src {
mask := simd.lanes_ge(v, auto_cast lower) & simd.lanes_le(v, auto_cast upper)
result += card(simd.extract_msbs(mask))
}
return
}
filter_range :: proc (dst: ^[dynamic]f32, src: []#simd[16]f32, lower, upper: f32) {
// Appends to dst the values of src that are in the range [lower, upper].
for v in src {
mask := simd.lanes_ge(v, auto_cast lower) & simd.lanes_le(v, auto_cast upper)
old_len := len(dst)
non_zero_resize(dst, old_len + card(simd.extract_msbs(mask)))
This use case *can* also be done via `-simd_reduce_add_ordered(mask)`, though I find using `extract_msbs` to be clearer on intent.
2. Identifying which elements matched, e.g. for searching or looping over them. Examples:
```go
first_range :: proc (src: []#simd[16]f32, lower, upper: f32) -> (at: int, found: bool) {
// Returns the (flattened) index of the first value in `src` that is in the range `[lower, upper]`.
for v, i in src {
matches := simd.extract_msbs(simd.lanes_ge(v, auto_cast lower) & simd.lanes_le(v, auto_cast upper))
for j in matches {
at = 16*i + j
found = true
return
}
}
return
}
histogram_range :: proc (bins: []int, src: []#simd[16]f32, lower, upper: f32) {
// Generates histogram bin counts for the values in `src` in the range `[lower, upper)`.
// bins[0] will be the number of values that fall in the first 1/Nth of the range,
// bins[1] will be the number of values that fall in the second 1/Nth of the range, etc.
span := (upper - lower) / f32(len(bins))
for v in src {
matches := simd.extract_msbs(simd.lanes_ge(v, auto_cast lower) & simd.lanes_lt(v, auto_cast upper))
phases := (v - auto_cast lower) / auto_cast span
indices := cast(#simd[16]i32)phases
for j in matches {
// Can't be done in parallel due to potential intersection
i := int(simd.extract(indices, j))
bins[i] += 1
}
}
return
}
In more specific cases, the bit_set itself can also be directly useful for doing boolean logic en masse. Condensing the original checks to bit-masks allows the bit-masks themselves to be used in SIMD vectors to do large amounts of boolean logic at once.
This PR adds a
simd_extract_msbs
intrinsic, along with anextract_msbs
re-export tocore:simd
. This intrinsic extracts the most significant bit of each element of a#simd
vector and packs them into abit_set
. This behavior is similar to the SSE/AVXmovemask
intrinsic/movmsk
instruction. This intrinsic is defined similarly to:For example, this code:
will print:
since elements 0, 1, 4, and 7 have their most significant bits set (due to being negative).
This intrinsic is particularly useful in conjunction with lane-wise comparison masks, in a few particular use cases that I've found so far.
masked_compress_store
, for instance, wherecard(extract_msbs(v))
can be used to determine how many elements will be written. Examples:filter_range :: proc (dst: ^[dynamic]f32, src: []#simd[16]f32, lower, upper: f32) { // Appends to
dst
the values ofsrc
that are in the range[lower, upper].
for v in src { mask := simd.lanes_ge(v, auto_cast lower) & simd.lanes_le(v, auto_cast upper) old_len := len(dst) non_zero_resize(dst, old_len + card(simd.extract_msbs(mask)))no_bounds_check { simd.masked_compress_store(&dst[old_len], v, mask) }
}
bit_set
itself can also be directly useful for doing boolean logic en masse. Condensing the original checks to bit-masks allows the bit-masks themselves to be used in SIMD vectors to do large amounts of boolean logic at once.