odin-lang / Odin

Odin Programming Language
https://odin-lang.org
BSD 3-Clause "New" or "Revised" License
7.01k stars 621 forks source link

SIMD Extract MSbs Intrinsic #4466

Open Barinzaya opened 2 weeks ago

Barinzaya commented 2 weeks ago

This PR adds a simd_extract_msbs intrinsic, along with an extract_msbs re-export to core:simd. This intrinsic extracts the most significant bit of each element of a #simd vector and packs them into a bit_set. This behavior is similar to the SSE/AVX movemask intrinsic/movmsk instruction. This intrinsic is defined similarly to:

simd_extract_msbs :: proc(a: #simd[N]T) -> bit_set[0..<N] where type_is_integer(T) || type_is_boolean(T)

For example, this code:

v := #simd[8]i32 { -1, -2, +3, +4, -5, +6, +7, -8 }
fmt.println(simd.extract_msbs(v))

will print:

bit_set[0..=7]{0, 1, 4, 7}

since elements 0, 1, 4, and 7 have their most significant bits set (due to being negative).

This intrinsic is particularly useful in conjunction with lane-wise comparison masks, in a few particular use cases that I've found so far.

  1. Counting matches. This can be useful on its own or in conjunction with masked_compress_store, for instance, where card(extract_msbs(v)) can be used to determine how many elements will be written. Examples:
    
    count_range :: proc (src: []#simd[16]f32, lower, upper: f32) -> (result: int) {
    // Counts the number of values in `src` that are in the range `[lower, upper].`
    for v in src {
        mask := simd.lanes_ge(v, auto_cast lower) & simd.lanes_le(v, auto_cast upper)
        result += card(simd.extract_msbs(mask))
    }
    return
    }

filter_range :: proc (dst: ^[dynamic]f32, src: []#simd[16]f32, lower, upper: f32) { // Appends to dst the values of src that are in the range [lower, upper]. for v in src { mask := simd.lanes_ge(v, auto_cast lower) & simd.lanes_le(v, auto_cast upper) old_len := len(dst) non_zero_resize(dst, old_len + card(simd.extract_msbs(mask)))

no_bounds_check { simd.masked_compress_store(&dst[old_len], v, mask) }

}

}

This use case *can* also be done via `-simd_reduce_add_ordered(mask)`, though I find using `extract_msbs` to be clearer on intent.

2. Identifying which elements matched, e.g. for searching or looping over them. Examples:
```go
first_range :: proc (src: []#simd[16]f32, lower, upper: f32) -> (at: int, found: bool) {
    // Returns the (flattened) index of the first value in `src` that is in the range `[lower, upper]`.
    for v, i in src {
        matches := simd.extract_msbs(simd.lanes_ge(v, auto_cast lower) & simd.lanes_le(v, auto_cast upper))
        for j in matches {
            at = 16*i + j
            found = true
            return
        }
    }
    return
}

histogram_range :: proc (bins: []int, src: []#simd[16]f32, lower, upper: f32) {
    // Generates histogram bin counts for the values in `src` in the range `[lower, upper)`.
    // bins[0] will be the number of values that fall in the first 1/Nth of the range,
    // bins[1] will be the number of values that fall in the second 1/Nth of the range, etc.
    span := (upper - lower) / f32(len(bins))
    for v in src {
        matches := simd.extract_msbs(simd.lanes_ge(v, auto_cast lower) & simd.lanes_lt(v, auto_cast upper))

        phases := (v - auto_cast lower) / auto_cast span
        indices := cast(#simd[16]i32)phases

        for j in matches {
            // Can't be done in parallel due to potential intersection
            i := int(simd.extract(indices, j))
            bins[i] += 1
        }
    }
    return
}
  1. In more specific cases, the bit_set itself can also be directly useful for doing boolean logic en masse. Condensing the original checks to bit-masks allows the bit-masks themselves to be used in SIMD vectors to do large amounts of boolean logic at once.