Saturating casts for integers, like `_mm_packus_epi16` and `_mm_packs_epi16`

okaneco commented 1 year ago

My most common use case is for saturating narrowing casts, such as from i16 to u8. It would be great to take advantage of native instructions without first having to use simd_clamp. saturating_cast would reduce code and chances of errors from clamping with incorrect values.

I'm not sure about the availability of this feature on other architectures. It might make sense to split this for narrowing and widening features based on availability. I'm not familiar with other architectures, but it looks like ARM has narrowing casts with VQMOVN/VQMOVUN.

Perhaps there could be a fallback which does simd_clamp().cast() (or other optimal construction) for architectures that don't have these instructions.

calebzulawski commented 1 year ago

Is there a saturating cast for regular (scalar) integers? Not that that's a requirement, but it's a good guide as to what to include. I'm not opposed to adding it, though. It's worth noting that simd_clamp().cast() does the right thing on both x86-64 and aarch64: godbolt%0A++++.cast()%0A%7D%0A%0A//+vpackuswb+on+x86-64%0Apub+fn+saturate_i16u8(x:+i16x8)+-%3E+u8x8+%7B%0A++++x.simd_clamp(%0A++++++++i16x8::splat(u8::MIN+as+i16),%0A++++++++i16x8::splat(u8::MAX+as+i16),%0A++++)%0A++++.cast()%0A%7D%0A%0A//+uqxtn+on+aarch64%0Apub+fn+saturate_u16u8(x:+u16x8)+-%3E+u8x8+%7B%0A++++x.simd_clamp(%0A++++++++u16x8::splat(u8::MIN+as+u16),%0A++++++++u16x8::splat(u8::MAX+as+u16),%0A++++)%0A++++.cast()%0A%7D'),l:'5',n:'0',o:'Rust+source+%231',t:'0')),k:33.13583050149196,l:'4',m:100,n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:nightly,deviceViewOpen:'1',filters:(b:'0',binary:'1',binaryObject:'1',commentOnly:'0',debugCalls:'1',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1'),flagsViewOpen:'1',fontScale:14,fontUsePx:'0',j:2,lang:rust,libs:!(),options:'-C+opt-level%3D3+--target+x86_64-unknown-linux-gnu+-C+target-feature%3D%2Bavx2',overrides:!((name:edition,value:'2021')),selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'+rustc+nightly+(Editor+%231)',t:'0')),header:(),k:33.53083616517472,l:'4',m:100,n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:nightly,deviceViewOpen:'1',filters:(b:'0',binary:'1',binaryObject:'1',commentOnly:'0',debugCalls:'1',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1'),flagsViewOpen:'1',fontScale:14,fontUsePx:'0',j:1,lang:rust,libs:!(),options:'-C+opt-level%3D3+--target+aarch64-unknown-linux-gnu+-C+target-feature%3D%2Bneon',overrides:!((name:edition,value:'2021')),selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'+rustc+nightly+(Editor+%231)',t:'0')),k:33.33333333333333,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4)

okaneco commented 1 year ago

Not that I'm aware of in std but I wish scalars had this, as I need it a lot in contexts like image processing.

Part of what motivated this issue was that I was using i16x4/u8x4 on x86-64 with the base feature profile. I noticed that it didn't produce similar code to the i16x8 example without some coaxing (which seems to pessimize the aarch output). https://rust.godbolt.org/z/edxro99e3

calebzulawski commented 1 year ago

That's definitely an LLVM issue, I'm not sure why it only happens with the 4-length vectors. I opened the linked issue to track this. I reviewed the LLVM source and simd_clamp is the right way to indicate saturation:

/// Detect patterns of truncation with unsigned saturation:
///
/// 1. (truncate (umin (x, unsigned_max_of_dest_type)) to dest_type).
///   Return the source value x to be truncated or SDValue() if the pattern was
///   not matched.
///
/// 2. (truncate (smin (smax (x, C1), C2)) to dest_type),
///   where C1 >= 0 and C2 is unsigned max of destination type.
///
///    (truncate (smax (smin (x, C2), C1)) to dest_type)
///   where C1 >= 0, C2 is unsigned max of destination type and C1 <= C2.
///
///   These two patterns are equivalent to:
///   (truncate (umin (smax(x, C1), unsigned_max_of_dest_type)) to dest_type)
///   So return the smax(x, C1) value to be truncated or SDValue() if the
///   pattern was not matched.
static SDValue detectUSatPattern(SDValue In, EVT VT, SelectionDAG &DAG,
                                 const SDLoc &DL) {

okaneco commented 1 year ago

Thanks for that. It's good to know that what I saw was an exception with the extra instructions.

I've possibly come across another issue with 4-vectors and casting from i16x4 to u8x4. If I use a temporary array and manually unroll the cast into that destination, it produces vector instructions. https://rust.godbolt.org/z/WrKdjx9sa

Feel free to move this to its own issue if needed.

okaneco commented 8 months ago

Both optimization patterns have now been fixed upstream. The saturating truncating cast fix is in LLVM 18, available on nightly https://rust.godbolt.org/z/edxro99e3 The truncating cast issue was recently fixed and will probably be in LLVM 19 (above, https://rust.godbolt.org/z/WrKdjx9sa) https://llvm.godbolt.org/z/hTbnadsnj

rust-lang / portable-simd

Saturating casts for integers, like `_mm_packus_epi16` and `_mm_packs_epi16` #369