Fix lane offsets for AVX2 pack instructions

rust-lang / rustc_codegen_cranelift

Cranelift based backend for rustc

Apache License 2.0

1.52k stars 94 forks source link

Fix lane offsets for AVX2 pack instructions #1442

Closed Nilstrieb closed 6 months ago

Nilstrieb commented 6 months ago

fast_image_resize yielded broken images, a little bit of println bisecting revealed the SIMD instruction that was at fault. A bit of staring at the cg_clif impl and the Intel manual then revealed the place of the bug. There is a lot of copy pasting here, so I'm not surprised it's buggy ^^'.

Nilstrieb commented 6 months ago

I'm not sure where tests for this are supposed to go. stdarch tests?

Nilstrieb commented 6 months ago

The duplicated code for these packs does make me worry a bit. After going through the intrinsics guide, I also found some packs that weren't implemented yet. I think I'm going to restructure the code here so that the packs are neatly packed together, with all of _mm{,256}_pack{u,us}_epi{16,32} implemented.

bjorn3 commented 6 months ago

I'm not sure where tests for this are supposed to go.

I've been copying stdarch tests into example/std_example.rs several times.

There is a lot of copy pasting here, so I'm not surprised it's buggy ^^'.

Yeah, this code is horrible. I hope to some day generate it directly from the instruction manual or something like that. Or create a DSL that allows writing this kind of stuff with less code duplication (and maybe also allows it to be reused by miri and other tools).

bjorn3 commented 6 months ago

Thanks for the fix! Please ignore the test failure. That is https://github.com/rust-random/rand/issues/1355.

Nilstrieb commented 6 months ago

What's currently implemented vs what exists:		sse 16	avx 16	sse 32
unsigned	_mm_packus_epi16\|llvm.x86.sse2.packuswb.128 ✅	_mm256_packus_epi16\|llvm.x86.avx2.packuswb ✅	_mm_packus_epi32\|llvm.x86.sse41.packusdw ✅	_mm256_packus_epi32\|llvm.x86.avx2.packusdw
signed	_mm_packs_epi16\|llvm.x86.sse2.packsswb.128	_mm256_packs_epi16\|llvm.x86.avx2.packsswb	_mm_packs_epi32\|llvm.x86.sse2.packssdw.128 ✅	_mm256_packs_epi32\|llvm.x86.avx2.packssdw ✅

I'll clean it up a bit and implement all of those based on that, should be fairly little code. llvm.x86.sse41.packusdw is also pretty suspicious as it currently uses smin, while the other unsigned ones use umin.

bjorn3 commented 6 months ago

llvm.x86.sse41.packusdw is also pretty suspicious as it currently uses smin, while the other unsigned ones use umin.

Smin is correct here afaict. The input is a signed 32bit integer and we need to check that it fits in an unsigned 16bit integer. Using umin would cause the input to be interpreted as unsigned 32bit integer. Although because of the smax before it, I think it does actually not matter at all if umin or smin is used.

bjorn3 commented 6 months ago

In any case having a helper function for doing the saturating equivalent of ireduce as is done here would be nice to have. It can probably go in num.rs or cast.rs.

Nilstrieb commented 6 months ago

I created #1443 to restructure all the packed code.

Nilstrieb commented 6 months ago

closing in favor of #1443

What's currently implemented vs what exists:		sse 16	avx 16	sse 32
unsigned	_mm_packus_epi16\|llvm.x86.sse2.packuswb.128 ✅	_mm256_packus_epi16\|llvm.x86.avx2.packuswb ✅	_mm_packus_epi32\|llvm.x86.sse41.packusdw ✅	_mm256_packus_epi32\|llvm.x86.avx2.packusdw
signed	_mm_packs_epi16\|llvm.x86.sse2.packsswb.128	_mm256_packs_epi16\|llvm.x86.avx2.packsswb	_mm_packs_epi32\|llvm.x86.sse2.packssdw.128 ✅	_mm256_packs_epi32\|llvm.x86.avx2.packssdw ✅