Open bjoo opened 4 years ago
I am wondering about the constructors with a subset of elements. Since this is specifically for permute vectors, maybe a non-member function to create a permute vector would be better? I would actually prefer an interface where the permute would take a permute integer simd type of the same length as the simd value type, but I take it from you that that is too expansive since you have to convert it internally?
We need to think about what kind of interface would be acceptable to the C++ standard.
I can make a make_permute() function call. That is a nice way of solving that oddball constructor issue.
I do actually use simd<int, simd_abi::whatever> for the control, primarily for efficiency on GPUs where you want to spend maybe one register per mask, not 32. So it is filled ‘via’ simd_storage on the host where you give the full vector length permute. But the actual mask is a simd<int,…> like you desire.
We need to think about what kind of interface would be acceptable to the C++ standard.
Indeed, I never actually dared to think that far ahead. Have a look here:
for AVX512:
https://github.com/bjoo/simd-math-testing/blob/master/test/avx512-tests/test_simd_avx512_permute.cpp
and for e.g. CUDA:
https://github.com/bjoo/simd-math-testing/blob/master/test/cuda-tests/test_simd_cuda_permute.cpp
I’ll go add the make_permute() function…
Best, B
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.
Hi, as suggested by @crtrott I have modified the interface by adding make_permute
.
I have also updated the test codes to see how it could be used.
Actually another thing, now that this is kinda traits-y so we have
simd::simd_utils< simd<T,Abi>::make_permute(int mask[ simd<T,Abi>::size() ] )
there is nothing in principle to make the 'int mask' a compile time template parameter i.e:
simd::simd_utils<simd<T,Abi>::template make_permute<from0,from1,....fromN>(void)
where from0
is what would otherwise be mask[0],
from1
is what would otherwise be mask[1]
etc.
however for warp-sizes of 64, that could get unwieldy.
NB: SYCL has its own way of doing shuffles with its vec<T,N>
type which is templatized like this, but i) is not a warp parallel permute, and ii) currently has a maximum size of 16. There the shuffle operator returns a shuffled_vec<T,N>
type and the shuffle is not actually carried out until this is assigned to another vec<T,N>
. That approach could also be a way to go towards standardization too, tho what I have now is simple, and relatively portable. For SSE it would be annoying to implement the shuffle, as one would need to use _mm_shuffle_ps/pd
where the masks are immediates :( So probably we'd need to do something gross like 'make_permute'
returns the immediate, and then the permute would need some horrible switch statement with something like
switch( control_value ) { case 0 : _mm_shuffle_ps(reg1, reg1, 0x0; break; case 1 : _mm_shuffle_ps(reg1, reg1, 0x01; break; ... };
This gets pretty tedious to write. Ditto for AVX2 Double Prec shuffles (_mm256_permutevar_pd is in AVX512 :( )
Let me know if you need anything else added to this.
@crtrott @bjoo sorry for not seeing this earlier. @crtrott did @bjoo sufficiently address your feedback? If so I can look at merging.
Hi All, Here are the additions for vector lane permute. Intrinsics for AVX (Single Prec), AVX512 (single prec and double prec), Generic for other CPUs, __shfl_sync for CUDA, __shfl for HIP.
I also came accross an overflow possibility with the CUDA masks when N=32 (got compiler warning) so I made that go away.
A couple of FIXME's still left in there as well as a funny for AVX512 where the permutexvar intrinsic looks at every second element of the permutation index table register, and intevening elements are zeroed out, hence the funny 8 element constructor for the simd<int,> type. Maybe worth renaming these to something like simd_permute_control_t in the future.
This was rebased onto master after #13 was merged.
The testing harness is at: https://github.com/bjoo/simd-math-testing.git and shows how the permutes are being called.