kokkos / simd-math

Library for length agnostic SIMD intrinsic support and the corresponding math operations
Other
20 stars 10 forks source link

Feature/permute rebase13 #18

Open bjoo opened 4 years ago

bjoo commented 4 years ago

Hi All, Here are the additions for vector lane permute. Intrinsics for AVX (Single Prec), AVX512 (single prec and double prec), Generic for other CPUs, __shfl_sync for CUDA, __shfl for HIP.

I also came accross an overflow possibility with the CUDA masks when N=32 (got compiler warning) so I made that go away.

A couple of FIXME's still left in there as well as a funny for AVX512 where the permutexvar intrinsic looks at every second element of the permutation index table register, and intevening elements are zeroed out, hence the funny 8 element constructor for the simd<int,> type. Maybe worth renaming these to something like simd_permute_control_t in the future.

This was rebased onto master after #13 was merged.

The testing harness is at: https://github.com/bjoo/simd-math-testing.git and shows how the permutes are being called.

crtrott commented 4 years ago

I am wondering about the constructors with a subset of elements. Since this is specifically for permute vectors, maybe a non-member function to create a permute vector would be better? I would actually prefer an interface where the permute would take a permute integer simd type of the same length as the simd value type, but I take it from you that that is too expansive since you have to convert it internally?

We need to think about what kind of interface would be acceptable to the C++ standard.

bjoo commented 4 years ago

I can make a make_permute() function call. That is a nice way of solving that oddball constructor issue.

I do actually use simd<int, simd_abi::whatever> for the control, primarily for efficiency on GPUs where you want to spend maybe one register per mask, not 32. So it is filled ‘via’ simd_storage on the host where you give the full vector length permute. But the actual mask is a simd<int,…> like you desire.

We need to think about what kind of interface would be acceptable to the C++ standard.

Indeed, I never actually dared to think that far ahead. Have a look here:

for AVX512:

https://github.com/bjoo/simd-math-testing/blob/master/test/avx512-tests/test_simd_avx512_permute.cpp

and for e.g. CUDA:

https://github.com/bjoo/simd-math-testing/blob/master/test/cuda-tests/test_simd_cuda_permute.cpp

I’ll go add the make_permute() function…

Best, B

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.


Dr Balint Joo High Performance Computational Scientist Jefferson Lab 12000 Jefferson Ave Suite 3, MS 12B2, Room F217 Newport News VA 23606, USA Tel: +1-757-269-5339 email: bjoo AT jlab.org

bjoo commented 4 years ago

Hi, as suggested by @crtrott I have modified the interface by adding make_permute. I have also updated the test codes to see how it could be used.

Actually another thing, now that this is kinda traits-y so we have

simd::simd_utils< simd<T,Abi>::make_permute(int mask[ simd<T,Abi>::size() ] )

there is nothing in principle to make the 'int mask' a compile time template parameter i.e:

simd::simd_utils<simd<T,Abi>::template make_permute<from0,from1,....fromN>(void)

where from0 is what would otherwise be mask[0], from1 is what would otherwise be mask[1] etc.

however for warp-sizes of 64, that could get unwieldy.

NB: SYCL has its own way of doing shuffles with its vec<T,N> type which is templatized like this, but i) is not a warp parallel permute, and ii) currently has a maximum size of 16. There the shuffle operator returns a shuffled_vec<T,N> type and the shuffle is not actually carried out until this is assigned to another vec<T,N>. That approach could also be a way to go towards standardization too, tho what I have now is simple, and relatively portable. For SSE it would be annoying to implement the shuffle, as one would need to use _mm_shuffle_ps/pd where the masks are immediates :( So probably we'd need to do something gross like 'make_permute' returns the immediate, and then the permute would need some horrible switch statement with something like

switch( control_value ) { case 0 : _mm_shuffle_ps(reg1, reg1, 0x0; break; case 1 : _mm_shuffle_ps(reg1, reg1, 0x01; break; ... };

This gets pretty tedious to write. Ditto for AVX2 Double Prec shuffles (_mm256_permutevar_pd is in AVX512 :( )

Let me know if you need anything else added to this.

ibaned commented 4 years ago

@crtrott @bjoo sorry for not seeing this earlier. @crtrott did @bjoo sufficiently address your feedback? If so I can look at merging.