Finish NEON implementations for Intel instruction sets

nemequ commented 4 years ago

I recently finished up the NEON implementations for MMX and SSE; SSE2 is next on the list.

There are already a lot of NEON implementations of SSE2 functions (thanks largely to the code I stole from sse2neon), but I'd like to finish the missing ones.

AnkitRai-22 commented 4 years ago

I want to work on this issue as a GSoC project proposal under the Open Bioinformatics Foundation.

AnkitRai-22 commented 4 years ago

Should I mail you my proposal now, or should I wait until proposal submission opens (16th March 2020)?

nemequ commented 4 years ago

The earlier the better, that way there is time to provide feedback. Note that SSE2 is basically done, so this would really be for later extensions, and I'm not sure there are enough remaining functions for this to be a complete task. Another ISA extension may be better suited.

One interesting idea could be a portable implementation of the 128-bit functions in SVML and NEON implementations of those functions.

nemequ commented 4 years ago

Moving the goalposts a bit… SSE2 is mostly done. I may have missed a few functions, but for the most part the functions still missing NEON implementations are ones I can't think of a way to implement in NEON that would be faster than the portable implementations.

That said, the later instruction sets (SSE3, SSSE3, SSE4.1, AVX, AVX2, AVX-512*) need the same treatment, so I'm widening the scope of the issue. Obviously this doesn't all need to be done at once, but eventually it would be great to have NEON implementations for as many functions as possible.

mcelhennyi commented 4 years ago

@nemequ So is the suggestion for the time being to use the SSE2 version in the code to support both the x86 side as well as the arm side fully?

I am wondering if I should use the 256bit impl vs the 128bit impl for the sets of functions I need to make SPTAG - DistanceUtils.h work on both ARM and x86 well.

Is there an easy way to check if the ARM/NEON mapping exists for a given call? Or is it such that all calls have some sort of mapping from AVX2/SSE2 to an ARM-compatible call, it just may not be NEON-based?

Forgive the ignorant questions, this is only my second day looking into SIMD/NEON/SSE2/AVX2/SIMDe related stuff :)

nemequ commented 4 years ago

> @nemequ So is the suggestion for the time being to use the SSE2 version in the code to support both the x86 side as well as the arm side fully?
>
> I am wondering if I should use the 256bit impl vs the 128bit impl for the sets of functions I need to make SPTAG - DistanceUtils.h work on both ARM and x86 well.

Honestly it could go either way. If possible, I'd recommend trying both to see what is faster.

A lot of the 256-bit functions fall back on just calling the 128-bit versions twice, so optimizing the 128-bit functions often gives us optimized 256-bit functions for free. The ones that don't can generally be optimized by the compiler, and we don't want to get in the way since it's possible the compiler can do an even better job than our explicit implementations. For example, if your CPU supports ARM SVE it's possible that the compiler can use those instructions for 256/512-bit operations and the result would be significantly faster than if we had forced it to use NEON.
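To make the "call the 128-bit version twice" idea concrete, here is a minimal sketch in plain C. The types `vec128_i32` and `vec256_i32` and the functions `add128_i32` and `add256_i32` are hypothetical names invented for illustration; they are not SIMDe's actual types or API. The only point is that any optimization applied to the 128-bit routine is inherited by the 256-bit one.

```c
#include <stdint.h>

/* Hypothetical 128-bit vector: four 32-bit lanes (not SIMDe's real type). */
typedef struct { int32_t lane[4]; } vec128_i32;

/* Hypothetical 256-bit vector stored as two 128-bit halves. */
typedef struct { vec128_i32 half[2]; } vec256_i32;

/* 128-bit lane-wise add. This is where an optimized (e.g. NEON) path
 * would live; here it is a plain loop the compiler may vectorize. */
static vec128_i32 add128_i32(vec128_i32 a, vec128_i32 b) {
  vec128_i32 r;
  for (int i = 0; i < 4; i++) {
    /* Add as unsigned so overflow wraps instead of being undefined. */
    r.lane[i] = (int32_t)((uint32_t)a.lane[i] + (uint32_t)b.lane[i]);
  }
  return r;
}

/* 256-bit add implemented by calling the 128-bit version on each half. */
static vec256_i32 add256_i32(vec256_i32 a, vec256_i32 b) {
  vec256_i32 r;
  r.half[0] = add128_i32(a.half[0], b.half[0]);
  r.half[1] = add128_i32(a.half[1], b.half[1]);
  return r;
}
```

With this layout, swapping the loop in `add128_i32` for a faster implementation speeds up `add256_i32` without touching it.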

There are probably some missed opportunities to reuse 128-bit optimizations which would make the 256-bit versions slower, but there may also be some places where 256-bit versions can make optimizations not possible for 128-bit versions.

Sorry, there are a lot of "it's complicated" answers when dealing with this type of stuff. Luckily it should be possible to get both versions working (as long as SIMDe supports all the necessary functions) so you can benchmark.
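As a rough illustration of "get both versions working and benchmark", the sketch below times two variants of a squared L2 distance. Everything here is an assumption made for illustration: `l2sq_w4` and `l2sq_w8` are scalar stand-ins for a 128-bit and a 256-bit kernel (they are not taken from SPTAG or SIMDe), and `time_kernel` is a hypothetical helper built on the standard `clock()` function. In a real comparison you would replace the two kernels with your actual 128-bit and 256-bit code paths.

```c
#include <stddef.h>
#include <time.h>

typedef float (*dist_fn)(const float *a, const float *b, size_t n);

/* Squared L2 distance, 4 lanes per step (stand-in for a 128-bit kernel). */
static float l2sq_w4(const float *a, const float *b, size_t n) {
  float acc[4] = {0};
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    for (size_t j = 0; j < 4; j++) {
      float d = a[i + j] - b[i + j];
      acc[j] += d * d;
    }
  }
  float sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);
  for (; i < n; i++) { /* leftover elements */
    float d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

/* Same computation, 8 lanes per step (stand-in for a 256-bit kernel). */
static float l2sq_w8(const float *a, const float *b, size_t n) {
  float acc[8] = {0};
  size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    for (size_t j = 0; j < 8; j++) {
      float d = a[i + j] - b[i + j];
      acc[j] += d * d;
    }
  }
  float sum = 0.0f;
  for (size_t j = 0; j < 8; j++) sum += acc[j];
  for (; i < n; i++) { /* leftover elements */
    float d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

/* Runs fn `reps` times; returns elapsed CPU seconds and stores the
 * last result in *out so the calls cannot be optimized away. */
static double time_kernel(dist_fn fn, const float *a, const float *b,
                          size_t n, int reps, float *out) {
  volatile float sink = 0.0f;
  clock_t t0 = clock();
  for (int r = 0; r < reps; r++) sink = fn(a, b, n);
  clock_t t1 = clock();
  *out = sink;
  return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Call `time_kernel` once per variant on the same inputs, check that the two results agree, and compare the returned times. Run on the actual target hardware with the optimization flags you ship with, since the faster variant can differ between CPUs and compilers.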

> Is there an easy way to check if the ARM/NEON mapping exists for a given call? Or is it such that all calls have some sort of mapping from AVX2/SSE2 to an ARM-compatible call, it just may not be NEON-based?

Well, all functions have a portable mapping that may or may not generate NEON instructions. I just added an entry to the FAQ to answer this: "Is it possible to tell if my code is using an unoptimized implementation?"

Basically, if you're using a good compiler it can automatically vectorize the portable implementations. Since you're targeting ARM you're probably not using MSVC, so odds are decent you'll get pretty fast code even without an explicit NEON version, but you should profile the result and take a close look at the top couple of functions in the profile.
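The general shape of "an explicit NEON path where one exists, otherwise a portable loop the compiler can vectorize" can be sketched as follows. This is an illustration only, not SIMDe's actual source: `add_i32x4` and `add_i32x4_uses_neon` are hypothetical names, and the sketch assumes a GCC- or Clang-style compiler that defines `__ARM_NEON` when NEON is available. The NEON intrinsics used (`vld1q_s32`, `vaddq_s32`, `vst1q_s32`) are the standard ones from `<arm_neon.h>`.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Lane-wise add of four 32-bit integers (hypothetical helper). */
static void add_i32x4(int32_t r[4], const int32_t a[4], const int32_t b[4]) {
#if defined(__ARM_NEON)
  /* Explicit NEON path: load, add, store. */
  vst1q_s32(r, vaddq_s32(vld1q_s32(a), vld1q_s32(b)));
#else
  /* Portable path: a simple loop that a good compiler can
   * auto-vectorize for whatever target it is building for. */
  for (int i = 0; i < 4; i++) {
    r[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
  }
#endif
}

/* Reports at run time which path was compiled in (1 = NEON, 0 = portable). */
static int add_i32x4_uses_neon(void) {
#if defined(__ARM_NEON)
  return 1;
#else
  return 0;
#endif
}
```

Both paths produce the same result; only the generated instructions differ, which is why profiling the built binary is the reliable way to see whether a particular function matters.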

> Forgive the ignorant questions, this is only my second day looking into SIMD/NEON/SSE2/AVX2/SIMDe related stuff :)

Not at all, these are both good questions.

mr-c commented 4 years ago

I'm refreshing from https://github.com/soedinglab/MMseqs2/blob/128f57b5d7a16ae57a7f7b497224f951936a0b35/lib/simd/sse2neon.h in https://github.com/nemequ/simde/compare/sse2neon_refresh , but there will be plenty still to go after this is done

simd-everywhere / simde

Finish NEON implementations for Intel instruction sets #73