simd-everywhere / simde

Implementations of SIMD instruction sets for systems which don't natively support them.
https://simd-everywhere.github.io/blog/
MIT License

Question: Casting M128A to simde__m128 #694

Closed EvgeniySpinov closed 3 years ago

EvgeniySpinov commented 3 years ago

First of all, thank you for the great project!

I'm currently working on a static library that emulates particular SSE4.1 instructions whenever the main program throws the corresponding exception. I intercept that exception via VectoredExceptionHandler(). This is being done on the Win x64 architecture.

The issue is that the ContextRecord of that exception holds the registers (xmm in my case) as the M128A type, like so:

//
// Define 128-bit 16-byte aligned xmm register type.
//

typedef struct DECLSPEC_ALIGN(16) _M128A {
    ULONGLONG Low;
    LONGLONG High;
} M128A, *PM128A;

Question would be: is there a way to convert this structure into simde__m128?

Thank you.

nemequ commented 3 years ago

Sure, it's the same as with normal x86 intrinsics; just use simde_mm_load_si128 (or, if you're using native aliases, you can just use _mm_load_si128). SSE2 is in the baseline for x86_64, so SIMDe will really just call the real _mm_load_si128.

Since the data is already in an XMM register and compilers (even MSVC) are really good about optimizing away unnecessary loads, it's likely that this will all be optimized away and you won't even see a movdqa as long as the compiler can inline everything.

Of course this assumes you really meant simde__m128i not simde__m128. If you really want a simde__m128 you can use simde_mm_load_ps (or _mm_load_ps if native aliases are enabled), the idea is the same.

I'm closing this, but feel free to re-open if that didn't answer your question.

EvgeniySpinov commented 3 years ago

Thank you for your detailed reply, and sorry for the delayed response.

No matter how I try to use simde_mm_load_ps() (I need simde__m128 as the result) or _mm_load_ps(), I still get an error saying that the M128A structure is not a valid argument:

'simde__m128 simde_mm_load_ps(const simde_float32 [])': cannot convert argument 1 from 'M128A *' to 'const simde_float32 []' popcnt_hotpatch

Here is the code snippet from one of my last attempts:

DPPS(0x03A2C33, ctx->Xmm0, ctx->Xmm4, 0x7F, 6);

#define DPPS(offset, dest, src, mask, instr_size) \
    if (rip == g_imageBase + (offset)) { \
        simde__m128 register1 = simde_mm_load_ps(&dest); \
        simde__m128 register2 = simde_mm_load_ps(&src); \
        dest = simde_mm_dp_ps(register1, register2, mask); \
        ctx->Rip += (instr_size); \
        return EXCEPTION_CONTINUE_EXECUTION; \
    }

And I found absolutely no documentation on how to deal with an M128A kind of structure.

I'm still looking for a solution, but if you could give me a hint, that would be great.

P.S. There is no option to re-open this ticket.

nemequ commented 3 years ago

No problem, you just need to cast to simde_float32*:

DPPS(0x03A2C33, ctx->Xmm0, ctx->Xmm4, 0x7F, 6);

#define DPPS(offset, dest, src, mask, instr_size) \
    if (rip == g_imageBase + (offset)) { \
        simde__m128 register1 = simde_mm_load_ps((simde_float32*) &dest); \
        simde__m128 register2 = simde_mm_load_ps((simde_float32*) &src); \
        dest = simde_mm_dp_ps(register1, register2, mask); \
        ctx->Rip += (instr_size); \
        return EXCEPTION_CONTINUE_EXECUTION; \
    }

This is true for native aliases, too, though in that case you should probably just use float* instead of simde_float32*:

DPPS(0x03A2C33, ctx->Xmm0, ctx->Xmm4, 0x7F, 6);

#define DPPS(offset, dest, src, mask, instr_size) \
    if (rip == g_imageBase + (offset)) { \
        __m128 register1 = _mm_load_ps((float*) &dest); \
        __m128 register2 = _mm_load_ps((float*) &src); \
        dest = simde_mm_dp_ps(register1, register2, mask); \
        ctx->Rip += (instr_size); \
        return EXCEPTION_CONTINUE_EXECUTION; \
    }
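
Either way, the cast only changes how the same 16 bytes are viewed. As a rough sketch of that idea (not SIMDe code), the snippet below copies a register image into four float lanes with memcpy. `FakeM128A` and `fake_m128a_to_floats` are hypothetical names; the struct is a stand-in that mirrors the layout of M128A so the example compiles without the Windows headers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the Windows M128A type: same two 64-bit
 * members, but without the DECLSPEC_ALIGN(16) attribute. */
typedef struct {
    uint64_t Low;
    int64_t  High;
} FakeM128A;

/* View the 16 bytes of the register image as four 32-bit float lanes.
 * memcpy is used instead of a pointer cast so the sketch does not
 * depend on alignment or aliasing rules. */
static void fake_m128a_to_floats(const FakeM128A *reg, float lanes[4]) {
    assert(sizeof(*reg) == 4 * sizeof(float));
    memcpy(lanes, reg, sizeof(*reg));
}

/* The reverse direction: write four float lanes back into the image. */
static void floats_to_fake_m128a(const float lanes[4], FakeM128A *reg) {
    assert(sizeof(*reg) == 4 * sizeof(float));
    memcpy(reg, lanes, sizeof(*reg));
}
```

Unlike memcpy, simde_mm_load_ps expects a 16-byte aligned address. The real M128A is declared with DECLSPEC_ALIGN(16), so the cast in the macro above satisfies that requirement.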

EvgeniySpinov commented 3 years ago

Thank you for the prompt response and the direction. I had started looking into converting 32-bit aligned double-precision integers to arrays, which seems to have been the wrong direction.

A question about simde_mm_dp_ps(): if I use native functions like so:

        __m128 register1 = _mm_load_ps((float*) &dest); \
        __m128 register2 = _mm_load_ps((float*) &src); \
        dest = simde_mm_dp_ps((simde__m128 *) register1, (simde__m128 *) register2, mask); \

I'm apparently getting 2 __m128 variables, which I need to recast to simde__m128 if I want to use simde_mm_dp_ps(). That causes the error: "'type cast': cannot convert from '__m128' to 'simde__m128 *'"

If I don't recast, I get the same error as below.

If I try with the native function, which is supposed to work, like so:

        simde__m128 register1 = simde_mm_load_ps((simde_float32 *) &dest); \
        simde__m128 register2 = simde_mm_load_ps((simde_float32 *) &src); \
        dest = simde_mm_dp_ps(register1, register2, mask); \

I'm getting a weird error:

Error   C2679   binary '=': no operator found which takes a right-hand operand of type 'simde__m128' (or there is no acceptable conversion) app.cpp 92  
Message     could be '_M128A &_M128A::operator =(_M128A &&)'    C:\Program Files (x86)\Windows Kits\10\Include\10.0.18362.0\um\winnt.h  2581    
Message     or       '_M128A &_M128A::operator =(const _M128A &)'   C:\Program Files (x86)\Windows Kits\10\Include\10.0.18362.0\um\winnt.h  2581    
Message     while trying to match the argument list '(M128A, simde__m128)'  app.cpp 92  

This error occurs on the simde_mm_dp_ps() call.

I'm kind of puzzled. The first option, with native _mm_load_ps(), is the one I would like to proceed with, because I do not want to emulate SSE2 instructions.

So is there a way to convert __m128 to simde__m128? Or should I not do that, and instead use either only the native functions of SIMDe or only the ones native to the processor?

I would really appreciate your advice.

nemequ commented 3 years ago

I'm apparently getting 2 __m128 variables, which I need to recast to simde__m128 if I want to use simde_mm_dp_ps(). That causes the error: "'type cast': cannot convert from '__m128' to 'simde__m128 *'"

If SSE is enabled, simde__m128 will be a typedef to __m128. The problem here is that you are casting to simde__m128 * instead of simde__m128. You could just remove the * and it will work.

That said, there really isn't a good reason to be mixing simde_-prefixed functions with non-prefixed functions. Just because you're using simde_mm_load_ps doesn't mean you're emulating instructions. In practice, SIMDe does something like this:

simde__m128
simde_mm_load_ps(simde_float32* mem_addr) {
  #if defined(__SSE__)
    return _mm_load_ps(mem_addr);
  #else
    // portable version
  #endif
}
#if defined(SIMDE_ENABLE_NATIVE_ALIASES) && !defined(SIMDE_X86_SSE2_NATIVE)
  #define _mm_load_ps(mem_addr) simde_mm_load_ps(mem_addr)
#endif

So if you call simde_mm_load_ps on a system which supports SSE, SIMDe will just call _mm_load_ps. There is no overhead here from SIMDe; even with the most basic optimizations enabled in the compiler everything gets inlined and the exact same code is generated. The same goes for simde_mm_dp_ps: https://godbolt.org/z/vGboPf

Since you're writing new code I would suggest just sticking with the prefixed versions. Native aliases are really great for when you're porting existing code and want to minimize patches, but there are several corner cases where they can cause problems so it's really better to just call the prefixed functions.

If I try with the native function, which is supposed to work, like so:

That's probably because you're trying to assign a simde__m128 (which is returned by the simde_mm_dp_ps call) to a _M128A, which is not the same thing as a __m128. The story here is basically the same as with simde_mm_load_ps, but you're going the other direction. For that, you should be using simde_mm_store_ps. Like simde_mm_load_ps, the function takes a pointer to 32-bit floats, so you'll need a cast. Try something like:

simde__m128 register3 = simde_mm_dp_ps(register1, register2, mask);
simde_mm_store_ps((simde_float32*) &dest, register3);

Just like with simde_mm_load_ps, the compiler is likely to optimize this away.
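
To check what should end up in dest, a scalar model of the load / dot product / store sequence can help. The sketch below is not SIMDe code: `Reg128`, `ref_dp_ps` and `emulate_dpps` are hypothetical names, and `Reg128` is a stand-in for M128A so the example builds without the Windows headers. The mask handling follows the DPPS instruction definition: bits 4 to 7 select which lane products are summed, and bits 0 to 3 select which destination lanes receive the sum (the others are zeroed).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for M128A: two 64-bit halves, 16 bytes total. */
typedef struct {
    uint64_t Low;
    int64_t  High;
} Reg128;

/* Scalar reference for the DPPS mask semantics. */
static void ref_dp_ps(const float a[4], const float b[4], int imm8, float out[4]) {
    float p[4];
    for (int i = 0; i < 4; i++) {
        /* Bits 4..7 of the mask decide which products take part. */
        p[i] = (imm8 & (0x10 << i)) ? a[i] * b[i] : 0.0f;
    }
    /* Pairwise summation, as in the instruction's pseudocode. */
    float sum = (p[0] + p[1]) + (p[2] + p[3]);
    for (int i = 0; i < 4; i++) {
        /* Bits 0..3 of the mask decide which lanes get the sum. */
        out[i] = (imm8 & (1 << i)) ? sum : 0.0f;
    }
}

/* Load both register images, compute, and store back into dest. */
static void emulate_dpps(Reg128 *dest, const Reg128 *src, int imm8) {
    float a[4], b[4], r[4];
    assert(sizeof(*dest) == sizeof(a));
    memcpy(a, dest, sizeof(a));   /* plays the role of the load  */
    memcpy(b, src, sizeof(b));
    ref_dp_ps(a, b, imm8, r);
    memcpy(dest, r, sizeof(r));   /* plays the role of the store */
}
```

With the mask 0x7F from the macro above, lanes 0 to 2 are multiplied and summed, and the sum is written to all four lanes of dest.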

So is there a way to convert __m128 to simde__m128?

There isn't, because if __m128 exists then simde__m128 is just a typedef to __m128, meaning they are exactly the same type and no conversion function is necessary. They are only different when SSE is not supported natively, in which case there can't be a conversion function because there is no __m128 (unless you're using native aliases, in which case __m128 will be a typedef to simde__m128 and you can again use them interchangeably).

Basically, it works a bit like this:

#if defined(__SSE__)
  typedef __m128 simde__m128;
#else
  typedef struct { /* ... */ } simde__m128;
#endif

#if defined(SIMDE_ENABLE_NATIVE_ALIASES) && !defined(__SSE__)
  typedef simde__m128 __m128;
#endif

Or should I not do that, and instead use either only the native functions of SIMDe or only the ones native to the processor?

Yes, use one or the other. Just remember that if you want to use the unprefixed versions (i.e., _mm_dp_ps instead of simde_mm_dp_ps) you need to define SIMDE_ENABLE_NATIVE_ALIASES before including SIMDe. For example:

#define SIMDE_ENABLE_NATIVE_ALIASES
#include <simde/x86/>

EvgeniySpinov commented 3 years ago

That is a very detailed explanation. Thank you: I was able to understand the mechanics and to successfully build the code.

I have an issue with the simde_mm_store_ps() call, however: it seems that it's not possible to update registers that way from a vectored exception handler. I can see the updated value in the register (using store() and load() calls), but on return to the program after the exception, the register seems to be overwritten with the value that was stacked before the exception occurred, and my updates to it are ignored.

If you have faced this issue before and know a way to resolve it, I would really appreciate your help.

nemequ commented 3 years ago

I have an issue with the simde_mm_store_ps() call, however: it seems that it's not possible to update registers that way from a vectored exception handler. I can see the updated value in the register (using store() and load() calls), but on return to the program after the exception, the register seems to be overwritten with the value that was stacked before the exception occurred, and my updates to it are ignored.

Unfortunately I don't think I'll be able to help much here; at this point I think what you have is more of a VectoredExceptionHandler question than a SIMDe question, and I'm really not familiar with that API.

All simde_mm_store_ps is doing is copying data; it's basically like memcpy with a fixed size where the destination is aligned to a 16-byte boundary.

If I were you I would put together a quick test case which doesn't involve SIMDe; the M128A struct has two 64-bit members, so the easy thing would be something which operates on 64-bit lanes or a bitwise operation. Something like

dest.Low += src.Low;
dest.High += src.High;

Then try to get that returning the value you expect; it should be simple enough that you could post the complete example on MSDN or StackOverflow if you need help. Once you have that it should be pretty straightforward to swap out the addition (or whatever) with what you really want, including a SIMDe call. I'm happy to help if you have trouble at that point, I'm just not a good person to ask about the Windows API.
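
A SIMDe-free test along those lines might look like the sketch below. `FakeM128A`, `FakeContext` and `add_lanes` are hypothetical stand-ins so that it compiles without the Windows headers; in a real handler the registers and Rip would come from the exception's ContextRecord, and the function would return EXCEPTION_CONTINUE_EXECUTION.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for M128A. */
typedef struct {
    uint64_t Low;
    int64_t  High;
} FakeM128A;

/* Hypothetical stand-in for the few CONTEXT fields used here. */
typedef struct {
    FakeM128A Xmm0;
    FakeM128A Xmm4;
    uint64_t  Rip;
} FakeContext;

/* Add the 64-bit halves of Xmm4 into Xmm0, then skip the faulting
 * instruction. Everything is written through the ctx pointer, so the
 * caller's copy of the context is what gets modified. */
static void add_lanes(FakeContext *ctx, uint64_t instr_size) {
    assert(ctx != 0);
    ctx->Xmm0.Low  += ctx->Xmm4.Low;
    ctx->Xmm0.High += ctx->Xmm4.High;
    ctx->Rip       += instr_size;
}
```

If the real handler shows the expected values after this kind of update, swapping the additions for the SIMDe load, compute and store sequence is a small step.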

EvgeniySpinov commented 3 years ago

Fair enough. Thank you for the help and time dedicated, I really appreciate your help.