jdelauney / SIMD-VectorMath-UnitTest

For testing asm SIMD (SSE/SSE 2/SSE 3/SSE 4.x / AVX /AVX 2) vector math library (2f, 4f, matrix, quaternion...) with Lazarus and FreePascal Compiler
Mozilla Public License 2.0

SIGSEGV with USE_ASM_SSE_3 #19

Closed jdelauney closed 6 years ago

jdelauney commented 6 years ago

Strange behaviour: in Vector2f and Vector4f, the functions Max, Clamp and ClampSingle raise a SIGSEGV and drop straight into the debugger at: 0000000100212797 ff9230010000 callq *0x130(%rdx)

??????

The tests work with USE_ASM_SSE!

Very strange, and I haven't changed anything since you last checked the Win64 code.

jdelauney commented 6 years ago

Is it due to the extracfg file?

jdelauney commented 6 years ago

No, it's an alignment problem, but why?

jdelauney commented 6 years ago

I've traced the first SIGSEGV to TGLZVector4i: in all operators the result is not aligned! Changing movdqa [result],xmm0 to movdqu [result],xmm0 makes everything work again. It's very strange: with USE_ASM_SSE everything works fine, but when I set USE_ASM_SSE_3 everything breaks! Do you have a clue, Peter? Why does result become unaligned?

jdelauney commented 6 years ago

OK, I just checked: USE_ASM_SSE does not activate USE_ASM, so it was running the NATIVE code.

And now all the SSE code is broken due to this misalignment. It worked before, so why isn't it working now? It's a mystery to me.

No clue in the .s file:

.section .text.n_glzvectormath$_$tglzvector4i_$__$$_plus$tglzvector4i$tglzvector4i$$tglzvector4i,"x"
    .balign 16,0x90
.globl  GLZVECTORMATH$_$TGLZVECTOR4I_$__$$_plus$TGLZVECTOR4I$TGLZVECTOR4I$$TGLZVECTOR4I
GLZVECTORMATH$_$TGLZVECTOR4I_$__$$_plus$TGLZVECTOR4I$TGLZVECTOR4I$$TGLZVECTOR4I:
# Var A located in register rdx
# Var B located in register r8
# Var $result located in register rcx
# [vectormath_vector4i_win64_sse_imp.inc]
# [4] asm
    # Register rax,rcx,rdx,r8,r9,r10,r11 allocated
# [5] movdqa xmm0,[A]
    movdqa  (%rdx),%xmm0
# [9] movaps xmm1,[B]
    movaps  (%r8),%xmm1
# [10] paddd  xmm0, xmm1
    paddd   %xmm1,%xmm0
# [12] movdqa [RESULT], xmm0
    movdqa  %xmm0,(%rcx)
    # Register rax,rcx,rdx,r8,r9,r10,r11 released
# [13] end;
    ret

After changing movdqa to movdqu, I now get a SIGSEGV on this line: 00000001000720B4 410f2808 movaps (%r8),%xmm1

It's completely silly, I don't understand why the vars are unaligned.

dicepd commented 6 years ago

I will break out the win64 box and have a look

jdelauney commented 6 years ago

I confirm that A, B and RESULT are all unaligned.

jdelauney commented 6 years ago

I think a $CODEALIGN somewhere is the cause.

jdelauney commented 6 years ago

OK, I've added {$CODEALIGN LOCALMIN=16} {$CODEALIGN CONSTMIN=16}
at the top of Vector4iFunctionalTest and it works now. I'll need to check the other test units. But now many tests are broken (PS: I'm not using FastMath for my tests).

dicepd commented 6 years ago

What was happening without those: result pointed to the stack and the caller retrieved the result from the stack. The stack was not aligned; I think it is LOCALMIN that aligns the stack.
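
A minimal sketch of how that alignment could be checked from Pascal, assuming a stand-alone FPC program. TVec4i, IsAligned16 and Probe are hypothetical names used only for illustration, not part of the library. It prints whether one local variable sits on a 16-byte boundary with LOCALMIN=16 in effect, which is what movdqa and movaps require of a memory operand:

```pascal
program AlignCheck;
{$mode objfpc}
{$CODEALIGN LOCALMIN=16}  // the directive discussed above

type
  // Hypothetical stand-in for a 4 x 32-bit integer vector.
  TVec4i = record
    X, Y, Z, W: LongInt;
  end;

// True when the address is a multiple of 16 bytes.
function IsAligned16(P: Pointer): Boolean;
begin
  Result := (PtrUInt(P) and 15) = 0;
end;

// Hypothetical probe: reports the alignment of one local variable.
procedure Probe;
var
  V: TVec4i;
begin
  V.X := 0;
  WriteLn('local aligned to 16 bytes: ', IsAligned16(@V));
end;

begin
  Probe;
end.
```

Removing the directive and running it again is a quick way to see whether the compiler still happens to align that local on a given target; the result may differ between Win64 and Unix64.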

jdelauney commented 6 years ago

Ok the final result for TGLZVector2 with

for TGLZVector4i with

All other tests are green.

So I must compare your Unix code with my Win64 code.

I don't understand why everything was green last time.

dicepd commented 6 years ago

I get lots of failures for SSE3 too.

dicepd commented 6 years ago

I have just been grinding out tests for Pascal and not checking the asm.

dicepd commented 6 years ago

These are new tests to test all possibilities (at least the ones I can think of) so I was kinda expecting some breaks.

dicepd commented 6 years ago

Div and DivInt are not truncating in the asm, they are rounding. Pascal truncates Div.
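
A small sketch of the difference, in plain Pascal rather than asm. DivTrunc and DivRound are hypothetical helper names. The assumption here is that the asm divides as floats and converts back with a rounding conversion (on SSE2, cvtps2dq rounds according to MXCSR, round-to-nearest by default, while cvttps2dq truncates); whether the library uses exactly those instructions is not confirmed by this thread:

```pascal
program TruncVsRound;
{$mode objfpc}

// Reference behaviour: Pascal's integer division truncates toward zero.
function DivTrunc(A, B: LongInt): LongInt;
begin
  Result := A div B;
end;

// What a round-to-nearest float-to-int conversion yields instead.
function DivRound(A, B: LongInt): LongInt;
begin
  Result := Round(A / B);
end;

begin
  WriteLn(DivTrunc(7, 4), ' ', DivRound(7, 4));    // 1 2
  WriteLn(DivTrunc(-7, 4), ' ', DivRound(-7, 4));  // -1 -2
end.
```

Any quotient whose fractional part is above one half shows the mismatch, so test vectors such as 7 / 4 catch it while 8 / 4 would not.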

dicepd commented 6 years ago

<> should be set true on non-zero, not zero.

jdelauney commented 6 years ago

For Combine, that's what you told me about in the other issue. I'll take a deeper look tonight.

Thanks

dicepd commented 6 years ago

It would seem my boring grinding out of these tests is not in vain or a waste of time then :)

jdelauney commented 6 years ago

That's clear, and without you I would never have done so many tests.

One thing I don't understand is why all tests are green under Unix64. The code for Unix64 and Win64 was the same!

Now, with the functional tests:

I've fixed Normalize in TGLZVector2f; all is OK now.

Next, TGLZVector4i.Div is OK; it just needed "Trunc" wrapped in push/pop RCX, because "Trunc" uses ECX and the result is in RCX.

Now three functions still don't work in TGLZVector4i

dicepd commented 6 years ago

Unix 64 was broken as well after the 4i tests.

Abs mask only works for singles, integer negate needs 2s complement.
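
To illustrate the point, a sketch in plain Pascal with hypothetical helper names (AbsBySignMask, AbsTwosComplement). Clearing the sign bit is correct for IEEE singles, but on a two's-complement integer it gives a large wrong value; the branch-free form below corresponds to what a psrad / pxor / psubd sequence would do per lane on SSE2, which is an assumption about one possible fix rather than the code that was committed:

```pascal
program IntAbsSketch;
{$mode objfpc}

// Clearing the sign bit, as is done for IEEE singles.
function AbsBySignMask(X: LongInt): LongInt;
begin
  Result := LongInt(LongWord(X) and $7FFFFFFF);
end;

// Two's-complement absolute value without a branch.
function AbsTwosComplement(X: LongInt): LongInt;
var
  Mask: LongInt;
begin
  Mask := -Ord(X < 0);            // -1 (all bits set) if negative, else 0
  Result := (X xor Mask) - Mask;  // invert and add 1 only when negative
end;

begin
  WriteLn(AbsBySignMask(-5));      // 2147483643, not 5
  WriteLn(AbsTwosComplement(-5));  // 5
  WriteLn(AbsTwosComplement(7));   // 7
end.
```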

dicepd commented 6 years ago

Ok all green again in win64 ( until I checkin the next test unit ;) )

jdelauney commented 6 years ago

OK, now all is green for me too, until the next test 👍

Just one thing: can you explain to me why, in the Combine function, we need to do an extra load of the w component? I don't understand this trick.

jdelauney commented 6 years ago

I'm just running a timing test: TGLZVector4i.Combine2 and 3 raise a SIGSEGV at 0000000100072491 660f7f02 movdqa %xmm0,(%rdx)

jdelauney commented 6 years ago

Perhaps a clue, I'll test tonight. I'm coding a new class right now, wrapped in {$CODEALIGN } directives. I declared some properties in public and pressed CTRL+SHIFT+C: code completion put the functions and procedures inside the block between {$CODEALIGN RECORDMIN=16} and {$CODEALIGN RECORDMIN=4}... hmmmm

dicepd commented 6 years ago

Just one thing: can you explain to me why, in the Combine function, we need to do an extra load of the w component? I don't understand this trick.

The Pascal code wants the same W value as self. If I do a movhlps-type move it preserves whatever junk was already there; I needed to ensure some 0 values in the reg for the shuffle without clearing the reg first. So the choices are:
1. op1 xor a reg, op2 movhlxx to reg, op3 shuffle to get 0,0,0,W
2. op1 copy reg, op2 mask w only (large opcode 4 single single move)
3. op1 loadss (clears upper to 0), op2 shuffle (gets a var already in local cache); with a bit of luck the pipeline optimiser uses the value already in the reg.

If the processor has separate int and float pipelines, or multiple pipelines, then all W ops will have completed before the float result is done. It is quite hard to determine what the processor will do, so just give it a chance by specifying ops that could run side by side (on an infinite-pipeline machine) using separate regs, and hope the pipeline optimiser does its job.

dicepd commented 6 years ago

To formalise how I think about these problems, look at the rough diagram attached. Thick horizontal lines are hard breaks for any chance of concurrency; in SSE terms, we have to wait for the pipeline to empty before we can begin the next operation. If I manipulate self at the top of the diagram to extract W, then I introduce another op before I can fill the lines with the singles and shuffle ops. Delaying the load and shuffle of W should allow the op path to reduce by one. Look at all the chances there are for the pipeline optimiser to give us that W shuffle operation for free. There may be a better place for the W load than where I have put it at the moment, i.e. where concurrency is poor, but I was just "getting the balls green" when I wrote that.

[Image: diagram1]

It's a lot easier just doing this in my head than describing how I do it :)

dicepd commented 6 years ago

Now that I have that diagram I can see another optimisation, which gets rid of the Mask Res line. Change shuffle F1 to |0|F1|F1|F1| in xmm order, and F2 similarly; then we have cleared all W to 0.

That gave me an increase from 6.8 -> 7.7 on the speed factor :) and this is now hmg safe.

Combine is optimised to get rid of the above. :( Silly mistake, when you have set V2.W to 0 and are just adding self.

jdelauney commented 6 years ago

To formalise how I think about these problems, look at the rough diagram attached. It's a lot easier just doing this in my head than describing how I do it :)

OK, I understand a little more. We have latency with SSE (some docs I've read on the subject are a little clearer in my mind now).

Combine is optimised to get rid of the above. :( Silly mistake, when you have set V2.W to 0 and are just adding self.

Yes, by comparing with my code I understand and see where my error was :)

jdelauney commented 6 years ago

With the same trick it will be easy to manage an affine vector in an Hmg vector and ensure the result of the op, no?

jdelauney commented 6 years ago

OK, I made the changes in the win64 imp and reran the timing; I still get a SIGSEGV: RDX (the result) becomes unaligned at some point. Changing movdqa to movdqu [Result], xmm0 solves the problem and all are green.

dicepd commented 6 years ago

Combine is optimised to get rid of the above. :( Silly mistake, when you have set V2.W to 0 and are just adding self.

That was my silly mistake not yours ;)

jdelauney commented 6 years ago

So it's OK with movdqa now; I just added {$CODEALIGN LOCALMIN=16} in the vector4iTimingTest unit 👍