Closed: jdelauney closed this issue 6 years ago
Is it due to the extracfg file?
No, it's an alignment problem, but why?
I've traced the first SIGSEGV to TGLZVector4i: in all operators the result is not aligned! Changing movdqa [result],xmm0 to movdqu [result],xmm0 brings things back to normal and all is OK. It's very strange: with USE_ASM_SSE everything works fine, but when I set USE_ASM_SSE_3 everything breaks! Do you have a clue, Peter? Why does result become unaligned?
OK, I just checked: USE_ASM_SSE does not activate USE_ASM, so it was running the NATIVE code.
And now all the SSE code is broken due to this misalignment. It worked before, so why is it not working now? It's a mystery to me.
No clue in the .s file either:
.section .text.n_glzvectormath$_$tglzvector4i_$__$$_plus$tglzvector4i$tglzvector4i$$tglzvector4i,"x"
.balign 16,0x90
.globl GLZVECTORMATH$_$TGLZVECTOR4I_$__$$_plus$TGLZVECTOR4I$TGLZVECTOR4I$$TGLZVECTOR4I
GLZVECTORMATH$_$TGLZVECTOR4I_$__$$_plus$TGLZVECTOR4I$TGLZVECTOR4I$$TGLZVECTOR4I:
# Var A located in register rdx
# Var B located in register r8
# Var $result located in register rcx
# [vectormath_vector4i_win64_sse_imp.inc]
# [4] asm
# Register rax,rcx,rdx,r8,r9,r10,r11 allocated
# [5] movdqa xmm0,[A]
movdqa (%rdx),%xmm0
# [9] movaps xmm1,[B]
movaps (%r8),%xmm1
# [10] paddd xmm0, xmm1
paddd %xmm1,%xmm0
# [12] movdqa [RESULT], xmm0
movdqa %xmm0,(%rcx)
# Register rax,rcx,rdx,r8,r9,r10,r11 released
# [13] end;
ret
After changing movdqa to movdqu, I now get a SIGSEGV on this line: 00000001000720B4 410f2808 movaps (%r8),%xmm1
It's completely silly; I don't understand why the variables are unaligned.
I will break out the win64 box and have a look
I confirm all of them are unaligned: A, B and RESULT.
I think a $CODEALIGN somewhere is the cause.
OK, I've added
{$CODEALIGN LOCALMIN=16}
{$CODEALIGN CONSTMIN=16}
at the top of Vector4iFunctionalTest and it works now. I'll need to check the other test units.
But now many tests are broken (PS: I'm not using FastMath for my tests).
What was happening without those directives was that result pointed to a location on the stack, and the caller retrieved the result from there. That stack location was not 16-byte aligned. I think it is LOCALMIN that aligns the locals on the stack.
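A minimal sketch of how the alignment of stack locals could be checked, assuming Free Pascal in objfpc mode. TVec4i and IsAligned16 are hypothetical names made up for this illustration (TVec4i only stands in for TGLZVector4i), and the program makes no claim about what a given compiler version will print.

```pascal
program LocalAlignDemo;
{$mode objfpc}
{$CODEALIGN LOCALMIN=16}  // the directive added in the thread; remove it to compare

type
  // Hypothetical stand-in for TGLZVector4i: four 32-bit integers (16 bytes).
  TVec4i = record
    X, Y, Z, W: LongInt;
  end;

// True when the address is a multiple of 16, which movdqa/movaps require.
function IsAligned16(P: Pointer): Boolean;
begin
  Result := (PtrUInt(P) and 15) = 0;
end;

procedure CheckLocals;
var
  A, B, R: TVec4i;
begin
  // Only the addresses are inspected; the contents are never read.
  WriteLn('A aligned: ', IsAligned16(@A));
  WriteLn('B aligned: ', IsAligned16(@B));
  WriteLn('R aligned: ', IsAligned16(@R));
end;

begin
  CheckLocals;
end.
```

Running it once with and once without the directive would show whether LOCALMIN is what changes the alignment of the locals on a given setup.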
Ok the final result for TGLZVector2 with
for TGLZVector4i with
All other tests are green.
So I must compare your Unix code with my Win64 code.
I don't understand why everything was green last time.
I get lots of failures for SSE3 too.
I have just been grinding out tests for the Pascal code and not checking the asm.
These are new tests to test all possibilities (at least the ones I can think of) so I was kinda expecting some breaks.
Div and DivInt do not truncate in asm; they round. Pascal truncates Div.
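The difference between the two behaviours can be seen in plain Pascal, without any asm. This is only an illustration of truncation versus rounding. On the SSE side, cvtps2dq converts using the current rounding mode (round to nearest even by default) while cvttps2dq truncates; whether the asm in question uses either of them is an assumption here, not something the thread states.

```pascal
program TruncVsRound;
{$mode objfpc}

var
  Q: Single;
begin
  Q := 7 / 2;          // 3.5, exactly representable as a Single
  WriteLn(7 div 2);    // 3: integer division truncates toward zero
  WriteLn(Trunc(Q));   // 3: same result as div
  WriteLn(Round(Q));   // 4: rounding gives a different answer

  Q := -7 / 2;         // -3.5
  WriteLn(-7 div 2);   // -3
  WriteLn(Trunc(Q));   // -3
  WriteLn(Round(Q));   // -4
end.
```

A test that compares the asm result against the Pascal div result will therefore fail on any quotient with a fractional part of one half or more.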
<> should be set to true on non-zero, not on zero.
For Combine, that's what you told me about in the other issue. I'll take a deeper look tonight.
Thanks
It would seem my boring grinding out of these tests is not in vain or a waste of time then :)
That's clear, and without you I would never have written so many tests.
One thing I don't understand is why all the tests are green under Unix64. The code for Unix64 and Win64 was the same!
Now with the functional tests
I've fixed Normalize in TGLZVector2f; all is OK now.
Next, TGLZVector4i.Div: it's OK now. I just needed to surround "Trunc" with push/pop RCX, because "Trunc" uses ECX and the result is in RCX.
Now 3 functions don't work in TGLZVector4i
Unix 64 was broken as well after the 4i tests.
The Abs mask only works for singles; integer negation needs two's complement.
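A small sketch of why this is so, in plain Pascal rather than SSE. Clearing the top bit gives the absolute value of an IEEE single, because the sign is a separate bit there, but the same mask applied to a two's complement integer produces garbage. TBits is a hypothetical helper record used only to view the bits of a Single.

```pascal
program AbsMaskDemo;
{$mode objfpc}

type
  // Hypothetical helper: view the same 32 bits as a Single or as a LongWord.
  TBits = record
    case Boolean of
      False: (F: Single);
      True:  (U: LongWord);
  end;

var
  S: TBits;
  I, Masked: LongInt;
begin
  // Single: clearing the sign bit yields the absolute value.
  S.F := -5.0;
  S.U := S.U and $7FFFFFFF;
  WriteLn(S.F:0:1);        // 5.0

  // Integer: the same mask does not negate, it just drops the top bit.
  I := -5;                 // bit pattern $FFFFFFFB
  Masked := LongInt(LongWord(I) and $7FFFFFFF);
  WriteLn(Masked);         // 2147483643, not 5

  // Two's complement negation: invert all bits, then add one.
  WriteLn((not I) + 1);    // 5
end.
```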
Ok all green again in win64 ( until I checkin the next test unit ;) )
OK, now all is green for me too, until the next test 👍
Just one thing: can you explain to me why, in the Combine function, we need to do an extra load of the W component? I don't understand this trick.
I'm just running the timing test: TGLZVector4i.Combine2 and 3 raise a SIGSEGV at 0000000100072491 660f7f02 movdqa %xmm0,(%rdx)
Perhaps a clue; I'll test tonight. I'm now coding a new class in which I write {$CODEALIGN }. I put some properties in public and pressed CTRL+SHIFT+C, and code completion put the functions and procedures inside the block between {$CODEALIGN RECORDMIN=16} and {$CODEALIGN RECORDMIN=4}... hmmmm
> Just one thing: can you explain to me why, in the Combine function, we need to do an extra load of the W component? I don't understand this trick.
The Pascal code wants the same W value as self. If I do a movhlps-type move, it preserves whatever junk was already there; I needed to ensure some 0 values in the reg for the shuffle without clearing the reg first. So the choices are:
- op1 xor a reg, op2 movhlxx to reg, op3 shuffle to get 0,0,0,W
- op1 copy reg, op2 mask W only (large opcode, 4 single move)
- op1 loadss (clears upper to 0), op2 shuffle (gets a var already in local cache); with a bit of luck the pipeline optimiser uses the value already in the reg.
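The third choice can be modelled in plain Pascal to see where the zeros come from. This is only a lane-by-lane simulation under stated assumptions: TLanes, LoadSS and Shufps are hypothetical names, the selector value $15 is chosen for this illustration, and none of it is taken from the actual Combine implementation.

```pascal
program ShuffleWDemo;
{$mode objfpc}

type
  TLanes = array[0..3] of Single;  // lane 0 is the lowest element of an xmm register

// Models a scalar load (movss xmm, [mem]): lane 0 gets the value, lanes 1..3 become 0.
function LoadSS(V: Single): TLanes;
begin
  Result[0] := V;
  Result[1] := 0;
  Result[2] := 0;
  Result[3] := 0;
end;

// Models shufps dest, src, imm8: the two low result lanes are picked from dest,
// the two high result lanes from src, with two selector bits per lane.
function Shufps(const D, S: TLanes; Imm: Byte): TLanes;
begin
  Result[0] := D[Imm and 3];
  Result[1] := D[(Imm shr 2) and 3];
  Result[2] := S[(Imm shr 4) and 3];
  Result[3] := S[(Imm shr 6) and 3];
end;

var
  T, R: TLanes;
begin
  T := LoadSS(42.0);       // 42.0 is a made-up W value
  R := Shufps(T, T, $15);  // selectors 1,1,1,0: lanes 0..2 take a zero, lane 3 takes W
  WriteLn(R[0]:0:1, ' ', R[1]:0:1, ' ', R[2]:0:1, ' ', R[3]:0:1);  // 0.0 0.0 0.0 42.0
end.
```

Two ops are enough because the scalar load already zeroes the upper lanes, so no separate xor or mask is needed before the shuffle.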
If the processor has separate int and float pipelines, or multiple pipelines, then all W ops will have completed before the float result is done. It is quite hard to determine what the processor will do, so just give it a chance by specifying ops that could be run side by side (on an infinite-pipeline machine), using separate regs, and hope the pipeline optimiser does its job.
To formalise how I think about these problems, look at the rough diagram attached. Thick horizontal lines are hard breaks for any chance of concurrency; in SSE terms, we have to wait for the pipeline to empty before we can begin the next operation. If I manipulate self at the top of the diagram to extract W, then I introduce another op before I can fill the lines with the singles and shuffle ops. Delaying the load and shuffle of W should allow the op path to reduce by one. Look at all the chances there are for the pipeline optimiser to give us that W shuffle operation for free. There may be a more optimal place to do the load of W, i.e. where concurrency is poor, than where I have placed it at the moment, but I was just "getting the balls green" when I wrote that.
It's a lot easier just doing this in my head than describing how I do it :)
Now that I have that diagram, I can see another optimisation which gets rid of the Mask Res line: change the F1 shuffle to |0|F1|F1|F1| (xmm order) and F2 similarly; then we have cleared all W to 0.
That gave me an increase in speed factor from 6.8 to 7.7 :) and this is now hmg safe.
Combine is now optimised to get rid of the above. :( Silly mistake, when you have set V2.W to 0 and are just adding self.
> To formalise how I think about these problems, look at the rough diagram attached. It's a lot easier just doing this in my head than describing how I do it :)
OK, I understand a little more. We have latency with SSE (some docs I've read on the subject are a little clearer in my mind now).
> Combine is now optimised to get rid of the above. :( Silly mistake, when you have set V2.W to 0 and are just adding self.
Yes, by comparing with my code I understand and see where my error was :)
With the same trick it will be easy to manage an affine vector in an hmg vector and ensure the result of the op, no?
OK, I made the changes in the win64 imp and reran the timing test. I still get a SIGSEGV: RDX (the result) becomes unaligned at some point. Changing movdqa to movdqu [Result], xmm0 solves the problem and all are green.
> Combine is now optimised to get rid of the above. :( Silly mistake, when you have set V2.W to 0 and are just adding self.
That was my silly mistake, not yours ;)
So it's OK with movdqa now; I just added {$CODEALIGN LOCALMIN=16}
in the vector4iTimingTest unit 👍
A strange behaviour in Vector2f and Vector4f: the functions Max, Clamp and ClampSingle raise a SIGSEGV and go straight into the debugger: 0000000100212797 ff9230010000 callq *0x130(%rdx)
??????
The tests work with USE_ASM_SSE!
Very, very strange, and I haven't made any changes since you last checked the Win64 code.