jdelauney / SIMD-VectorMath-UnitTest

For testing an asm SIMD (SSE/SSE2/SSE3/SSE4.x/AVX/AVX2) vector math library (2f, 4f, matrix, quaternion...) with Lazarus and the Free Pascal Compiler
Mozilla Public License 2.0

Real Test Case : BoïdZ Demo #12

Closed jdelauney closed 6 years ago

jdelauney commented 6 years ago

I've made a little demo for testing our vectors.

I think the problem is due to some out-of-range values, most likely caused by bad data alignment.

I need to find out what is wrong, but I don't really know how to debug and trace this kind of "badass" behaviour.

dicepd commented 6 years ago

umainform.pas(291,23) Error: identifier idents no member "ST"


Ok found it, nothing is visible but it is doing 16.4 fps according to the titlebar

dicepd commented 6 years ago

OK, I finally got it working by using MainForm.Canvas directly. It runs in both 64-bit native and 64-bit SSE.

Though not much speedup.

dicepd commented 6 years ago

OK, a bit more testing, and yes: when I get the settings right (max CPU, cadencer turned down) I get more fps with SSE, but only about 20%. I suspect most of the time is spent drawing rather than in AnimateScene.

jdelauney commented 6 years ago

> Can run in both 64 bit native and 64 bit SSE.
> Ok a bit more testing and yes when I get the setting right to max cpu, turn cadencer down, I get more fps with SSE, but only about 20%

Ouch! On Win64 with SSE I only see a straight line from top left to bottom left (same with the direct Canvas).

What did you do?

dicepd commented 6 years ago
  mainform.Canvas.Clear;                // <-- changed
  for i := 0 to maxboidz do
    b := FBoidz[i];
    p := b.Round;

    // compute the direction of movement for the color
    // CurColor := FBitmapBuffer.ColorManager.Palette.Colors[c].Value;
    CurColor := FColorMap[c];
    // with FBitmapBuffer.Canvas do
    with MainForm.Canvas do             // <-- changed
      Pen.Style := psSolid;
      Pen.Color := CurColor;
      // draw a line whose length is the speed
  // MainForm.Refresh;

  (* With FBitmapBuffer.Canvas do

That is the only code change needed to see the swarm. Apart from the engine settings, the only other thing I changed was adding -Sv and -O3 to the compiler options.

jdelauney commented 6 years ago

What the hell is going on with Windows, for once? The first image shows what I see without the "-dUSE_ASM_SSE_3" option, and the second image what I see with it.

And with your change the window is still empty; nothing is shown.

It's silly!

dicepd commented 6 years ago

Let me get my windows box going, I'll get back to you if I find something.

jdelauney commented 6 years ago

And this is what I see with my own bitmap management, without SSE3 and with it.

The second one gives the same output with your changes too.

It's really, really strange! I'll add a log to see the results of the operations.

jdelauney commented 6 years ago

OK, now with "Range checking" (-Cr) enabled in the debug options, this is what I see:

But look at the FPS: around 1.5! So something is wrong, or something odd happens under Win64 with the current SSE code.

jdelauney commented 6 years ago

I've just found this: https://software.intel.com/en-us/articles/x87-and-sse-floating-point-assists-in-ia-32-flush-to-zero-ftz-and-denormals-are-zero-daz

Perhaps the beginning of an answer.

With my poor English I don't understand all of it well :(

dicepd commented 6 years ago

My windows box just sees a stripe from top left to bottom right too. It is as if something is clamping it to x = y scaled to window size.

dicepd commented 6 years ago

Check your win64 vector4f round and trunc: they are so wrong! It now works for me in win64 after fixing those up.

jdelauney commented 6 years ago

So I've played a bit with the MXCSR register, and the problem seems to come from TGLZVector2f./ : a SIGFPE is raised on the line divps xmm0, xmm1. After replacing it with native code, there also seems to be a problem with TGLZVector4f. But what? This needs a deeper search.
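As a debugging aid for this kind of SIGFPE, here is a minimal sketch of masking floating-point exceptions through the Free Pascal Math unit instead of touching MXCSR by hand. It assumes `SetExceptionMask` also updates the SSE control register on the x86 targets, which is my understanding of the RTL but is not confirmed in this thread; the `Divide` helper and the program name are made up for illustration. Masking only hides the symptom (the invalid lanes turn into NaN), so it helps to locate the faulty operation, not to fix it.

```pascal
program MaskFpuSketch;
{$mode objfpc}
uses
  Math;

// Hypothetical helper: kept as a function so the division happens at run time.
function Divide(A, B: Single): Single;
begin
  Result := A / B;
end;

var
  OldMask: TFPUExceptionMask;
  Q: Single;
begin
  // Mask every floating-point exception and remember the previous mask.
  OldMask := SetExceptionMask([exInvalidOp, exDenormalized, exZeroDivide,
                               exOverflow, exUnderflow, exPrecision]);

  // With exInvalidOp masked, 0/0 yields NaN instead of raising an exception.
  Q := Divide(0.0, 0.0);
  if IsNan(Q) then
    WriteLn('0/0 produced NaN; the operation is invalid but no longer traps');

  // Restore the original mask so real errors are reported again.
  SetExceptionMask(OldMask);
end.
```

Logging which results come back as NaN with the mask in place is one way to narrow down the operator that feeds bad lanes to `divps`.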

jdelauney commented 6 years ago

OK, you're right Peter, the round function causes the problem:

function TGLZVector4f.Round: TGLZVector4i;assembler;nostackframe;register;
asm
  // Rounding mode defaults to round-to-nearest
  movaps   xmm0, [RCX]
  cvtps2dq xmm0, xmm0
  movdqa   [RDX], xmm0
end;

I don't know how to fix it... but with native code I still get the same error I described before with TGLZVector2f./

jdelauney commented 6 years ago

OK, finally:

function TGLZVector4f.Round: TGLZVector4i;assembler;//nostackframe;register;
asm
  // Rounding mode defaults to round-to-nearest
  movaps   xmm0, [RCX]
  cvtps2dq xmm0, xmm0
  movdqa   [Result], xmm0
end;

is working. But to get BoidZ working I need to change this:

class operator TGLZVector2f./(constref A, B: TGLZVector2f): TGLZVector2f; assembler; nostackframe; register;
asm
  movq  xmm0, [A]
  movq  xmm1, [B]
  divps xmm0, xmm1  // SIGFPE raised here
  movq  [Result], {%H-}xmm0
end;

to this:

class operator TGLZVector2f./(constref A, B: TGLZVector2f): TGLZVector2f;
begin
  result.x := a.x / b.x;
  result.y := a.y / b.y;
end;

jdelauney commented 6 years ago

OK, I found the error in the SSE function above: the high half of the divisor register is not set and equals 0, so it is normal that a SIGFPE is raised. The trick is to fill it:

class operator TGLZVector2f./(constref A, B: TGLZVector2f): TGLZVector2f; assembler; //nostackframe; register;
asm
  movq    xmm0, [A]
  movq    xmm1, [B]
  movlhps xmm1, xmm1 //--- Fill upper register
  divps   xmm0, xmm1
  movq    RAX, xmm0
end;

Now it is working. The gain is not high: without asm the average FPS is 18.55; with SSE enabled it is 19.97.
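To make the reasoning behind the fix explicit, here is a small plain-Pascal model of the four lanes of an SSE register. It is only a sketch: `TXmm`, `LoadLow`, `DupLowToHigh` and `InvalidLanes` are hypothetical names for illustration, not part of the library. `movq` loads 64 bits and zeroes the upper two lanes, and `divps` divides all four lanes, so the unused lanes compute 0/0, which is an invalid operation. Duplicating the low half of the divisor with `movlhps` turns those lanes into 0/b.x and 0/b.y.

```pascal
program DivLanesSketch;
{$mode objfpc}

type
  TVec2 = record
    X, Y: Single;
  end;
  TXmm = array[0..3] of Single;  // model of one 128-bit SSE register

// Models "movq xmm, [mem]": loads two singles and zeroes lanes 2 and 3.
function LoadLow(const V: TVec2): TXmm;
begin
  Result[0] := V.X;
  Result[1] := V.Y;
  Result[2] := 0;
  Result[3] := 0;
end;

// Models "movlhps xmm, xmm" on one register: copies lanes 0..1 into 2..3.
function DupLowToHigh(const R: TXmm): TXmm;
begin
  Result := R;
  Result[2] := R[0];
  Result[3] := R[1];
end;

// Counts the lanes in which divps would compute 0/0 (invalid operation).
function InvalidLanes(const Num, Den: TXmm): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 0 to 3 do
    if (Num[i] = 0) and (Den[i] = 0) then
      Inc(Result);
end;

var
  A, B: TVec2;
begin
  A.X := 6; A.Y := 8;
  B.X := 2; B.Y := 4;
  // Original operator: both upper lanes are 0/0.
  WriteLn('invalid lanes before fix: ', InvalidLanes(LoadLow(A), LoadLow(B)));
  // With the divisor's low half duplicated, no lane is 0/0 any more.
  WriteLn('invalid lanes after fix:  ',
          InvalidLanes(LoadLow(A), DupLowToHigh(LoadLow(B))));
end.
```

Note that the trick only protects the unused lanes: if b.x or b.y is itself zero, the division still faults unless the corresponding exception is masked.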

dicepd commented 6 years ago

OK, I profiled this with valgrind and, to be honest, I am surprised we got any speedup. Less than 15% of the time is spent in calls to SSE code. CheckAngleofView is the killer here: it is really bad design, as it ends up calling Math.ArcTan2 through another call, so there is a whole stack frame around a call to a native Pascal FPU function, and it is called a lot. Directly afterwards comes a call to the SSE lengthSqr. In fact, this one call to LengthSquare accounts for the bulk of the calls into the SSE library.

So all in all probably not a good choice for a speedup demo.

dicepd commented 6 years ago

So, switching to FastArcTangent2 turns 18 fps into 46 fps.
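For readers wondering why an approximate arctangent can be so much cheaper, here is a generic sketch of the idea. This is not the library's FastArcTangent2, whose implementation is not shown in this thread; `ApproxArcTan2` is a hypothetical name, and the polynomial is a textbook approximation of atan on [0, 1] with an error on the order of a few thousandths of a radian, which is usually enough for a field-of-view test.

```pascal
program FastAtan2Sketch;
{$mode objfpc}
uses
  Math;

// Hypothetical helper, NOT the library's FastArcTangent2.
function ApproxArcTan2(Y, X: Single): Single;
const
  cPi        = 3.14159265;
  cHalfPi    = 1.57079633;
  cQuarterPi = 0.78539816;
var
  AbsX, AbsY, T: Single;
begin
  AbsX := Abs(X);
  AbsY := Abs(Y);
  if (AbsX = 0) and (AbsY = 0) then
    Exit(0);  // angle is undefined at the origin; return 0 by convention

  // Reduce to a ratio in [0, 1] so the polynomial stays in its valid range.
  if AbsX >= AbsY then
    T := AbsY / AbsX
  else
    T := AbsX / AbsY;

  // atan(t) ~ t * (pi/4 + 0.273 * (1 - t)) for t in [0, 1]
  Result := T * (cQuarterPi + 0.273 * (1 - T));

  // Undo the reduction, then restore the quadrant from the signs.
  if AbsY > AbsX then
    Result := cHalfPi - Result;
  if X < 0 then
    Result := cPi - Result;
  if Y < 0 then
    Result := -Result;
end;

begin
  // Compare against the exact Math.ArcTan2 for one sample point.
  WriteLn('approx: ', ApproxArcTan2(1, 2):0:4,
          '  exact: ', ArcTan2(1, 2):0:4);
end.
```

The approximation needs one division and a handful of multiplies and adds, with no call into the FPU transcendental routines, which fits the profile above where the ArcTan2 call dominated.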

dicepd commented 6 years ago

After stripping out all the profiling and setting a fixed window size (fps changes with size), I get about 33 fps native and 42 fps with SSE.

jdelauney commented 6 years ago

Ok after some minor changes with