jdelauney / SIMD-VectorMath-UnitTest

For testing an asm SIMD (SSE/SSE2/SSE3/SSE4.x/AVX/AVX2) vector math library (2f, 4f, matrix, quaternion...) with Lazarus and the Free Pascal Compiler
Mozilla Public License 2.0

Real Test Case: BoïdZ Demo #12

Closed: jdelauney closed this issue 6 years ago

jdelauney commented 6 years ago

I've made a little demo to test our vectors.

Results

I think the problem comes from some out-of-range access, most likely caused by bad data alignment.

I must find out what is wrong, but I don't really know how to debug and trace these "badass" behaviours.
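One way to test the alignment theory is to check addresses directly. Aligned SSE moves (movaps, movdqa) fault when the address is not a multiple of 16 bytes, while movups accepts any address. The sketch below is in C with SSE intrinsics rather than Free Pascal asm: _mm_load_ps normally compiles to an aligned load and _mm_loadu_ps to movups. Vec4fAligned, is_aligned16 and the load helpers are hypothetical names and are not part of the library.

```c
#include <assert.h>
#include <stdint.h>     /* uintptr_t */
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps, _mm_loadu_ps */

/* Hypothetical 4-float vector type forced onto a 16-byte boundary,
   the alignment that movaps/movdqa require. */
typedef struct {
    _Alignas(16) float v[4];
} Vec4fAligned;

/* Hypothetical helper: 1 if p is a multiple of 16 bytes, else 0. */
int is_aligned16(const void *p) {
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical debug helper: a misaligned pointer fails the assert
   with a clear message instead of faulting inside the aligned load. */
__m128 load4_aligned_checked(const float *p) {
    assert(is_aligned16(p));
    return _mm_load_ps(p);
}

/* Hypothetical fallback: unaligned load (movups), valid for any address. */
__m128 load4_any(const float *p) {
    return _mm_loadu_ps(p);
}
```

In Pascal, the same check should be possible before entering an asm routine, for example by testing `PtrUInt(@V) and 15 = 0`.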

dicepd commented 6 years ago

umainform.pas(291,23) Error: identifier idents no member "ST"

  MoveTo(P.ST.x,P.ST.y); 

OK, found it. Nothing is visible, but it is doing 16.4 fps according to the title bar.

dicepd commented 6 years ago

OK, finally got it working by drawing on MainForm.Canvas directly. It runs both as 64-bit native and as 64-bit SSE.

Not much speedup, though.

dicepd commented 6 years ago

OK, after a bit more testing: yes, once I get the settings right (max CPU, cadencer turned down), I get more fps with SSE, but only about 20% more. I suspect most of the time is spent drawing rather than in AnimateScene.

jdelauney commented 6 years ago

Can run in both 64 bit native and 64 bit SSE. Ok a bit more testing and yes when I get the setting right to max cpu, turn cadencer down, I get more fps with SSE, but only about 20%

Ouch!!!! In Win64 with SSE I only see a straight line from top left to bottom left (same with the direct Canvas).

What are you doing differently?????

dicepd commented 6 years ago
  FBitmapBuffer.canvas.Brush.color:=clBlack;
  FBitmapBuffer.canvas.FillRect(clientrect);
  mainform.Canvas.Clear;   // <------- changed
  for i:=0 to maxboidz do
  begin
    b := FBoidz[i];
    p := b.Round;
    //P.X:=Round(FBoidz[i].ST.X);
    //P.Y:=Round(FBoidz[i].ST.Y);

    // compute the direction of movement, used to pick the colour
    c:=round(ArcTan2(b.UV.X,b.UV.Y)*180/cPi)+180;
    //c:=345;
    //CurColor := FBitmapBuffer.ColorManager.Palette.Colors[c].Value;
    CurColor := FColorMap[c];
    //with FBitmapBuffer.Canvas do
    with MainForm.Canvas do   // <--------- changed
    begin
      Pen.Style := psSolid;
      Pen.Color := CurColor;
      // draw a line whose length is the speed
      MoveTo(P.ST.x,P.ST.y);
      LineTo(P.ST.x+P.UV.x,P.ST.y+P.UV.y);
    end;
  end;
  //MainForm.Refresh;
  (* With FBitmapBuffer.Canvas do                             

That is the only code change needed to see the swarm. Apart from the engine settings, the only other thing I changed was adding -Sv and -O3 to the compiler options.

jdelauney commented 6 years ago

What the hell is going on with Windows, for once???? The first image is what I see without the "-dUSE_ASM_SSE_3" option, and the second is what I see with it: 2018-01-08_203526

2018-01-08_203750

And with your change the window is still empty; nothing is shown.

It's silly!!!!!!!

dicepd commented 6 years ago

Let me get my windows box going, I'll get back to you if I find something.

jdelauney commented 6 years ago

And here is what I see with my own bitmap management, without SSE3 and with it:

2018-01-08_203526_cr

2018-01-08_204619

The second one is the same output, also with your changes.

It's really, really strange!!!! I'll add some logging to see the results of the operations.

jdelauney commented 6 years ago

OK, now with "Range checking" (-Cr) enabled in the debug options, this is what I see:

2018-01-08_205718

But look at the FPS: around 1.5!!!!!!! So either something is wrong, or something happens under Win64 with the current SSE code.

jdelauney commented 6 years ago

I've just found this: https://software.intel.com/en-us/articles/x87-and-sse-floating-point-assists-in-ia-32-flush-to-zero-ftz-and-denormals-are-zero-daz

Perhaps the beginning of an answer.

With my poor English I don't understand all of it :(
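In short, the article describes two MXCSR bits. FTZ (flush-to-zero, bit 15) makes SSE instructions return 0 instead of a denormal result. DAZ (denormals-are-zero, bit 6) makes them read denormal inputs as 0. Both avoid the slow microcode assists that denormals can trigger, at the cost of strict IEEE 754 behaviour. Whether denormals are what slows things down here is only a hypothesis. A minimal C sketch with intrinsics follows; enable_ftz_daz is a hypothetical helper:

```c
#include <assert.h>
#include <pmmintrin.h>  /* SSE3 header: _MM_SET_DENORMALS_ZERO_MODE */
#include <xmmintrin.h>  /* SSE: _MM_SET_FLUSH_ZERO_MODE, _mm_getcsr */

#define MXCSR_DAZ 0x0040u  /* bit 6: denormal inputs are read as zero */
#define MXCSR_FTZ 0x8000u  /* bit 15: denormal results are flushed to zero */

/* Hypothetical helper: enables FTZ and DAZ for the calling thread
   (MXCSR is per-thread state) and returns the previous MXCSR value
   so the caller can restore it with _mm_setcsr(). */
unsigned int enable_ftz_daz(void) {
    unsigned int previous = _mm_getcsr();
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    /* Debug-build sanity check that both bits are now set. */
    assert((_mm_getcsr() & (MXCSR_FTZ | MXCSR_DAZ)) == (MXCSR_FTZ | MXCSR_DAZ));
    return previous;
}
```

With FTZ on, a product such as 1e-30f * 1e-10f, which would otherwise be the denormal 1e-40, comes out as exactly 0.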

dicepd commented 6 years ago

My Windows box also just shows a stripe from top left to bottom right. It is as if something is clamping it to x = y, scaled to the window size.

dicepd commented 6 years ago

Check your Win64 vector4f Round and Trunc; they are so wrong! It now works for me in Win64 after fixing those.
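For reference when fixing Round and Trunc, the two SSE2 conversions behave differently. cvtps2dq rounds with the current MXCSR rounding mode, which is round-to-nearest-even by default, so 2.5 gives 2 and 3.5 gives 4. cvttps2dq always truncates toward zero. A C sketch with intrinsics follows; round4 and trunc4 are hypothetical names:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_cvtps_epi32, _mm_cvttps_epi32 */

/* Hypothetical: rounds 4 floats to int32 like cvtps2dq, using the MXCSR
   rounding mode (round-to-nearest-even unless it has been changed). */
void round4(const float in[4], int out[4]) {
    __m128i r = _mm_cvtps_epi32(_mm_loadu_ps(in));
    _mm_storeu_si128((__m128i *)out, r);
}

/* Hypothetical: truncates 4 floats toward zero like cvttps2dq,
   regardless of the MXCSR rounding mode. */
void trunc4(const float in[4], int out[4]) {
    __m128i t = _mm_cvttps_epi32(_mm_loadu_ps(in));
    _mm_storeu_si128((__m128i *)out, t);
}
```

Here the C compiler decides where each result is stored. A hand-written asm routine has to locate its result address itself, and that depends on the calling convention of the target.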

jdelauney commented 6 years ago

So I've played a bit with the MXCSR register, and the problem seems to come from TGLZVector2f./: a SIGFPE is raised on the line divps xmm0, xmm1. After replacing it with native code, there also seems to be a problem with TGLZVector4f. But what???? I need to dig deeper.
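Some background for the MXCSR experiments. A SIGFPE on divps means the exception that instruction raised is unmasked in MXCSR: invalid operation for 0/0, or divide-by-zero for x/0. With that exception masked, the same instruction quietly returns NaN or infinity in the offending lane, which is easier to inspect in a debugger. The SIGFPE here suggests the Pascal runtime leaves these exceptions unmasked. Free Pascal's Math unit (SetExceptionMask) should be the Pascal-side way to change that. A C sketch follows, where div4_masked is a hypothetical helper:

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: _mm_div_ps, _mm_getcsr/_mm_setcsr, _MM_MASK_* */

/* Hypothetical debugging helper: divides a by b lane-wise with the
   invalid-operation and divide-by-zero exceptions masked, so x/0 gives
   +/-inf and 0/0 gives NaN instead of a SIGFPE. The caller's MXCSR
   (exception masks and sticky flags) is restored afterwards. */
void div4_masked(const float a[4], const float b[4], float out[4]) {
    unsigned int saved = _mm_getcsr();
    _mm_setcsr(saved | _MM_MASK_INVALID | _MM_MASK_DIV_ZERO);
    __m128 q = _mm_div_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    _mm_storeu_ps(out, q);
    _mm_setcsr(saved);
}
```

For example, dividing {1, 6, 1, 0} by {2, 3, 0, 0} yields 0.5, 2, +inf and NaN. The last two lanes identify the bad divisors at a glance.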

jdelauney commented 6 years ago

OK, you're right Peter, the Round function causes the problem:

function TGLZVector4f.Round: TGLZVector4i;assembler;nostackframe;register;
asm
  // Rounding mode defaults to round-to-nearest
  movaps   xmm0, [RCX]
  cvtps2dq xmm0, xmm0
  movdqa     [RDX],  xmm0
end; 

I don't know how to fix it...... and with native code I still get the same error I described before with TGLZVector2f./

jdelauney commented 6 years ago

OK, finally:

function TGLZVector4f.Round: TGLZVector4i;assembler;//nostackframe;register;
asm
  // Rounding mode defaults to round-to-nearest
  movaps   xmm0, [RCX]
  cvtps2dq xmm0, xmm0
  movdqa   [Result],  xmm0
end;

is working. But to get BoidZ working I need to change this:

class operator TGLZVector2f./(constref A, B: TGLZVector2f): TGLZVector2f; assembler; nostackframe; register;
asm
  movq  xmm0, [A]
  movq  xmm1, [B]
  divps xmm0, xmm1  // SIGFPE raise here
  movq  [Result], {%H-}xmm0
end;

to:

class operator TGLZVector2f./(constref A, B: TGLZVector2f): TGLZVector2f; 
begin
  result.x := a.x /b.x;
  result.y := a.y /b.y;
end; 

jdelauney commented 6 years ago

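OK, I found the error. In the SSE version of the function above, the high lanes of the divisor are never loaded and stay 0 (movq clears them, and the dividend's high lanes too), so those lanes compute 0/0 and it's normal that a SIGFPE is raised. Hence the trick: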

class operator TGLZVector2f./(constref A, B: TGLZVector2f): TGLZVector2f; assembler; //nostackframe; register;
asm
  movq  xmm0, [A]
  movq  xmm1, [B]
  movlhps xmm1,xmm1 //--- Fill upper register
  divps xmm0, xmm1
  movq     RAX,  xmm0
end;

Now it works, but the gain is small: without asm the average FPS is 18.55, and with SSE enabled it is 19.97.
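Here is why the movlhps trick works, mirrored in C with intrinsics. A 64-bit load (movq in the Pascal code, _mm_loadl_pi here) fills lanes 0-1 and zeroes lanes 2-3 in both operands. The unpatched divps therefore computes 0/0 in the upper lanes, which is an invalid operation and traps when that exception is unmasked. Copying the divisor into the upper half with movlhps (_mm_movelh_ps) leaves only 0/b there. load2, div2_unsafe, div2_safe and store2 are hypothetical names. In C these exceptions are masked by default, so the unsafe version returns NaN lanes instead of raising SIGFPE.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: _mm_loadl_pi, _mm_movelh_ps, _mm_div_ps */

/* Hypothetical: loads two floats into lanes 0-1 and zeroes lanes 2-3,
   like movq xmm, [mem]. */
static __m128 load2(const float p[2]) {
    return _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)p);
}

/* Hypothetical: the unpatched pattern. Lanes 2-3 compute 0/0, an
   invalid operation (NaN when masked, SIGFPE when unmasked). */
__m128 div2_unsafe(const float a[2], const float b[2]) {
    return _mm_div_ps(load2(a), load2(b));
}

/* Hypothetical: the patched pattern. movlhps copies the divisor into
   lanes 2-3, so those lanes compute 0/b instead of 0/0. */
__m128 div2_safe(const float a[2], const float b[2]) {
    __m128 vb = load2(b);
    vb = _mm_movelh_ps(vb, vb);  /* {b0, b1, b0, b1} */
    return _mm_div_ps(load2(a), vb);
}

/* Hypothetical: stores lanes 0-1 only, like movq [Result], xmm0. */
void store2(float out[2], __m128 v) {
    _mm_storel_pi((__m64 *)out, v);
}
```

Both versions produce the same two visible lanes. The difference is only in lanes 2-3, which is why the bug showed up as a trap rather than as wrong results.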

dicepd commented 6 years ago

OK, I profiled this with Valgrind, and to be honest I am surprised we got any speedup: less than 15% of the time is spent in calls to SSE code. CheckAngleofView is the killer here. It is a really bad design, as it ends up calling Math.ArcTan2 through another call, so there is a whole stack frame around a call to a native Pascal FPU function, and it is called a lot. Directly afterwards there is a call to the SSE lengthSqr; in fact this one call accounts for the bulk of the calls into the SSE library.

So, all in all, probably not a good choice for a speedup demo.

dicepd commented 6 years ago

Switching to FastArcTangent2 takes it from 18 fps to 46 fps.
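FastArcTangent2's implementation isn't shown in this thread. Purely as an illustration, here is one common way to approximate atan2 cheaply. Fold the angle into the first octant, evaluate atan(z) ~ (pi/4)z - z(z - 1)(0.2447 + 0.0663z) for z in [0, 1] (maximum error around 0.0015 rad), then restore the quadrant. It costs one division and no library call. fast_atan2 and atan_unit are hypothetical names, and this is not necessarily what the library does.

```c
#include <assert.h>

#define FAST_PI 3.14159265f

/* Hypothetical: polynomial approximation of atan(z) for z in [0, 1];
   maximum error is roughly 0.0015 rad. */
static float atan_unit(float z) {
    return (FAST_PI / 4.0f) * z - z * (z - 1.0f) * (0.2447f + 0.0663f * z);
}

/* Hypothetical fast atan2(y, x) in radians, range [-pi, pi]. */
float fast_atan2(float y, float x) {
    float ax = x < 0.0f ? -x : x;
    float ay = y < 0.0f ? -y : y;
    float a;
    if (ax == 0.0f && ay == 0.0f)
        return 0.0f;                              /* no direction: return 0 */
    if (ax >= ay)
        a = atan_unit(ay / ax);                   /* angle in [0, pi/4] */
    else
        a = FAST_PI / 2.0f - atan_unit(ax / ay);  /* angle in (pi/4, pi/2] */
    if (x < 0.0f)
        a = FAST_PI - a;                          /* left half-plane */
    if (y < 0.0f)
        a = -a;                                   /* lower half-plane */
    return a;
}
```

An error of about 0.0015 rad is well under a tenth of a degree, far finer than the 360-entry colour map in the drawing code needs.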

dicepd commented 6 years ago

After stripping out all the profiling and fixing the window size (fps changes with size), I get ~33 fps native and ~42 fps with SSE.

jdelauney commented 6 years ago

Ok after some minor changes with