jdelauney / SIMD-VectorMath-UnitTest

For testing asm SIMD (SSE/SSE 2/SSE 3/SSE 4.x / AVX /AVX 2) vector math library (2f, 4f, matrix, quaternion...) with Lazarus and FreePascal Compiler
Mozilla Public License 2.0

Homogeneous Vectors. #17

Closed dicepd closed 6 years ago

dicepd commented 6 years ago

Jerome,

I would like some way of making a sanity test for library code which uses homogeneous conventions. In the past I have spent many unhappy hours tracking down other people's bugs in code which was trying to use homogeneous conventions. Bugs such as adding two points, or scaling with a vector that has not had W set to 1, these being the two most common ones I encountered. Results always 'look' OK until you try to use them further down the line.

For 4x the addition of CreateScaleVec, CreatePoint[Vec] can make code more readable.

I am not a fan of raising errors from this type of library, but we could use something like the following, in the Pascal code only:

class operator TGLZVector4f.+ (constref A, B : TGLZVector4f) : TGLZVector4f; 
{$ifdef USE_STRICT_HOMOGENEOUS}
// code here to test for two points
// raise error if we have two points.
{$endif}
........

class operator TGLZVector4f.Scale( A : TGLZVector4f) : TGLZVector4f; 
{$ifdef USE_STRICT_HOMOGENEOUS}
 // ensure the scaling factor has W of one. self can then be vector or point.

Not only would it help in tracking down this type of bug but an end user could also use it to do a sanity check on code. I am sure you have suffered the same and get the idea.
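A minimal sketch of the "two points" check described above, assuming the usual convention that W = 1 marks a point and W = 0 a direction. `TVec4Sketch`, `MakeVec` and `EHomogeneousError` are hypothetical names used only for illustration; they are not part of the library.

```pascal
program StrictHomogeneousSketch;
{$mode objfpc}{$H+}
{$modeswitch advancedrecords}
{$define USE_STRICT_HOMOGENEOUS}

uses
  SysUtils;

type
  // Hypothetical error type, raised only in strict builds.
  EHomogeneousError = class(Exception);

  // Hypothetical stand-in for TGLZVector4f: W = 1 is a point, W = 0 a direction.
  TVec4Sketch = record
    X, Y, Z, W: Single;
    class operator +(constref A, B: TVec4Sketch): TVec4Sketch;
  end;

function MakeVec(AX, AY, AZ, AW: Single): TVec4Sketch;
begin
  Result.X := AX;
  Result.Y := AY;
  Result.Z := AZ;
  Result.W := AW;
end;

class operator TVec4Sketch.+(constref A, B: TVec4Sketch): TVec4Sketch;
begin
  {$ifdef USE_STRICT_HOMOGENEOUS}
  // Point + point has no geometric meaning; the sum would carry W = 2.
  if (A.W <> 0) and (B.W <> 0) then
    raise EHomogeneousError.Create('Cannot add two points');
  {$endif}
  Result.X := A.X + B.X;
  Result.Y := A.Y + B.Y;
  Result.Z := A.Z + B.Z;
  Result.W := A.W + B.W;
end;

var
  P, Q, D, R: TVec4Sketch;
  Rejected: Boolean;
begin
  P := MakeVec(1, 2, 3, 1); // point
  Q := MakeVec(4, 5, 6, 1); // point
  D := MakeVec(0, 1, 0, 0); // direction

  R := P + D; // point + direction is allowed and stays a point
  WriteLn('W after point + direction: ', R.W:0:1);

  Rejected := False;
  try
    R := P + Q; // point + point is rejected in strict builds
  except
    on EHomogeneousError do
      Rejected := True;
  end;
  WriteLn('point + point rejected: ', Rejected);
end.
```

Removing the `{$define USE_STRICT_HOMOGENEOUS}` line compiles the check away, so release builds pay nothing for it.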

jdelauney commented 6 years ago

> For 4x the addition of CreateScaleVec, CreatePoint[Vec] can make code more readable. I am not a fan of raising errors from this type of library, but if we use something like the following in pascal code only.

I agree with you about adding those functions, but as for

class operator TGLZVector4f.+ (constref A, B : TGLZVector4f) : TGLZVector4f; 
{$ifdef USE_STRICT_HOMOGENEOUS}
// code here to test for two points
// raise error if we have two points.
{$endif}

I'm not a fan of adding more preprocessor directives. I was thinking of integrating the routines for Vector3f (just cloning Vector4f) and making the changes for W. In asm, if we work with just 3 components we will lose performance, and with "AsVector3f" in Vector4f it will be easy to swap between homogeneous and affine.

dicepd commented 6 years ago

hmm testing Negate in 4f it has suddenly struck me that W should not be negated for Hmg.

Don't know what else that would affect. I wish that record inheritance was in this version, it could be overridden for Hmg
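As a small illustration of that observation, here is a sketch of a negate that leaves W alone, so a point stays a point and a direction stays a direction. `TVec4Sketch` and `NegateHmg` are hypothetical names, not library routines.

```pascal
program NegateHmgSketch;
{$mode objfpc}{$H+}

type
  // Hypothetical stand-in for TGLZVector4f.
  TVec4Sketch = record
    X, Y, Z, W: Single;
  end;

// Negates X, Y and Z only; W is copied through unchanged.
function NegateHmg(constref V: TVec4Sketch): TVec4Sketch;
begin
  Result.X := -V.X;
  Result.Y := -V.Y;
  Result.Z := -V.Z;
  Result.W := V.W;
end;

var
  P, N: TVec4Sketch;
begin
  P.X := 1; P.Y := -2; P.Z := 3; P.W := 1; // a point
  N := NegateHmg(P);
  WriteLn(N.X:0:1, ' ', N.Y:0:1, ' ', N.Z:0:1, ' ', N.W:0:1);
end.
```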

jdelauney commented 6 years ago

OK, I see 3 ways to implement affine vectors:

  movhlps  xmm1, xmm0
  movq     [Result], xmm0
  movss    [Result]8, xmm1

But those lines decrease performance a lot.

Result := t1.AsVector3f; end;

**Solution 2B :**
Clone the TGLZVector4f functions and manage W directly as you suggest, but without any more preprocessor directives.

* 3rd Solution: in TGLZVector4f, add a private flag variable such as "FAsAffine: Boolean" and declare a public property "AsAffine: Boolean".

Each function would then need to test this flag, but I don't know how to access this variable from asm:

  asm
    mov  aReg, FAsAffine
    test aReg, $f
    jnz  @DoAsAffine
    ....
    jmp  @toend
  @DoAsAffine:
    ....
  @toend:
  end;



Do you see any other solutions?
I think the best is Solution 2B.
What do you think?

dicepd commented 6 years ago

Ok, I have been thinking somewhat on this issue. Mainly around memory alignment, usage and glarrays. I think it boils down to how we use glFloat4f v glFloat3f in arrays and the impact of

AFAIK most modern graphics cards expand 3f to 4f internally for exactly these memory alignment issues, and use homogeneous internally for rendering.

Another thought/ question I want to explore is the overhead, or hopefully lack of it, of properties for advanced record and the use of default property. This is a germ of an idea which may come to nothing.

More thinking required before rushing in here I think

dicepd commented 6 years ago

Back on the 4f v 3f issue, 4f is quicker for us, we could compact to 3f but at what cost, and is that cost greater in cpu than the time it would take for a DMA transfer of 33% more bytes to the graphics card, which takes no cpu time at all as it is handled by the DMA outside the CPU cache zone.

A point to consider when thinking about CPU mem to GPU mem.

The more I think about this issue the more I think that 3f is really an optimisation that has had its day even with not so modern hardware.
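To put a number on the "33% more bytes" figure, here is a small worked calculation. The vertex count is an arbitrary assumption chosen only to make the arithmetic concrete.

```pascal
program TransferSizeSketch;
{$mode objfpc}{$H+}

const
  VertexCount = 100000; // hypothetical mesh size

var
  Bytes3f, Bytes4f: Int64;
begin
  // Three or four 4-byte Singles per vertex.
  Bytes3f := Int64(VertexCount) * 3 * SizeOf(Single);
  Bytes4f := Int64(VertexCount) * 4 * SizeOf(Single);
  WriteLn('3f buffer: ', Bytes3f, ' bytes');
  WriteLn('4f buffer: ', Bytes4f, ' bytes');
  WriteLn('extra for 4f: ', (Bytes4f - Bytes3f) * 100 div Bytes3f, '%');
end.
```

For this count the buffers are 1,200,000 and 1,600,000 bytes, so the extra transfer is one third of the 3f size regardless of how many vertices there are.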

jdelauney commented 6 years ago

In "modern OpenGL", data is passed through a pointer, so to keep it aligned we first need our own getmem/freemem functions. I found this, though it is not really nice:

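// Assumes sse_align (e.g. 16) and sse_align_mask (sse_align - 1) are declared elsewhere.
procedure sse_getmem(var p: pointer; size: PtrUInt);
var
  temppointer: pointer;
begin
  // Over-allocate: room for the original pointer plus worst-case padding.
  getmem(temppointer, size + sizeof(pointer) + sse_align_mask);
  // PtrUInt instead of qword so the casts are also valid on 32-bit targets.
  PtrUInt(p) := PtrUInt(temppointer) + sizeof(pointer);
  p := align(p, sse_align);
  // Store the original pointer just below the aligned block.
  ppointer(PtrUInt(p) - sizeof(pointer))^ := temppointer;
end;

procedure sse_freemem(var p: pointer);
var
  temppointer: pointer;
begin
  // Recover the pointer that getmem actually returned.
  temppointer := ppointer(PtrUInt(p) - sizeof(pointer))^;
  freemem(temppointer);
  p := nil;
end;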
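A self-contained variant of the same over-allocate-and-round-up idea, with a check that the returned pointer really is 16-byte aligned. The names `AlignedGetMem`, `AlignedFreeMem`, `SseAlign` and `SseAlignMask` are hypothetical, and the rounding is done with a plain bit mask rather than a library call; treat it as a sketch, not the library's allocator.

```pascal
program AlignedAllocSketch;
{$mode objfpc}{$H+}

const
  SseAlign = 16;               // assumed SSE alignment in bytes
  SseAlignMask = SseAlign - 1; // low bits that must be zero

procedure AlignedGetMem(out P: Pointer; Size: PtrUInt);
var
  Raw: Pointer;
  Addr: PtrUInt;
begin
  // Room for the payload, the saved raw pointer and worst-case padding.
  GetMem(Raw, Size + SizeOf(Pointer) + SseAlignMask);
  // Leave space for the saved pointer, then round up to the boundary.
  Addr := PtrUInt(Raw) + SizeOf(Pointer);
  Addr := (Addr + SseAlignMask) and not PtrUInt(SseAlignMask);
  P := Pointer(Addr);
  // Remember the raw pointer just below the aligned block.
  PPointer(Addr - SizeOf(Pointer))^ := Raw;
end;

procedure AlignedFreeMem(var P: Pointer);
begin
  if P = nil then
    Exit;
  // Free the block that GetMem actually returned.
  FreeMem(PPointer(PtrUInt(P) - SizeOf(Pointer))^);
  P := nil;
end;

var
  Data: Pointer;
begin
  AlignedGetMem(Data, 100 * 4 * SizeOf(Single)); // 100 four-float vectors
  WriteLn('aligned: ', (PtrUInt(Data) and SseAlignMask) = 0);
  AlignedFreeMem(Data);
  WriteLn('released: ', Data = nil);
end.
```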

Yes, I also prefer to use Vector4f. I've begun updating GLScene a little and I'm removing affine completely, but there are still many, many changes to make before I can test.

dicepd commented 6 years ago

What exactly is the mem structure of a glarray?
1. Just the pointer to the start of the array, with the size passed in the call,
2. or does it have a 'leader' member for handle storage?

If 1, then old-fashioned new(pointer....) should suffice as we declared alignment with the type. If 2, then yes, we need to alloc a larger lump of mem and set the pointer to the appropriate boundary based on sizeof(handle).

The example given does not seem to cope well with realloc for enlarging or shrinking an existing block of mem. Whereas with new at least the mem manager in pascal hopefully has a better chance of not having to do a movemem if heap allows.

dicepd commented 6 years ago

For either case this is probably a good place to use a generic class to wrap all the details of a glarray based on its growth size membersize etc and handle all of this stuff in one place then subclass for 3f 4f 4i ... variants. Advantage of generics is that calculation such as size+sizeof(pointer)+sse_align_mask happens at compile time not run time.

where sse_align_mask becomes (not (sizeof(pointer) - 1)) for xxFFFF000xx or without the not xx00000FFFxx to make it cope with whatever size or mask type we need without const.

Also with a generic we could make it behave exactly like a pascal array and add for each semantics to it. Just look at native generic array to see what I mean and see how little would be needed to adapt that to a glarray.

dicepd commented 6 years ago

Another thought re generics,

TGLArray..... almost same as generic array but with handle storage.

TGLFloatArray ........ Add some way of indicating base type is float. constructor is pass through
                                 method only class
  procedure Scale(AValue: Single) ....... generic SSE scale all values whatever sub class is

{$ALL_ALIGN_16}
TGLFloat2Array specialize TGLFloatArray
   Scale(AValue: TGLZVector2f).....
   .......

{$ALL_ALIGN_16}
TGLFloat4Array specialize TGLFloatArray
  Scale(AValue: TGLZVector4f)......  scale all 

Possible hierarchy with no need for mem align in base class. Note I am still thinking in C++ STL here I don't know off the top of my head how doable this is in fpc May have to resort to pass through helpers in realised classes.
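One way the hierarchy above might look in FPC's objfpc generics syntax is sketched below: a generic base that is a thin typed wrapper around a raw pointer, specialized for Single, with a subclass adding `Scale`. All type names here are hypothetical, the loop is plain Pascal rather than SSE, and no alignment handling is attempted.

```pascal
program GenericArraySketch;
{$mode objfpc}{$H+}
{$pointermath on}

type
  // Hypothetical generic base: owns a block of T and exposes the raw pointer.
  generic TGLArraySketch<T> = class
  public type
    PItem = ^T;
  private
    FData: PItem;
    FCount: Integer;
    function GetItem(AIndex: Integer): T;
    procedure SetItem(AIndex: Integer; const AValue: T);
  public
    constructor Create(ACount: Integer);
    destructor Destroy; override;
    property Data: PItem read FData;      // fast path: raw pointer
    property Count: Integer read FCount;
    property Items[AIndex: Integer]: T read GetItem write SetItem; default;
  end;

  TSingleArrayBase = specialize TGLArraySketch<Single>;

  // Subclass of the specialization adds float-specific operations.
  TFloatArraySketch = class(TSingleArrayBase)
  public
    procedure Scale(AValue: Single);
  end;

constructor TGLArraySketch.Create(ACount: Integer);
begin
  inherited Create;
  FCount := ACount;
  FData := GetMem(ACount * SizeOf(T));
  FillChar(FData^, ACount * SizeOf(T), 0);
end;

destructor TGLArraySketch.Destroy;
begin
  FreeMem(FData);
  inherited Destroy;
end;

function TGLArraySketch.GetItem(AIndex: Integer): T;
begin
  Result := FData[AIndex];
end;

procedure TGLArraySketch.SetItem(AIndex: Integer; const AValue: T);
begin
  FData[AIndex] := AValue;
end;

procedure TFloatArraySketch.Scale(AValue: Single);
var
  I: Integer;
begin
  // Works directly on the pointer; an SSE loop could replace this.
  for I := 0 to Count - 1 do
    Data[I] := Data[I] * AValue;
end;

var
  A: TFloatArraySketch;
  I: Integer;
begin
  A := TFloatArraySketch.Create(4);
  try
    for I := 0 to A.Count - 1 do
      A[I] := I + 1;
    A.Scale(2.0);
    for I := 0 to A.Count - 1 do
      WriteLn(A[I]:0:1);
  finally
    A.Free;
  end;
end.
```

The default `Items` property gives array-style access for casual use, while `Data` hands out the pointer for fast routines, which is the split the comment above argues for.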

jdelauney commented 6 years ago

> What exactly is the mem structure of a glarray ?

In modern OpenGL, glarrays are deprecated (like many other things since OpenGL 3.3). Now, to pass an array of vectors to the shader, we use VAO and VBO buffers.

This is an example of what I did some time ago (I've added some extra comments). It shows how to load data into the modern OpenGL pipeline.

procedure TGLZCustomBufferSceneObject.DoPrepareBuffer;
var
  BufferOffset:Integer;
//  Loc:Integer;
//  AnAttrib : TDGLShaderAttrib;
begin
  if FNeedBuild then
  begin
    DoBuildMesh; // Build Mesh Datas (add vertex, indices, normals, textcoords....)
    if FVBOHandle.Handle = 0 then //Prepare VBO Buffer
    begin
      FVBOHandle.AllocateHandle;
    end
    else
    begin
      FVBOHandle.DestroyHandle;
      FVBOHandle.AllocateHandle;
    end;
    FVBOHandle.Bind; // Lock VBO Buffer

    // Allocate the VBO buffer and set the mesh's vertex data
    // Allocate a buffer on the GPU for ALL data (vertices + normals + texCoords + ...)
    FVBOHandle.BufferData(nil, FMeshData.GetTotalDataSize,
                          cGLVBOMeshModetoGLEnum[FMeshData.MeshMode]);
    // Load our buffer into the GPU buffer (Position, Sizeof_Buffer, TheBuffer)
    FVBOHandle.BufferSubData(0, FMeshData.GetVerticesDataSize, FMeshData.Vertices.List);
    // Set the properties of our data "array"; this is the key to managing our buffer
    // see : https://www.khronos.org/opengl/wiki/GLAPI/glVertexAttribPointer
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, nil);
    glEnableVertexAttribArray(0); // Enable GPU array index 0

    if MeshData.UseNormals then
    begin
      BufferOffset:=FMeshData.GetVerticesDataSize;
      glBufferSubData(GL_ARRAY_BUFFER,
                 BufferOffset,
                 FMeshData.GetNormalsDataSize,
                 FMeshData.Normals.List);

      // FVBOHandle.BufferSubData(BufferOffset, FMeshData.GetNormalsDataSize, FMeshData.Normals.List);
      glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, 0, Pointer(BufferOffset));

      // Disabled alternative (Loc and AnAttrib are commented out in the var block).
      // Note: glBufferData on the bound VBO would reallocate it and drop the vertices uploaded above.
      // glBufferData(GL_ARRAY_BUFFER,
      //            FMeshData.GetNormalsDataSize,
      //            FMeshData.Normals.List,
      //            cGLVBOMeshModetoGLEnum[FMeshData.MeshMode]);
      // Loc := FDefaultShader.GetAttribLocation('In_Normals');
      // FDefaultShader.BindAttribLocation('In_Normals');
      // AnAttrib := FDefaultShader.GetAttribInfo('In_Normals');
      // glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, 0, nil);
      glEnableVertexAttribArray(1);

    end;
    // repeat for texCoords, BiNormals, Tangents, Colors......

    if FEBOHandle.Handle = 0 then //Prepare EBO Buffer
    begin
      FEBOHandle.AllocateHandle;
    end;   
    FEBOHandle.Bind; //Lock EBO Buffer
    FEBOHandle.BufferData(MeshData.Indices.List,MeshData.GetIndicesDataSize, cGLVBOMeshModetoGLEnum[FMeshData.MeshMode]); //Set EBO Buffer Data

    FEBOHandle.UnBind; //UnLock EBO Buffer
    FVBOHandle.UnBind;  // UnLock VBO Buffer
    //@TODO Free MeshDatas ??? . The Datas are now in the GPU
    FNeedBuild:=false;
  end;
end;

So instead of manipulating arrays, we must use a pointer, or a pointer to an array. For managing lists, some objects from GLScene do exactly what we want and what you describe above with "TGLFloat4Array specialize TGLFloatArray". The base of all the lists in GLScene is a PByteArray (see TGLVectorList and its parents).

As you can see, glVertexAttribPointer is very important. And we can stride buffers :)
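To illustrate the stride remark: with an interleaved layout, the stride is the size of one whole vertex and each attribute's offset is its position inside that vertex. The sketch below only computes those two numbers; `TVertexSketch` is a hypothetical layout, and the GL calls are shown as comments because they need a bound VBO and a live context.

```pascal
program InterleavedStrideSketch;
{$mode objfpc}{$H+}

type
  // Hypothetical interleaved vertex: position then normal, 3 floats each.
  TVertexSketch = packed record
    Position: array[0..2] of Single;
    Normal:   array[0..2] of Single;
  end;

var
  V: TVertexSketch;
  Stride, NormalOffset: PtrUInt;
begin
  Stride := SizeOf(TVertexSketch);
  // Byte distance from the start of the vertex to its Normal field.
  NormalOffset := PtrUInt(@V.Normal) - PtrUInt(@V);
  WriteLn('stride = ', Stride, ', normal offset = ', NormalOffset);

  // With a VBO bound, these values would be passed as:
  //   glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, Stride, nil);
  //   glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, Stride, Pointer(NormalOffset));
end.
```

The procedure above instead passes a stride of 0, which tells OpenGL the attribute values are tightly packed, one block per attribute.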

Actually, I've already begun rewriting the base of a 3D engine. First, before playing with OpenGL, I'd like to have a software rendering engine with the same pipeline as OpenGL, with "pure Pascal shaders" to rasterize triangles/screen. I'm well on the way with this. It will be a real crash test for the vectorlib.

I will upload the code later. I need a little more time to give you something acceptable :) At this stage, it should normally be completely compatible with Unix :)

Another thing: now that all our tests are green :) I'd like to start by implementing just some simple functions for manipulating arrays of vectors (offset, scale, rotate, normalize...) and testing them with SSE/AVX without extra code elements (the VectorArrayXXXX procedures commented out in the GLZVectorMath unit).

I smell that this will turn the tests red :)

jdelauney commented 6 years ago

Another thing: I never use specialize and generic lists, first because I don't like that "the code becomes obscure" (lol), and second because it reduces performance a lot compared to static or dynamic arrays.

dicepd commented 6 years ago

> and the 2nd it reduce performance a lot rather than static or dynamic array

What I was suggesting was a generic wrapper around a pointer. Most generics hide the pointer and expose their own methods, but you can use generics as inheritable pointers. For fast code you pass the pointer; the user base can use the nice friendly methods. I notice in GLScene you have 'fast' array methods using pointers to arrays. With a properly designed generic you would be using the pointer by default and the array access as an add-on. Plus, as all our routines are declared ConstRef, they are already callable using a pointer.

As for strides, they are easy with generics: an array of GenericArray with linked iterators. Plus you can have iterator methods like pr := it.cur.previousRow, which removes all the headache of grids in a contiguous array of a type. glArray.setSize(100); glarray.colcount := 10; gives a 10x10 grid, glarray.colcount := 5; a 5x20 grid. itn := cur.nextrow; // itn now holds a pointer to the item/stride in the next row.
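The grid idea boils down to pointer arithmetic over one contiguous block: the same 100 items can be viewed as 10x10 or 5x20 just by changing the column count used to find a row. A minimal sketch, with `RowStart` as a hypothetical helper standing in for the iterator methods:

```pascal
program GridViewSketch;
{$mode objfpc}{$H+}
{$pointermath on}

const
  ItemCount = 100;

// Pointer to the first item of row ARow when the block is AColCount wide.
function RowStart(Base: PSingle; ARow, AColCount: Integer): PSingle;
begin
  Result := Base + ARow * AColCount;
end;

var
  Items: array[0..ItemCount - 1] of Single;
  I: Integer;
  RowPtr: PSingle;
begin
  // Store each item's own index so the row starts are easy to read.
  for I := 0 to ItemCount - 1 do
    Items[I] := I;

  RowPtr := RowStart(@Items[0], 3, 10);
  WriteLn('10x10 grid, row 3 starts at item ', RowPtr^:0:0);

  RowPtr := RowStart(@Items[0], 3, 5);
  WriteLn('5x20 grid, row 3 starts at item ', RowPtr^:0:0);

  // "Next row" is just one more column count further on.
  RowPtr := RowStart(RowPtr, 1, 5);
  WriteLn('next row starts at item ', RowPtr^:0:0);
end.
```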

For tri gen these methods reduce bugcount dramatically.

dicepd commented 6 years ago

> I smell that will run test to Red :)

Well I have held off Matrix testing as nothing seems to work. As it uses a lot of underlying vector code, I wanted to ensure I could trust the vector code before I tackled matrix.

jdelauney commented 6 years ago

> What I was suggesting was a generic wrapper around a pointer, most generic hide the pointer and expose their own methods, but you can use generics as inheritable pointers.

OK, I'm a little lost with the term "Generics".

So if I understand correctly, the prototype could be:

Type
   TGenericArray = class
   private
        FData : Pointer; //PByte
        FColCount : Integer;
        FRowCount : Integer;
        FCursor : Integer; // Where we are in array
        FElementSize : Integer; // Size of one element in array

        procedure SetColCount(Const AValue : Integer);
        procedure SetRowCount(Const AValue : Integer);
   public
       Constructor Create(Const aSize:Longword; aElementSize : Integer);overload;

       procedure setSize(Const aSize:Longword);
       function PrevRow : Pointer;
       function NextRow: Pointer;
       function GetRow(NumofRow : Integer):Pointer;

       property ColCount : Integer Read FColCount Write SetColCount;
       property RowCount : Integer Read FRowCount Write SetRowCount;
       property ElementSize : Integer Read FElementSize; // Write SetElementSize;
   end;

   TGLZVector4fArray = class(TGenericArray)
   private
        procedure SetItem(Const anIndex:Integer; Const AValue : TGLZVector4f);
        function GetItem(Const anIndex:Integer):TGLZVector4f;
        procedure SetRowItem(Const aRow, aCol:Integer; Const AValue : TGLZVector4f);
        function GetRowItem(Const aRow, aCol:Integer):TGLZVector4f;
   protected
       function GetItemPtr:PGLZVector4f; virtual;
   public
        Constructor Create;
        procedure Scale(x: Single); overload;
        procedure Scale(x:TGLZVector4f); overload;

        property RowItem[aRow, ACol: Integer] : TGLZVector4f Read GetRowItem Write SetRowItem;
        property Item[anIndex : Integer]: TGLZVector4f read GetItem Write SetItem;
  end;

  ex generic :
 var
   glArray : TGenericArray;
   ARow : PByte; // byte pointer so Inc steps by ElementSize bytes
   AValue : Single;
   I, J : Integer;

 glArray := TGenericArray.Create(100, SizeOf(Single));
 glArray.ColCount := 10;

  For I := 0 to glArray.RowCount-1 do
  begin
       ARow := glArray.NextRow; //glArray.GetRow(I);
       for J:=0 to glArray.ColCount-1 do
       begin
           AValue := PSingle(ARow)^;
           AValue := AValue * AValue;
           PSingle(ARow)^ := AValue;
           Inc(ARow, glArray.ElementSize);
      end;
  end;

Are you thinking of something like that? If yes, it will be easy to implement my "Fast Bitmap management" in the same style.

dicepd commented 6 years ago

I am a bit short of time atm till Saturday, I will see then if I can produce something along the lines of what I am talking about.

dicepd commented 6 years ago

Finally after many distractions today.

ArrayTest.zip

Far from complete but gives an idea of a pointer vector array see Test procedure in GLZArrays and have a play with that

Just add these files to the test project for now they are nothing like the file layout for final, I just tried to keep it to two files. One to handle the core 'array' which had no deps on project and the other which introduced VectorMath.

jdelauney commented 6 years ago

OK, I've made something functional, a little simpler and more generic, from your code. Generics save us a lot of code.

But before I post the code: what do you think about refining the structure of the library a little, like this?

or something like that?

dicepd commented 6 years ago

I had the idea that we could do libraries of functional blocks such as

  • Utils, [profiler cpuinfo],
  • BaseMath [sse vectors, matrix fastmath]

Basically get these done and in libs so as we move on we are playing with new code with less building, and 'leave' working tested code behind. [I know we will be adding bits but I have my libs docked with my project so all files are accessible and lazarus just knows when it has to rebuild a lib if I edit it]

I know that GLScene seemed to have issues with sectioning code in this way, from a legacy and slow buildup of intertwined code to perceived problems in distribution.

With FPC and Lazarus the code directory thing handles the dependency ordering of projects it hosts, sort of a manifest of dependencies it will handle downloads for. So multiple libs are not a distribution or new user issue like they used to be.

Plus separating into libs generally forces you to make clearer separations in the functionality hierarchy from the beginning, so you do end up with a cleaner object model that is easier to maintain in the longer term. [On one contract I had, I took six months to sort out one client's codebase, and took his build time from two days of pain to push button B and auto.]

Just my thoughts. ;)

jdelauney commented 6 years ago

> I had the idea that we could do libraries of functional blocks such as
>
>   • Utils, [profiler cpuinfo],
>   • BaseMath [sse vectors, matrix fastmath]

Is that from the root or a parent folder?

> Basically get these done and in libs so as we move on we are playing with new code with less building, and 'leave' working tested code behind. [I know we will be adding bits but I have my libs docked with my project so all files are accessible and lazarus just knows when it has to rebuild a lib if I edit it]

I understand; it's not a problem for me with my project.

> I know that GLScene seemed to have issues with sectioning code in this way, from a legacy and slow buildup of intertwined code to perceived problems in distribution.

This is why I've split the GLScene source back into several subfolders, so that I can maintain each part like a sub-lib more easily.

> With FPC and Lazarus the code directory thing handles the dependency ordering of projects it hosts, sort of a manifest of dependencies it will handle downloads for. So multiple libs are not a distribution or new user issue like they used to be.

Yes, and in my project I'm going further: I've split each "theme" to be independent, with only one dependency, "the common part" (which includes the vector lib, but I'm thinking of also splitting that into sub-libs). In the end all my "sub-libs" have the same folder scheme, and all can be merged into one big lib for the final release.

> Plus separating into libs generally forces you to make clearer separations in the functionality hierarchy from the beginning so you do end up with a cleaner object model that is easier to maintain in the longer term.

Yes, that's how I'm working on my project.

> Just my thoughts. ;)

Me too :)

So what would be best for you?

dicepd commented 6 years ago

Split away in folders and we can decide on what to go in what lib later

Make those calls when we want to make a release.