jdelauney / SIMD-VectorMath-UnitTest

For testing asm SIMD (SSE/SSE 2/SSE 3/SSE 4.x / AVX /AVX 2) vector math library (2f, 4f, matrix, quaternion...) with Lazarus and FreePascal Compiler
Mozilla Public License 2.0
8 stars 0 forks source link

Matrices #29

Open dicepd opened 6 years ago

dicepd commented 6 years ago

As you might have guessed from the title I have moved onto Matrix testing.

Testing the inverse function I noticed it is using a Adjunct / Det method of calculating the inverse. Maybe we need to find a row reduction method as well to test. In pascal it would be far more efficient, though in SSE not so sure as lots of branching in row reduction.

jdelauney commented 6 years ago

It will be difficult inverse function translated from Intel. It's already optimized but i found this article https://lxjk.github.io/2017/09/03/Fast-4x4-Matrix-Inverse-with-SSE-SIMD-Explained.html perhaps wee'll can find a way. After we'll can surely optimize multiplication and transpose..

Some others links i found :

But best optimizations, i think will come from AVX :

jdelauney commented 6 years ago

Another links i've just found

dicepd commented 6 years ago

Ok I just finished optimising the pascal code for Inverse. I have a SF of 4.5 by just coding it properly in pascal :)

dicepd commented 6 years ago

I did the same to deteminant, not as dramatic and as Invert uses it, this reduces the Invert speed factor a tad. The changed routine should be easier to translate to SSE as well as they are all 'inline'

jdelauney commented 6 years ago

Yep, i forget we can use mXX :) But now TestInvert is breaked it is normal ?

jdelauney commented 6 years ago

ok sorry the functionnal test is break not normal test. And another strange thing.when i launch a test for 1st time only Invert is breaked but if i run tests 2nd time ,AddMatrix, Add Single, SubMatrix, SubSingle, and getDeterminant failed

dicepd commented 6 years ago

No idea why the second run does that.... As to Invert it is about as broken as it can be Intel code is incomprehensible I am slowly working through https://lxjk.github.io/2017/09/03/Fast-4x4-Matrix-Inverse-with-SSE-SIMD-Explained.html#_appendix this version with your first attempt as a base, and sorting out the many bugs in that. This may take quite a while to work through.

jdelauney commented 6 years ago

It's clear i(ve passed more one day last time with invert. https://godbolt.org can help for converting C intrinsic to asm code

dicepd commented 6 years ago

Small progress, I finally got the det working, just hope I have enough regs left.

Some of those C intrinsics and macros expand quite a bit even ignoring the stack pushes on godbolt.

jdelauney commented 6 years ago

Attention to the number of register available in 32bit xmm0-to 7 and in 64bit xmm0 to 15. But if we do not have 32bit SSE version, it's not really serious because these systems are more and more obsolete. And you've already improved native code

dicepd commented 6 years ago

I have it nearly working it passes all test that have a non zero Det however it is calculating a det for matrix that should have a det of zero. 14 Registers used in this. Not much chance of keeping it to 8 will have to use that stack for 32bit.

jdelauney commented 6 years ago

Good news.14 is not bad, some native code use more vars (often 16). And Don't worry about the 32bit version

I fall on this : https://softwareengineering.stackexchange.com/questions/305908/which-algorithm-is-performant-for-matrix-multiplication-of-4x4-matrices-of-affin it may be of interest to you

dicepd commented 6 years ago

Yeah I looked at the strassen algo but it is only good for single threaded not a good fit for SIMD We will see the best boost by keeping reg usage down (apart from setup) and optimsing array ops for the most common usage patterns. If you have noticed some of my code has already been indicating the loop unrolling factor for arrays.

dicepd commented 6 years ago

Ok done, it has a SF of 12.381953 times faster than original code, and 3.165963 over the new optimised pascal code.

dicepd commented 6 years ago

Your new helper has got me confused? I would expect something named TGLZVector2D to be a vector of doubles, however it is declared as

  TGLZVector2D = TGLZVector2f;
  PGLZVector2D = ^TGLZVector2D;             

This will lead to great confusion. We (me at least ) will want some double variants going forward, would it not be better to have standard double class even if it is not fully fleshed out as this point.

jdelauney commented 6 years ago

Sorry, i didn't have many time this afternoon. You matrix invert is very interresting i don't understand all yet.

Ok done, it has a SF of 12.381953 times faster than original code, and 3.165963 over the new optimised pascal code.

I have the same SF :) (3,092384)

Your new helper has got me confused? I would expect something named TGLZVector2D to be a vector of doubles, however it is declared as

I understand this is why i capitalize the "D" So we can also rename TGLZVectorHelper to TGLZVector4fHelper, and do the same with TGLZVector2f"Helper" no problem for me and it will less confusing. 👍

dicepd commented 6 years ago

New opt for you to test with Matrix multiply Vector. I am now at 2.761040 SF

jdelauney commented 6 years ago

For me 5 tests failed :

I don't see all change you've made. I've just copied new multiply Matrix by vector from Unix64 to win64. Just Change RDI by A or RDX is the same

With older version i've 5 other tests failed

dicepd commented 6 years ago

Ok I have just gotten round to RotationMatricies, this will need some careful testing by me then. I was starting to see something strange,which I am trying to get to the bottom of, I knew that getting all the rotations right was going to be a pain!!! I just have Matrix tests swithed on at the moment, will put Quat back on as well now.

dicepd commented 6 years ago

Ok I have got to the root of the problem I think, The routine is declared as class operator TGLZMatrix4f.*(constref A: TGLZMatrix4f; constref B: TGLZVector4f): TGLZVector4f;

Which to me suggests Matrix * Vector

What was originally coded was Vector Matrix and these two definitions are not the same. I coded tests for MV and changed pascal and assembler to suit.

As I have got further into the test, I am also finding that most ops need Vector * Matrix

I need to do a lot of alterations and tests to find out which is dominant. If org code works better then I will have to rework tests and assembler but I will change the declaration to reflect the fact that we are doing V M rather than a M V

dicepd commented 6 years ago

Ok have completed that bit, we now have consistent rotations and as a side effect Transform now wants the numbers where they should be in [x,y,z].W as per the standard math books.

jdelauney commented 6 years ago

class operator TGLZMatrix4f.*(constref A: TGLZMatrix4f; constref B: TGLZVector4f): TGLZVector4f;

Which to me suggests Matrix * Vector

What was originally coded was Vector Matrix and these two definitions are not the same. I coded tests for MV and changed pascal and assembler to suit.

I've noticed this at the begining of dev but not notice and take care that order of mul will be important

All is green now :)

Some change is necessary in native code. it have 2 or 3 place where the transform is compute MatVector instead of VectorMat.

In the same way, like you've noticed with SmoothStep we can also add operator Scalar, Vector in Addition to Vector, Scalar

dicepd commented 6 years ago

Right after reviewing all these changes I have found an error in my logic in getting this working. I based MV versus VM based on the first encounter of where it mattered the non diagonal Transform. Getting this working originally got me stuck in the VM mode It Should actually be MV so hopefully by the time you read this I will have fixed it up correctly.

dicepd commented 6 years ago

I am going to open a big can of worms here I suspect ;)

Keep in mind we are doing slower M.V again after last changes.

I have just done the timing test for M.V versus V.M on my win64 with an i5-2500 I get large performance boost for V.M ~16ms ->~10ms. For my older AMD FX(tm)-8120 Eight-Core (never that good on fpu performance) 19.5ms -> 17.8ms.

So it would seem there is an advantage in using Col-Major vectors for iteration over arrays of 4f. We have two choices

  1. M-T then use V.M I think that will work will write some tests to make sure.
  2. Base Internal matrices as col-major.

I hope the M-T and V.M works ok as it will be easier to implement and one transform before hitting a large array will be a lot cheaper that the transforms that are done in M.V.

dicepd commented 6 years ago

Happy days again Transpose then Vec * Mat works fine. :+1: It would seem that this is also faster than quaternion transform.

jdelauney commented 6 years ago

All ok for me to. Speed factor for :

dicepd commented 6 years ago

LookAt matrix done, efficiencies gained from using true math, removal of one transpose, have changed CrossProduct in 4f, this should return a vector not a point. Tests expanded to show this works SIMD

Check I have the intent of LookAt correct, comments in the functional test. Another pascal improvement commented in native inc, need to test these when/if we convert to asm. Already has SF of 2.3 by just using existing routines done in asm.

edit: updated for wrong assumption should behave as expected more code removed.

dicepd commented 6 years ago

And another to check over the tests. I am going to start analysing the what the code is doing and try to find out why the last tthree tests in CreateParallelProjectionMatrix fails.

jdelauney commented 6 years ago

I'm just updload your latest tests. All work for me except AABB to ClipRect in Boundingbox

jdelauney commented 6 years ago

this is the result : AABB To ClipRect does not match : native : (X: 105,28728 ,Y: 72,80651 ,Z: 111,63911 ,W: 83,38283) <>(X: 549,40002 ,Y: -1583,59985 ,Z: 820,59991 ,W: -1312,39990)

jdelauney commented 6 years ago

and with matrix result :

AABB To ClipRect does not match : (X: 105,28728 ,Y: 72,80651 ,Z: 111,63911 ,W: 83,38283)<=>(X: 549,40002 ,Y: -1583,59985 ,Z: 820,59991 ,W: -1312,39990)

nMat : |(X: 1,00000 ,Y: 0,00000 ,Z: 0,00000 ,W: 0,00000)| |(X: 0,00000 ,Y: 1,00000 ,Z: 0,00000 ,W: 0,00000)| |(X: 0,00000 ,Y: 0,00000 ,Z: 0,00000 ,W: 0,00000)| |(X: 0,00000 ,Y: 0,00000 ,Z: 8,51200 ,W: 1,00000)|

aMat : |(X: 1,00000 ,Y: 0,00000 ,Z: 0,00000 ,W: 0,00000)| |(X: 0,00000 ,Y: 1,00000 ,Z: 0,00000 ,W: 0,00000)| |(X: 0,00000 ,Y: 0,00000 ,Z: 0,00000 ,W: 8,51200)| |(X: 0,00000 ,Y: 0,00000 ,Z: 0,00000 ,W: 1,00000)|

jdelauney commented 6 years ago

the native code doesn't reflect TGLZMAtrix4f CreateParallelProjectionMatrix i think you've made correction in TGMAtrix4f. And this is the right CreateParallelProjectionMatrix function ?

dicepd commented 6 years ago

Ok back from my little break, I see you have fixed the above issue.

I also see the Fustrum definintion but no code as yet, I have dummied a file with the defs for now so I can compile.

We are now at the stage in testing Matrices where we need to decide what pipelines we need to implement, looking at the next on my list the CreateOrtho and CreatePerspective these are currently biased toward GL. Do we need GDI style Creates and do you know what Vulcan is expecting in this area? (being Kronos developed I would expect this to follow GL but have not read any specs on Vulcan as yet)/

jdelauney commented 6 years ago

Hi,

I also see the Fustrum definintion but no code as yet

huch ! i don't take care something happens with my last commit

jdelauney commented 6 years ago

Sorry, i've made strange manipulations. So now it's ok

Do we need GDI style Creates and do you know what Vulcan is expecting in this area? (being Kronos developed I would expect this to follow GL but have not read any specs on Vulcan as yet)/

GDI, will use the same scheme as OpenGL., and we turned to a right hand sytem. So not a problem. Vulkan following the same scheme as OpenGL sor our function are good

jdelauney commented 6 years ago

Just in waiting i've finished the pipeline for shader and work very well in software mode. Now i must finished the 3D rendering pipeline, but I need to think again at certain points.

the first screenshoot is a displaced raymarching sphere with phong illumination and second an effect from http://glslsandbox.com/e#45003.0 ( you can't see but it's animated)

2018-02-12_171009

2018-02-12_170843

The pit of doom here is the sin/cos procedure and i cannot use fastmath for the raymarching. i must find a way to stay in range of -pi..pi or find and another way to compute fast SinCos with more accuracy

Take a look

2018-02-12_171514

jdelauney commented 6 years ago

I have some questions for you about matrices. I not find a good answer on web. I'd like to introduce Affine Matrix for transforming 2D points.

if i use TGLZMatrix4f as base represented by

[ a, b, c, d ] [ e, f, g, h ] [ i, j, k, l ] [ 0, 0, 0, 1 ]

i representing in my mind this structure TGLZMatrix2f = Class(TGLZMatrix3f) = class(TGLZMatrix4f) as affine like this

[ a, b, tx, 0 ] [ c, d, ty, 0 ] [ 0, 0, 1, 0 ] [ 0, 0, 0 0 ]

my question is the opreations like mul, transpose, invert.... in TGLZMatrix4f working as is if i set data like above or i must code new classes functions specially for matrix 3x3 and 2x2 ?

dicepd commented 6 years ago

using a 4x4 matrix on a vector4f will work for mult det and transpose if

if tx,ty are 0 then det will be det of abef as long as k and [4,4] are 1.

Transpose will always work as it just reflects around the unit diagonal.

As to tx and ty I do not know but will have to test that they ok in the pos given or need to move to d and h ( I have a sneaking suspicion that they will need to be in d and h. Translate works fine where you have Tx and Ty

a and d are scaleX and scaleY as usual.

As to invert will have to look into that I cannot say off the top of my head. Inverse works fine Will update this as I test.

so we have

. 0 1 2 3
0 ScaleX RotA TransX Zero
1 RotA ScaleY TransY Zero
2 Zero Zero One Zero
3 Zero Zero Zero One

We will however need new methods to Create Rot Create Scale Create Transform as existing code either puts Transform in wrong place for this or stomps over where we want TransX and TransY to be.

jdelauney commented 6 years ago

Thanks Peter is more clear now for me.

Now the question on how do implementation :

We must simply add the creation functions such as 'create2DMatrix (Tx, Ty, Rx, Ry, Sx, Sx)' in TGLZMatrixHelper. Or it is better to duplicate the code and adapt the code asm according to the size of the matrix (3x3) as you did with TGLZHmgPlane? what do you think ?

dicepd commented 6 years ago

Do another class rather than a helper as many optimisations can be done for 2D work

Still do not like the 2D naming as it indicates Array of double or Vector2 of double it is realy a 2f helper as we may also need a 2i helper as well for 2D work.

I will need a Vector2d going forward due to certain routines not having enough precision. FPUs can get away with singles as they do tend to extend bits somewhat and the reverse polish stack oriented usage of fpus does lead to less rounding errors than we get using SIMD instructions which are all done in isolation and compound errors more quickly than fpu stack.

jdelauney commented 6 years ago

Do another class rather than a helper as many optimisations can be done for 2D work

Ok i'begin in this way

Still do not like the 2D naming as it indicates Array of double or Vector2 of double it is realy a 2f helper as we may also need a 2i helper as well for 2D work.

No problem for me i'm open :) rename like you want. I follow you.

I will need a Vector2d going forward due to certain routines not having enough precision. FPUs can get away with singles as they do tend to extend bits somewhat and the reverse polish stack oriented usage of fpus does lead to less rounding errors than we get using SIMD instructions which are all done in isolation and compound errors more quickly than fpu stack.

If we need to work with FPU for Double.I must already have some asm code in a corner of my hd

dicepd commented 6 years ago

We can use double in SIMD it has lots of double oriented instructions, at least for 2D vectors, they still fit into 128 xmm register.

jdelauney commented 6 years ago

Yes, it will be not very difficult to write now. in most case we just need change ps by pd