Open dicepd opened 6 years ago
It will be difficult inverse function translated from Intel. It's already optimized but i found this article https://lxjk.github.io/2017/09/03/Fast-4x4-Matrix-Inverse-with-SSE-SIMD-Explained.html perhaps wee'll can find a way. After we'll can surely optimize multiplication and transpose..
Some others links i found :
https://www.gamedev.net/forums/topic/621951-sse-4x4-matrix-transpose-and-invert/
https://www.gamedev.net/forums/topic/652801-mat4-inverse-sse-optimization/
https://nfrechette.github.io/2017/04/13/modern_simd_matrix_multiplication/
https://pdfs.semanticscholar.org/bc38/c1cd099fb4a1c762d4805f69c545cca5953f.pdf (a little bit to difficult for me, to understand)
But best optimizations, i think will come from AVX :
Ok I just finished optimising the pascal code for Inverse. I have a SF of 4.5 by just coding it properly in pascal :)
I did the same to deteminant, not as dramatic and as Invert uses it, this reduces the Invert speed factor a tad. The changed routine should be easier to translate to SSE as well as they are all 'inline'
Yep, i forget we can use mXX :) But now TestInvert is breaked it is normal ?
ok sorry the functionnal test is break not normal test. And another strange thing.when i launch a test for 1st time only Invert is breaked but if i run tests 2nd time ,AddMatrix, Add Single, SubMatrix, SubSingle, and getDeterminant failed
No idea why the second run does that.... As to Invert it is about as broken as it can be Intel code is incomprehensible I am slowly working through https://lxjk.github.io/2017/09/03/Fast-4x4-Matrix-Inverse-with-SSE-SIMD-Explained.html#_appendix this version with your first attempt as a base, and sorting out the many bugs in that. This may take quite a while to work through.
It's clear i(ve passed more one day last time with invert. https://godbolt.org can help for converting C intrinsic to asm code
Small progress, I finally got the det working, just hope I have enough regs left.
Some of those C intrinsics and macros expand quite a bit even ignoring the stack pushes on godbolt.
Attention to the number of register available in 32bit xmm0-to 7 and in 64bit xmm0 to 15. But if we do not have 32bit SSE version, it's not really serious because these systems are more and more obsolete. And you've already improved native code
I have it nearly working it passes all test that have a non zero Det however it is calculating a det for matrix that should have a det of zero. 14 Registers used in this. Not much chance of keeping it to 8 will have to use that stack for 32bit.
Good news.14 is not bad, some native code use more vars (often 16). And Don't worry about the 32bit version
I fall on this : https://softwareengineering.stackexchange.com/questions/305908/which-algorithm-is-performant-for-matrix-multiplication-of-4x4-matrices-of-affin it may be of interest to you
Yeah I looked at the strassen algo but it is only good for single threaded not a good fit for SIMD We will see the best boost by keeping reg usage down (apart from setup) and optimsing array ops for the most common usage patterns. If you have noticed some of my code has already been indicating the loop unrolling factor for arrays.
Ok done, it has a SF of 12.381953 times faster than original code, and 3.165963 over the new optimised pascal code.
Your new helper has got me confused? I would expect something named TGLZVector2D to be a vector of doubles, however it is declared as
TGLZVector2D = TGLZVector2f;
PGLZVector2D = ^TGLZVector2D;
This will lead to great confusion. We (me at least ) will want some double variants going forward, would it not be better to have standard double class even if it is not fully fleshed out as this point.
Sorry, i didn't have many time this afternoon. You matrix invert is very interresting i don't understand all yet.
Ok done, it has a SF of 12.381953 times faster than original code, and 3.165963 over the new optimised pascal code.
I have the same SF :) (3,092384)
Your new helper has got me confused? I would expect something named TGLZVector2D to be a vector of doubles, however it is declared as
I understand this is why i capitalize the "D" So we can also rename TGLZVectorHelper to TGLZVector4fHelper, and do the same with TGLZVector2f"Helper" no problem for me and it will less confusing. 👍
New opt for you to test with Matrix multiply Vector. I am now at 2.761040 SF
For me 5 tests failed :
I don't see all change you've made. I've just copied new multiply Matrix by vector from Unix64 to win64. Just Change RDI by A or RDX is the same
With older version i've 5 other tests failed
Ok I have just gotten round to RotationMatricies, this will need some careful testing by me then. I was starting to see something strange,which I am trying to get to the bottom of, I knew that getting all the rotations right was going to be a pain!!! I just have Matrix tests swithed on at the moment, will put Quat back on as well now.
Ok I have got to the root of the problem I think, The routine is declared as class operator TGLZMatrix4f.*(constref A: TGLZMatrix4f; constref B: TGLZVector4f): TGLZVector4f;
Which to me suggests Matrix * Vector
What was originally coded was Vector Matrix and these two definitions are not the same. I coded tests for MV and changed pascal and assembler to suit.
As I have got further into the test, I am also finding that most ops need Vector * Matrix
I need to do a lot of alterations and tests to find out which is dominant. If org code works better then I will have to rework tests and assembler but I will change the declaration to reflect the fact that we are doing V M rather than a M V
Ok have completed that bit, we now have consistent rotations and as a side effect Transform now wants the numbers where they should be in [x,y,z].W as per the standard math books.
class operator TGLZMatrix4f.*(constref A: TGLZMatrix4f; constref B: TGLZVector4f): TGLZVector4f;
Which to me suggests Matrix * Vector
What was originally coded was Vector Matrix and these two definitions are not the same. I coded tests for MV and changed pascal and assembler to suit.
I've noticed this at the begining of dev but not notice and take care that order of mul will be important
All is green now :)
Some change is necessary in native code. it have 2 or 3 place where the transform is compute MatVector instead of VectorMat.
In the same way, like you've noticed with SmoothStep we can also add operator Scalar, Vector in Addition to Vector, Scalar
Right after reviewing all these changes I have found an error in my logic in getting this working. I based MV versus VM based on the first encounter of where it mattered the non diagonal Transform. Getting this working originally got me stuck in the VM mode It Should actually be MV so hopefully by the time you read this I will have fixed it up correctly.
I am going to open a big can of worms here I suspect ;)
Keep in mind we are doing slower M.V again after last changes.
I have just done the timing test for M.V versus V.M on my win64 with an i5-2500 I get large performance boost for V.M ~16ms ->~10ms. For my older AMD FX(tm)-8120 Eight-Core (never that good on fpu performance) 19.5ms -> 17.8ms.
So it would seem there is an advantage in using Col-Major vectors for iteration over arrays of 4f. We have two choices
I hope the M-T and V.M works ok as it will be easier to implement and one transform before hitting a large array will be a lot cheaper that the transforms that are done in M.V.
Happy days again Transpose then Vec * Mat works fine. :+1: It would seem that this is also faster than quaternion transform.
All ok for me to. Speed factor for :
LookAt matrix done, efficiencies gained from using true math, removal of one transpose, have changed CrossProduct in 4f, this should return a vector not a point. Tests expanded to show this works SIMD
Check I have the intent of LookAt correct, comments in the functional test. Another pascal improvement commented in native inc, need to test these when/if we convert to asm. Already has SF of 2.3 by just using existing routines done in asm.
edit: updated for wrong assumption should behave as expected more code removed.
And another to check over the tests. I am going to start analysing the what the code is doing and try to find out why the last tthree tests in CreateParallelProjectionMatrix fails.
I'm just updload your latest tests. All work for me except AABB to ClipRect in Boundingbox
this is the result : AABB To ClipRect does not match : native : (X: 105,28728 ,Y: 72,80651 ,Z: 111,63911 ,W: 83,38283) <>(X: 549,40002 ,Y: -1583,59985 ,Z: 820,59991 ,W: -1312,39990)
and with matrix result :
AABB To ClipRect does not match : (X: 105,28728 ,Y: 72,80651 ,Z: 111,63911 ,W: 83,38283)<=>(X: 549,40002 ,Y: -1583,59985 ,Z: 820,59991 ,W: -1312,39990)
nMat : |(X: 1,00000 ,Y: 0,00000 ,Z: 0,00000 ,W: 0,00000)| |(X: 0,00000 ,Y: 1,00000 ,Z: 0,00000 ,W: 0,00000)| |(X: 0,00000 ,Y: 0,00000 ,Z: 0,00000 ,W: 0,00000)| |(X: 0,00000 ,Y: 0,00000 ,Z: 8,51200 ,W: 1,00000)|
aMat : |(X: 1,00000 ,Y: 0,00000 ,Z: 0,00000 ,W: 0,00000)| |(X: 0,00000 ,Y: 1,00000 ,Z: 0,00000 ,W: 0,00000)| |(X: 0,00000 ,Y: 0,00000 ,Z: 0,00000 ,W: 8,51200)| |(X: 0,00000 ,Y: 0,00000 ,Z: 0,00000 ,W: 1,00000)|
the native code doesn't reflect TGLZMAtrix4f CreateParallelProjectionMatrix i think you've made correction in TGMAtrix4f. And this is the right CreateParallelProjectionMatrix function ?
Ok back from my little break, I see you have fixed the above issue.
I also see the Fustrum definintion but no code as yet, I have dummied a file with the defs for now so I can compile.
We are now at the stage in testing Matrices where we need to decide what pipelines we need to implement, looking at the next on my list the CreateOrtho and CreatePerspective these are currently biased toward GL. Do we need GDI style Creates and do you know what Vulcan is expecting in this area? (being Kronos developed I would expect this to follow GL but have not read any specs on Vulcan as yet)/
Hi,
I also see the Fustrum definintion but no code as yet
huch ! i don't take care something happens with my last commit
Sorry, i've made strange manipulations. So now it's ok
Do we need GDI style Creates and do you know what Vulcan is expecting in this area? (being Kronos developed I would expect this to follow GL but have not read any specs on Vulcan as yet)/
GDI, will use the same scheme as OpenGL., and we turned to a right hand sytem. So not a problem. Vulkan following the same scheme as OpenGL sor our function are good
Just in waiting i've finished the pipeline for shader and work very well in software mode. Now i must finished the 3D rendering pipeline, but I need to think again at certain points.
the first screenshoot is a displaced raymarching sphere with phong illumination and second an effect from http://glslsandbox.com/e#45003.0 ( you can't see but it's animated)
The pit of doom here is the sin/cos procedure and i cannot use fastmath for the raymarching. i must find a way to stay in range of -pi..pi or find and another way to compute fast SinCos with more accuracy
Take a look
I have some questions for you about matrices. I not find a good answer on web. I'd like to introduce Affine Matrix for transforming 2D points.
if i use TGLZMatrix4f as base represented by
[ a, b, c, d ] [ e, f, g, h ] [ i, j, k, l ] [ 0, 0, 0, 1 ]
i representing in my mind this structure TGLZMatrix2f = Class(TGLZMatrix3f) = class(TGLZMatrix4f) as affine like this
[ a, b, tx, 0 ] [ c, d, ty, 0 ] [ 0, 0, 1, 0 ] [ 0, 0, 0 0 ]
my question is the opreations like mul, transpose, invert.... in TGLZMatrix4f working as is if i set data like above or i must code new classes functions specially for matrix 3x3 and 2x2 ?
using a 4x4 matrix on a vector4f will work for mult det and transpose if
if tx,ty are 0 then det will be det of abef as long as k and [4,4] are 1.
Transpose will always work as it just reflects around the unit diagonal.
As to tx and ty I do not know but will have to test that they ok in the pos given or need to move to d and h ( I have a sneaking suspicion that they will need to be in d and h.
Translate works fine where you have Tx and Ty
a and d are scaleX and scaleY as usual.
As to invert will have to look into that I cannot say off the top of my head. Inverse works fine Will update this as I test.
so we have
. | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | ScaleX | RotA | TransX | Zero |
1 | RotA | ScaleY | TransY | Zero |
2 | Zero | Zero | One | Zero |
3 | Zero | Zero | Zero | One |
We will however need new methods to Create Rot Create Scale Create Transform as existing code either puts Transform in wrong place for this or stomps over where we want TransX and TransY to be.
Thanks Peter is more clear now for me.
Now the question on how do implementation :
We must simply add the creation functions such as 'create2DMatrix (Tx, Ty, Rx, Ry, Sx, Sx)' in TGLZMatrixHelper. Or it is better to duplicate the code and adapt the code asm according to the size of the matrix (3x3) as you did with TGLZHmgPlane? what do you think ?
Do another class rather than a helper as many optimisations can be done for 2D work
Still do not like the 2D naming as it indicates Array of double or Vector2 of double it is realy a 2f helper as we may also need a 2i helper as well for 2D work.
I will need a Vector2d going forward due to certain routines not having enough precision. FPUs can get away with singles as they do tend to extend bits somewhat and the reverse polish stack oriented usage of fpus does lead to less rounding errors than we get using SIMD instructions which are all done in isolation and compound errors more quickly than fpu stack.
Do another class rather than a helper as many optimisations can be done for 2D work
Ok i'begin in this way
Still do not like the 2D naming as it indicates Array of double or Vector2 of double it is realy a 2f helper as we may also need a 2i helper as well for 2D work.
No problem for me i'm open :) rename like you want. I follow you.
I will need a Vector2d going forward due to certain routines not having enough precision. FPUs can get away with singles as they do tend to extend bits somewhat and the reverse polish stack oriented usage of fpus does lead to less rounding errors than we get using SIMD instructions which are all done in isolation and compound errors more quickly than fpu stack.
If we need to work with FPU for Double.I must already have some asm code in a corner of my hd
We can use double in SIMD it has lots of double oriented instructions, at least for 2D vectors, they still fit into 128 xmm register.
Yes, it will be not very difficult to write now. in most case we just need change ps by pd
As you might have guessed from the title I have moved onto Matrix testing.
Testing the inverse function I noticed it is using a Adjunct / Det method of calculating the inverse. Maybe we need to find a row reduction method as well to test. In pascal it would be far more efficient, though in SSE not so sure as lots of branching in row reduction.