Xenko matrix memory layout discussion

xen2 commented 5 years ago

that is indeed a bit problematic. since directx as well es opengl and i guess bullet have exactly the same memory layout. i wonder what was leading to that decision and which action was taken. my feeling is that there was a couple of wrong assumptions of how matrix transposition and coordinate system handedness works.

Originally posted by @tebjan in https://github.com/xenko3d/xenko/pull/289#issuecomment-446223243

xen2 commented 5 years ago

@tebjan I think the idea at the time was to send data to HLSL as is (memcpy) and be able to keep HLSL order mul(W, V, P) (that's usually how we name them and that's the same order as in the C# code too) without having to slap row_major everywhere.

I also remember something like mul(point, World) is generating simpler code: dot3 line by line, while mul(World, point) is doing something slightly more complex where it has to take values from non contiguous memory.

However, I would be totally fine to revisit this and have your opinion. Would it be to switch back to M11/M12 and swap all shader matrix multiplications?

I personally don't like the M11/M21 memory layout either, it's opposite to what most engine do 😄

xen2 commented 5 years ago

FYI we had M11/M12 memory layout in first iteration of the engine the two reasons for the switch were:

(1) previously mentionned HLSL opcodes reasons: dot3 line by line now possible (or maybe it was the opposite? sorry it was in 2012!)

(2) Also I could find in our git history that contrary to what I thought, we had M11/M12 with mul(W, V, P) before the switch but we were transposing data when copying to cbuffer.

Note that due to the various rewrite and optimizations we are now doing, cbuffer is now treated as a memory block without distinction for individual variables when uploading so this kind of transpose in (2) is not possible anymore (and probably something we want to avoid anyway).

If we go back to M11/M12, I think now the only ways would be to either: (a) use row_major (which might not even exist in GLSL and others so probably a no go) (b) swap the matrix multiplications in shaders (I think the performance hit for vec3 transform is quite negligible, I doubt we are ALU bound)

So probably if we decide to go through, we would have to review a bunch of shaders and swap the multiplications.

tebjan commented 5 years ago

interesting write up, at first glance it sounds like a good decision. so mainly it's to have the matrix layout correct for the shaders... i have never seen this combination of memory layout and matrix/vector convention before. it remains to evaluate where this gets the unsuspecting user into trouble. i guess as soon as you get in contact with other libraries (as seen in #289).

before changing anything i would evaluate how other game engines solve that and what kind of conventions they follow? would be interesting to know.

tebjan commented 5 years ago

had a look into a vulkan example since it will probably be the dominant API in the future and they seem to write multiplications in openGL style, also they use the glm math lib: https://vulkan-tutorial.com/Uniform_buffers/Descriptor_layout_and_buffer#page_Vertex_shader

could the xksl -> v-spir compiler take care of something like this?

xen2 commented 5 years ago

OpenGL/Spir-V conversion is already taking care of that (and behavior can even be swapped using a flag): https://github.com/xenko3d/xenko/search?q=NoSwapForBinaryMatrixOperation&unscoped_q=NoSwapForBinaryMatrixOperation

However, this is irrelevent: If we were to swap memory layout and mul order in xksl, we wouldn't need to touch that flag anyway as OpenGL code would end up being swapped too (unless we start to use row_major on D3D and nothing on OpenGL).

I think we currently end up with proj * view * world * pos in OpenGL already (might be good to make sure of that).

xen2 commented 5 years ago

For reference:

Shader code with M11/M12:

   0: mul r0.xyzw, v0.yyyy, World[1].xyzw
   1: mad r0.xyzw, World[0].xyzw, v0.xxxx, r0.xyzw
   2: mad r0.xyzw, World[2].xyzw, v0.zzzz, r0.xyzw
   3: mad r0.xyzw, World[3].xyzw, v0.wwww, r0.xyzw

Shader code with M11/M21:

   0: dp4 r0.x, v0.xyzw, World[0].xyzw
   1: dp4 r0.y, v0.xyzw, World[1].xyzw
   2: dp4 r0.z, v0.xyzw, World[2].xyzw
   3: dp4 r0.w, v0.xyzw, World[3].xyzw

So no worries there, it's not bigger and the write differences (component by component vs full vector update) should be negligeable compared to all other stuff we do in the shader. Let's ignore those details for the memory layout choice.

Engines:

Unity: memory layout M11/M12
UE4: memory layout M11/M12
Xenko: memory layout M11/M21

So far, the biggest downside to switch back to M11/M12 I can think of is inconsistencies between shader and C# code:

C# (user): pos * world * view * proj
XKSL & HLSL (user): proj * view * world * pos
GLSL (auto-generated): pos * world * view * proj

xen2 commented 5 years ago

One last detail that comes to mind: simpler matrices (i.e. bones) can often be encoded in float4x3. Currently it takes 12 bytes, switching to M11/M12 would make them take 16 bytes. Even though it was planned for mobile device with small cbuffer, we didn't do it due to limitations in our ES2 layer, so we can ignore this issue. The impact is also negligible with recent GPU.

xen2 commented 5 years ago

So all things considered, once we decide what to do with the C#/HLSL differences (keep it like this? or swap C#?), I would be fine to go back to M11/M12 memory layout order.

tebjan commented 5 years ago

excellent write up, and now that i understand the decisions you made better, i am not sure about my vote. personally i also like the pos * world * view * proj order in shader + c# very much. it is also what i am used to from writing shaders.

but if the majority of engines and examples from vulkan and nvidia use the openGL style, i would rather go this way and re-learn it. just be be able to easily copy GPU gems or other examples into XKSL and have minimal work to adapt it.

another wild idea could be, since you have a flag and the code already, that you can define the mul order in the shader file somehow, via a flag/#define or something like that...

xen2 commented 5 years ago

Took a quick look, it's true most D3D samples tend to have the pos * world * view * proj order. I took a closer look and it's because they either have:

row_major: https://github.com/Microsoft/DirectX-Graphics-Samples/blob/07008938a0dc5a187a23abcb55b61f8c2809c874/Samples/Desktop/D3D12nBodyGravity/src/ParticleDraw.hlsl#L47
they transpose manually, ex: https://github.com/Microsoft/DirectX-Graphics-Samples/blob/07008938a0dc5a187a23abcb55b61f8c2809c874/Samples/UWP/D3D12Bundles/src/FrameResource.cpp#L140
they use ID3DX11EffectMatrixVariable which do the transpose under the hood: https://github.com/Microsoft/FX11/blob/db9527b6ea49553596c702fcb39c34472d4750de/EffectVariable.inl#L2674 (you can notice Transpose (false) is checked against `pType->NumericType.IsColumnMajor‘　which is true

stride3d / stride

Xenko matrix memory layout discussion #291