column_major and row_major matrix annotations, WRT pre/post matrix multiplication

JasonHuang3D commented 2 years ago

Current states by using latest dawn and latest tint on DX12:

Pre-multiplication:
```
var transformedPos:vec4<f32> = vec4<f32>(input.position, 1.0) * uParams.matrix;
```
And after tint compiling it to DXBC (using PIX for DX12 as profiler for dawn):

Post-multiplication:

var transformedPos:vec4<f32> = uParams.matrix * vec4<f32>(input.position, 1.0);

and DXBC from tint:

As we can see, the second one with post-multiplication used 3 multiply and 3 add instructions where the first one used 4 x dp4.

Proposed optimization for tint:

Using "mad" instruction (multiply and add) of Shader Module 5 assembly to optimize the post-multiplication from second one, the best we can get is:
```
mul r0.xyzw, v0.yyyy, CB0[0][1].xyzw
mad r0.xyzw, v0.xxxx, CB0[0][0].xyzw, r0.xyzw
mad r0.xyzw, v0.zzzz, CB0[0][2].xyzw, r0.xyzw
add r0.xyzw, r0.xyzw, CB0[0][3].xyzw  
```
Instead of using 6 instructions, now we are using 4.

Proposed wgsl feature:

Adding a "variable-type modifier" to wgsl which is similar to HLSL. e.g.

   struct GlobalParams                                                                                                                                  
    {
       row_major viewMatrix : mat4x4<f32>;
       column_major projMatrix : mat4x4<f32>;                                                                                                                                 
    };

So now multiplication for matrix becomes:

Pre-multiplications:

// Pre- multiplication with raw-major matrix
var transformedPos:vec4<f32> = vec4<f32>(input.position, 1.0) * uParams.viewMatrix; 
// DXBC:
// 1 mul + 2 mad + 1 add, as described above.

// Pre- multiplication with  column-major matrix
var transformedPos:vec4<f32> = vec4<f32>(input.position, 1.0) * uParams.projMatrix ; 
// DXBC:
// 4 x dp4, as described above.

Post-multiplications:

// Post- multiplication with raw-major matrix
var transformedPos:vec4<f32> = uParams.viewMatrix * vec4<f32>(input.position, 1.0); 
// DXBC:
// 4 x dp4, as described above.

// Post- multiplication with  column-major matrix
var transformedPos:vec4<f32> = uParams.projMatrix * vec4<f32>(input.position, 1.0) ; 
// DXBC:
// 1 mul + 2 mad + 1 add, as described above.

Benifiets:
- It is not changing the constant buffer memory layout order, but the generated shader assembly instructions. It only gives compiler a hint to generate desired instructions according to the variable-type modifier.
- Gives a more explicit way for the user to control their matrix binding strategy without depending on changing the constant buffer layout on CPU or sometimes need a transpose for different major order matrix on CPU. e.g. a row major based matrix lib on CPU side can now use it in shader with pre-multiplication and without transpose before uploaded to GPU buffer.
```
// viewMatrix as decalred as row_major
//  Now passing raw-major matrix from cpu without transposing would work the same as post-multiplication(matrix * vector )
var transformedPos:vec4<f32> = vec4<f32>(input.position, 1.0) * uParams.viewMatrix; 
```
- Let user to determine where should be used with either "4xdp4" or "1 mul + 2 mad + 1 add".

Refactor current documentations of matrix multiplication, as it always suggesting to use post multiplication, which is not always true, as when running on DX12, current implementation has a bit performance overhead as described above.

X-OldGentleMan commented 2 years ago

good job！ The pos * Matrix compiler does have a much smaller instruction set, and it does improve a lot, in the case of a very large number of vertices

Kangz commented 2 years ago

Just to make sure I understand, can you confirm that this issue contains three things:

a feature request for some [[column_major]] / [[row_major]] attributes to help with bindings and performance of matrix multiplication?
an issue report for the codegen of Tint that produces a bit too much code and could instead use MAD instructions directly?
a request to have vector * matrix appear somewhere in the WGSL spec to make it clear that it is an allowed operation?

The two requests related to the standard can be here, but the Tint codegen issue should be on crbug.com/tint because this issue tracker is not for implementations. I'll do it be reusing the Dawn issue you filed.

ben-clayton commented 2 years ago

Tint does not generate DXBC instructions directly. It generates HLSL which is handed off to FXC to be compiled. Due to a number of bugs in FXC (which is no longer maintained), we have resorted to disabling FXC's optimizations. We are aware that this is not a good situation to be in, and we are currently evaluating our options - these include generating DXIL (SM 6+) instead and/or using an alternative way to generate the DXBC.

ben-clayton commented 2 years ago

Linking to the tint bug to close the loop.

JasonHuang3D commented 2 years ago

a feature request for some [[column_major]] / [[row_major]] attributes to help with bindings and performance of matrix multiplication?

Yes, the benefits of doing this I have been mentioned above. To be more clear, it only gives a hint to the shader compiler to use a proper matrix math instead of changing the GPU buffer read or layouts.

a request to have vector * matrix appear somewhere in the WGSL spec to make it clear that it is an allowed operation?

Yes, and as current implementation of tint, it is suggested to use vector matrix rather than matrix vector. At least when targeting DX12.

an issue report for the codegen of Tint that produces a bit too much code and could instead use MAD instructions directly?

@Kangz Sorry to post the implementation issue here, but since I believe it is a good ideal to put them together and make the issue context to be more clear.

JasonHuang3D commented 2 years ago

Tint does not generate DXBC instructions directly. It generates HLSL which is handed off to FXC to be compiled. Due to a number of bugs in FXC (which is no longer maintained), we have resorted to disabling FXC's optimizations. We are aware that this is not a good situation to be in, and we are currently evaluating our options - these include generating DXIL (SM 6+) instead and/or using an alternative way to generate the DXBC.

@ben-clayton Thanks for clarifying things much better, I agree that abandon old FXC is a good ideal. Are you suggesting we use DXC instead? do you need any help for the new implementations to speed up the progress?

kvark commented 2 years ago

I find these two points contradicting each other:

It is not changing the constant buffer memory layout order, but the generated shader assembly instructions. It only gives compiler a hint to generate desired instructions according to the variable-type modifier. Gives a more explicit way for the user to control their matrix binding strategy without depending on changing the constant buffer layout on CPU or sometimes need a transpose for different major order matrix on CPU. e.g. a row major based matrix lib on CPU side can now use it in shader with pre-multiplication and without transpose before uploaded to GPU buffer.

I understand the layout precisely as the rule to bind a particular name/value of the shader to memory. If your suggested attribute changes the "binding strategy" then it's changing the layout.

Exposing row/column majority in WGSL would certainly make some of the things simpler, like translation from GLSL and SPIR-V to WGSL. I don't know if this is needed so much that we'd want it for V1. Probably not? Easy to add at any point later.

magcius commented 2 years ago

There's some challenges in how we want to handle this. There's a few different concepts here: packing order of matrices (viewing all 16 elements of a float4x4 matrix as they exist in uniform buffer memory, is arr[1] part of the first column or the first row?), access matrix order (in the shading language itself, is m[0] a row or a column vector?), and matrix-vector multiplication order (does m * v treat v as a row or a column vector?). HLSL and GLSL differ in these semantics in subtle and unobvious ways.

From what I can tell, WGSL is currently defined to match GLSL semantics:

arr[1] belongs to the first column.
m[0] is a column vector
m * v treats v as a row vector (e.g. out[i] = dot(m[i], v))

This proposal is to add a row_major flag similar to what HLSL and GLSL have. These flags only change the packing semantics: arr[1] would belong to the first row, but m[0] would still be a column vector, and m * v would still treat v as a row vector.

Note that this is not necessary for any optimizations; as you discovered, switching multiplication order changes the interpretation of the vector (v * m treats v as a column vector, which can use mad instead of dot). One can also call transpose manually: transpose(m) * v. So this shouldn't be necessary for any performance improvements.

Personally, the confusion and subtle inconsistencies around matrices in shader languages is why I prefer not to use any of the native matrix types myself; it's easy enough to do array<vec4<f32>, 3> yourself, and do your own dot products.

kdashg commented 2 years ago

m * v treats v as a row vector (e.g. out[i] = dot(m[i], v))

I think you mean col vector here. Col-vecs have one column, row-vecs have one row, so Mrow_vec isn't a valid matrix multiplication possibility. (`ArowsXBcols BrowsXCcols -> ArowsXCcols`, B must match)

You should definitely use the real matrix types, though I agree that they can be confusing which of matrixAxB is rows vs cols.

kdashg commented 2 years ago

Basically, the math yields 4-dots behavior for vec4*mat4x4ColMajor and mat4x4RowMajor*vec4 both.

magcius commented 2 years ago

After reviewing my notes, yes, I did mean column vector. The rest of the comment still stands, I believe.

ben-clayton commented 2 years ago

Tint already swaps the LHS and RHS arguments for matrix multiplies when generating HLSL (I assume Naga does the same), so that the matrix memory layout matches the expectations of WGSL.

I feel there's a lot of different things being discussed here, so to reiterate @Kangz's earlier comment: (a) Chrome is producing inefficient DXBC for matrix multiplies. Discussed further here. I don't think there's much more to discuss here. (b) Maybe we want to support column-major and row-major matrices, which I imagine would require flipping the LHS and RHS of matrix multiplies, and adjusting matrix indexing ops to avoid ~costly~ transposes. (c) We might want to improve the spec to make it clearer we support vector matrix and matrix vector.

If people feel strongly about (b) and (c), I suggest they are filed as separate issues and we close this issue.

edit: Updated to remove the suggestion that a transpose would likely be more costly than flipping matrix multiply order. Most GPUs are now scalar architecture, so they'd likely evaluate to the same thing.

JasonHuang3D commented 2 years ago

This proposal is to add a row_major flag similar to what HLSL and GLSL have. These flags only change the packing semantics: > arr[1] would belong to the first row, but m[0] would still be a column vector, and m * v would still treat v as a row vector.

@magcius I think Tint is generating the constant buffer struct as multiple of uint4 array for now. So whatever you put inside of a uniform struct type, will be translating to a unit4 array. e.g.

[[block]]
struct GlobalParams                                                                                                                                   
{
    viewMatrix : mat4x4<f32>;
    projMatrix : mat4x4<f32>;
    rotationMatrix : mat4x4<f32>;                                                                                                                                         
};
[[group(0), binding(0)]] var<uniform> uGlobalParams: GlobalParams;

After tint translating this into HLSL will be:

cbuffer cbuffer_tint_symbol_12
{
   uint4 tint_symbol_12[12];          // Offset:    0 Size:   192
}

My proposal is not actually changing the packing semantics, it ONLY gives a hint to shader compiler to generate different operation sets where you use those matrices with multiplication specifically.

Basically, the math yields 4-dots behavior for vec4mat4x4ColMajor and mat4x4RowMajorvec4 both.

@jdashg Exactly. it only changes the operations set generated from shader compiler.

@kvark Sorry for any confusion:

I did not mean to change any buffer layout or resource bindings, those 2 row/column type modifier only changes the final instructions set with matrix math. Please kindly refer to what @jdashg posted above.
It dose change the "binding strategy" from user perspective, as it gives more options for users for how they send their matrix values from CPU. Consider we are using a row-major matrix lib on CPU side, The followings scenarios provide the same result without explicitly change any gpu buffer layout.
- Case 1:
  - Transpose the matrix on CPU side.
  - In shader, declare matrix as column-major type modifier.
  - In shader, use *vector matrix**.
- Case 2:
  - Transpose the matrix on CPU side.
  - In shader, declare matrix as row-major type modifier.
  - In shader, use *matrix vector**.
- Case 3:
  - Do not Transpose the matrix on CPU side.
  - In shader, declare matrix as row-major type modifier.
  - In shader, use vector * matrix.
- Case 4:
  - Do not Transpose the matrix on CPU side.
  - In shader, declare matrix as column-major type modifier.
  - In shader, use matrix* vector.

greggman commented 2 years ago

Let me first say I probably don't know what I'm talking about but......

Looking at all the hoops ANGLE has to jump through to support GLSL column_major on MSL (and SPIR-V?) is insane! Entire shaders have to be re-written an expanded, matrices copied, matrix arrays copied, etc..... You don't want to pass this complexity on to WGSL/WebGPU IMO. The supposed gain you'll get on DX12 will be a huge loss on Metal (and SPIR-V?)

If this is somehow trivial to support across platforms then I have no objections but my impression, looking at ANGLE, is that this is not trivial to support across platforms. Maybe ANGLE just chose the wrong solution? But in any case, please check this is actually trivial to support before proceeding.

magcius commented 2 years ago

It's not a huge loss on any platform. It's basically a wash. The DXBC bytecode comparison is misleading.

kdashg commented 2 years ago

WGSL meeting minutes 2022-05-10

* DN: BC interacted and impression was person hadn’t understood what was done. Haven’t read myself. * KG: Proposal to add row/column major. Case where not really sure it will matter. The abstract operations do count differently but in silicon may not be faster either way. Feel like we want to say doesn’t matter and to come back if find out it does matter. * MM: Sounds like a post-v1 feature. * KG: Can through in that box. * MM: **Retitle also to actual request instead of `issues`**

gpuweb / gpuweb