tenstorrent / tt-metal


Matmul Tiny Tiles #12403

Open rtawfik01 opened 2 months ago

rtawfik01 commented 2 months ago

Matmuls with smaller shapes for input 0 (SrcB) are needed to enable higher performance for lower-batch LLMs.

There are 2 main tasks:

Current shapes supported for Matmuls in the LLKs are:

  1. 16x32 (Input0) & 32x32 (Input 1)
  2. 16x32 (Input0) & 32x16 (Input 1)
  3. 32x32 (Input0) & 32x16 (Input 1)

They need flags to be exposed in the llk_math_matmul_api.h layer and in compute_kernel_api/matmul.h to enable them. This issue is also documented here: https://github.com/tenstorrent/tt-metal/issues/8122

Other shapes, such as:

  1. 8x32 (Input 0)
  2. 4x32 (Input 0)
  3. 2x32 (Input 0)
  4. 1x32 (Input 0)

need to be investigated, to check whether support can be added for them as well. The FPU function for matrix multiplication, MVMUL, performs Dest[8,16] = SrcB[8,16] * SrcA[16,16], so the FPU's lowest granularity for input 0 is 8 rows. To enable the smaller shapes, SrcB will have to be zeroed first (ZEROSRC) and the smaller shapes (1x32, 2x32, 4x32) unpacked on top of it. It still needs to be investigated whether the FPU can work with the smaller shapes; a host-side reference model of the scheme is sketched below.
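To make the granularity constraint concrete, here is a minimal host-side reference model of one MVMUL step under the ZEROSRC scheme described above; the function and constant names are hypothetical, and this is an illustration, not the LLK implementation:

```cpp
#include <array>
#include <cstddef>

// Hypothetical reference model (illustration only) of one MVMUL step:
// Dest[8,16] = SrcB[8,16] * SrcA[16,16]. For a tiny input 0 with r < 8
// valid rows, the padding rows of SrcB are zeroed first (the ZEROSRC idea),
// the FPU still computes all 8 output rows, and the packer would then write
// out only the first r rows.
constexpr std::size_t kSrcBRows = 8;   // FPU row granularity for input 0
constexpr std::size_t kInnerDim = 16;  // shared/inner dimension
constexpr std::size_t kSrcACols = 16;  // columns of SrcA and Dest

using SrcB = std::array<std::array<float, kInnerDim>, kSrcBRows>;
using SrcA = std::array<std::array<float, kSrcACols>, kInnerDim>;
using Dest = std::array<std::array<float, kSrcACols>, kSrcBRows>;

Dest mvmul_reference(SrcB src_b, const SrcA& src_a, std::size_t valid_rows) {
    // Model ZEROSRC: rows beyond the tiny-tile height contribute zeros
    // instead of stale data.
    for (std::size_t i = valid_rows; i < kSrcBRows; ++i) {
        src_b[i].fill(0.0f);
    }
    Dest dest{};  // zero-initialized accumulator
    for (std::size_t i = 0; i < kSrcBRows; ++i) {  // FPU always does all 8 rows
        for (std::size_t j = 0; j < kSrcACols; ++j) {
            for (std::size_t k = 0; k < kInnerDim; ++k) {
                dest[i][j] += src_b[i][k] * src_a[k][j];
            }
        }
    }
    return dest;  // conceptually, the packer keeps only rows [0, valid_rows)
}
```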

@davorchap For this task, the LLK team will need a testing environment that is single core, with input tensors that can be modified, and that supports tensors with tile dimensions less than 32x32. Runtime support will also be needed to allow circular buffers to hold tiles with smaller shapes.

@ttmtrajkovic fyi, we will need more investigation for the second task; 8x32 might be easier to support, but we need the testing environment first.

davorchap commented 2 months ago

For 1x32, the unpacker could load just 1 row, and matmul can operate on garbage for the other 7 rows?

And then we pack out just the 1 row we care about.

rtawfik01 commented 2 months ago

> For 1x32, the unpacker could load just 1 row, and matmul can operate on garbage for the other 7 rows? And then we pack out just the 1 row we care about.

Yes, the unpacker/packer would only operate on the number of rows required, but the FPU will still have to do the full 8x16.

rtawfik01 commented 2 months ago

@mywoodstock @yugaoTT Please let me know if you guys can make a test for matmul with smaller tile shapes to enable the LLK work, or I can assign or ping someone else. The test should be single core, the input tensors should support the variable tile shapes shown above, and it should be possible to set a variable number of tiles as well for debug.

yugaoTT commented 2 months ago

@rtawfik01 I can make the test. Can you show me how to enable the small tiny tiles? Is it going to be an extra argument passed down to matmul?

davorchap commented 2 months ago

> For 1x32, the unpacker could load just 1 row, and matmul can operate on garbage for the other 7 rows? And then we pack out just the 1 row we care about.
>
> Yes, the unpacker/packer would only operate on the number of rows required, but the FPU will still have to do the full 8x16.

Yes, that's good!

rtawfik01 commented 2 months ago

> @rtawfik01 I can make the test. Can you show me how to enable the small tiny tiles? Is it going to be an extra argument passed down to matmul?

In this file: tt_metal/third_party/tt_llk_wormhole_b0/llk_lib/llk_unpack_AB_matmul.h, you'll find these flags for a few different function calls:

unpA_face_r_dim 
unpB_face_r_dim 
unpA_num_faces
unpB_num_faces
unpA_partial_face
unpB_partial_face

There are 2 ways to populate the above flags: either we expose them in the API here: tt_metal/hw/ckernels/wormhole_b0/metal/llk_api/llk_unpack_AB_matmul_api.h, or we generate arrays in the build folder, similar to the chlkc_*pack_data_format arrays, that hold the num_faces, face_r_dim, and partial_face information per CB id (that is the way it was done in Buda). The latter would be done in this file: tt_metal/jit_build/genfiles.cpp. A sketch of the generated-array option follows.
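For illustration, here is a hedged sketch of what the generated-array option might look like; the array names, the CB count, and the example values are assumptions patterned after the existing chlkc_*pack_data_format arrays, not actual jit_build output:

```cpp
// Hypothetical generated header (the kind of thing genfiles.cpp could emit),
// one entry per CB id. Illustrative values: CB 0 holds full 32x32 tiles,
// CB 1 holds 8x32 tiny tiles; remaining entries are zero-filled here only
// for brevity.
#include <cstdint>

constexpr std::uint32_t NUM_CIRCULAR_BUFFERS = 32;

// Rows per tile face in each CB (16 for a full face, <16 for tiny tiles).
constexpr std::uint8_t chlkc_tile_face_r_dim[NUM_CIRCULAR_BUFFERS] = {16, 8};

// Faces per tile in each CB (e.g. 4 for 32x32, 2 for 16x32 or 8x32).
constexpr std::uint8_t chlkc_tile_num_faces[NUM_CIRCULAR_BUFFERS] = {4, 2};

// Whether the tile in each CB only partially fills its faces (face_r_dim < 16).
constexpr bool chlkc_tile_partial_face[NUM_CIRCULAR_BUFFERS] = {false, true};
```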

Similar to the unpacker, those same flags are in: tt_metal/third_party/tt_llk_wormhole_b0/llk_lib/llk_math_matmul.h

in0_tile_r_dim
in0_tile_c_dim
in1_tile_r_dim
in1_tile_c_dim
partial_face

and again these would either need to be exposed in the API or set from the arrays that hold the num_faces & face_r_dim information per CB id. A hypothetical sketch of the exposure option follows.
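As a rough sketch of the exposure option, the tile dimensions could become compile-time parameters of the math-matmul init in the llk_api layer; the name and signature below are purely hypothetical, not the current API:

```cpp
#include <cstdint>

// Hypothetical wrapper sketch only -- not existing llk_math_matmul_api.h code.
// The tile-dimension flags listed above become template parameters, defaulting
// to the full 32x32 tile so existing call sites keep their behavior.
template <
    std::uint32_t in0_tile_r_dim = 32, std::uint32_t in0_tile_c_dim = 32,
    std::uint32_t in1_tile_r_dim = 32, std::uint32_t in1_tile_c_dim = 32,
    bool partial_face = false>
inline void llk_math_matmul_init_tiny_tiles(std::uint32_t operandA, std::uint32_t operandB) {
    // Would forward the tile dims to the underlying llk_math_matmul init so
    // the math sequence is configured for the smaller tile heights.
    (void)operandA;
    (void)operandB;
}
```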

If we choose the method of generating arrays in header define files with information about num_faces, face_r_dim, etc., then we will also need to discuss what kind of data structure jit_build would need to receive that information; a strawman is sketched below.
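As a starting point for that discussion, the per-CB tile geometry could travel from the host into jit_build in something like the following descriptor; this struct is a strawman for illustration, not existing code:

```cpp
#include <array>
#include <cstdint>

// Strawman descriptor for the tile geometry of one circular buffer, passed
// to jit_build so genfiles.cpp can emit the per-CB arrays sketched above.
// Field names mirror the flags discussed in this thread; nothing here exists yet.
struct TileDescriptor {
    std::uint8_t tile_r_dim = 32;   // tile rows (32, 16, 8, 4, 2, or 1)
    std::uint8_t tile_c_dim = 32;   // tile columns
    std::uint8_t face_r_dim = 16;   // rows per face
    std::uint8_t num_faces = 4;     // faces per tile
    bool partial_face = false;      // true when face_r_dim < 16
};

// One descriptor per CB id, defaulting to full 32x32 tiles.
using CBTileDescriptors = std::array<TileDescriptor, 32>;
```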

@yugaoTT we can sync to discuss what the best methodology is here from the op team point of view.

amahmudTT commented 2 months ago

@davorchap @yugaoTT do you know the timeline by which this feature needs to be completed?

amahmudTT commented 1 month ago

Support for shapes 16x32, 32x16, and 16x16 was added in the above PR. Other shapes, such as:

8x32, 4x32, 2x32, 1x32

still need to be worked on.

johanna-rock-tt commented 2 weeks ago

We'll need tiny tiles support (at least for shapes 8x32 and 1x32) for the TG Llama sprint. The target date is end of year, but we'll need time to test and also to implement higher-level changes in the ops that use it. So it would be good to have this supported by end of November or early December.

johanna-rock-tt commented 2 weeks ago

I don't know about the priority relative to other work going on, e.g. the Blackhole work, as @amahmudTT mentioned. Maybe @davorchap can help there.