rtawfik01 opened this issue 2 months ago
For 1x32, could the unpacker load just 1 row, with matmul operating on garbage for the other 7 rows? Then we pack out just the 1 row we care about.
> For 1x32, could the unpacker load just 1 row, with matmul operating on garbage for the other 7 rows? Then we pack out just the 1 row we care about.
Yes, the unpacker/packer would only operate on the number of rows required, but the FPU will still have to do the full 8x16.
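The exchange above can be sketched numerically: even though the FPU always computes the full Dest[8, 16] = SrcB[8, 16] * SrcA[16, 16], the rows of SrcB beyond the valid ones can hold garbage, and only the valid rows are packed out. A minimal sketch in plain C++ (illustration only, not LLK code):

```cpp
#include <cassert>
#include <cstdlib>

constexpr int kRows = 8, kCols = 16, kInner = 16;

// One FPU-granularity pass: Dest[8,16] = SrcB[8,16] * SrcA[16,16].
void mvmul_like(const int B[kRows][kInner], const int A[kInner][kCols],
                int D[kRows][kCols]) {
  for (int r = 0; r < kRows; ++r)
    for (int c = 0; c < kCols; ++c) {
      int acc = 0;
      for (int k = 0; k < kInner; ++k) acc += B[r][k] * A[k][c];
      D[r][c] = acc;
    }
}

// Only the first `valid_rows` rows of SrcB are meaningful; the rest are
// "garbage". Returns 0 if the rows we actually pack out are still correct.
int demo(int valid_rows) {
  int B[kRows][kInner], A[kInner][kCols], D[kRows][kCols];
  for (int r = 0; r < kRows; ++r)
    for (int k = 0; k < kInner; ++k)
      B[r][k] = (r < valid_rows) ? 1 : std::rand() % 100;  // garbage rows
  for (int k = 0; k < kInner; ++k)
    for (int c = 0; c < kCols; ++c) A[k][c] = 1;
  mvmul_like(B, A, D);
  // Packer reads only valid_rows rows; each element should be 16 (sum of ones).
  for (int r = 0; r < valid_rows; ++r)
    for (int c = 0; c < kCols; ++c)
      if (D[r][c] != kInner) return -1;
  return 0;
}
```

The garbage rows of Dest are simply never packed, so their contents don't matter.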
@mywoodstock @yugaoTT Please let me know if you guys can make a test for matmul with smaller tile shapes to enable the LLK work, or I can also assign or ping someone else. The test should be single core, input tensors should be able to support variable tile shapes as shown above, and able to set variable number of tiles as well for debug.
@rtawfik01 I can make the test, can you show me how to enable small tiny tiles? Is it going to be extra argument passed down to matmul?
> For 1x32, could the unpacker load just 1 row, with matmul operating on garbage for the other 7 rows? Then we pack out just the 1 row we care about.
>
> Yes, the unpacker/packer would only operate on the number of rows required, but the FPU will still have to do the full 8x16.
Yes, that's good!
> @rtawfik01 I can make the test, can you show me how to enable small tiny tiles? Is it going to be extra argument passed down to matmul?
In this file: tt_metal/third_party/tt_llk_wormhole_b0/llk_lib/llk_unpack_AB_matmul.h, you'll find these flags in a few different function calls:
unpA_face_r_dim
unpB_face_r_dim
unpA_num_faces
unpB_num_faces
unpA_partial_face
unpB_partial_face
There are 2 ways to populate the above flags: either we expose them in the API here: tt_metal/hw/ckernels/wormhole_b0/metal/llk_api/llk_unpack_AB_matmul_api.h, or we generate an array in the built folder, similar to the chlkc_*pack_data_format arrays, that has the num_faces, face_r_dim, and partial_tile information per CB id (that is the way it was done in Buda). That would be done in this file: tt_metal/jit_build/genfiles.cpp.
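As a rough illustration of what these flags encode: assuming faces are at most 16x16 and a tile shorter than 16 rows collapses into partial faces, the unpacker parameters could be derived from the tile shape roughly like this. This is a hypothetical helper sketching my reading of the flags above, not the LLK's actual logic:

```cpp
#include <cassert>

// Hypothetical (illustration only): derive unpacker flags for one operand
// from its tile dimensions, assuming 16x16 maximum face size.
struct FaceParams {
  unsigned face_r_dim;  // rows per face, e.g. unpA_face_r_dim
  unsigned num_faces;   // faces per tile, e.g. unpA_num_faces
  bool partial_face;    // face has fewer than 16 valid rows
};

FaceParams face_params_for(unsigned tile_r_dim, unsigned tile_c_dim) {
  FaceParams p;
  unsigned c_faces = (tile_c_dim + 15) / 16;  // faces along the column dim
  if (tile_r_dim >= 16) {
    p.face_r_dim = 16;
    p.num_faces = (tile_r_dim / 16) * c_faces;  // e.g. 32x32 -> 4 faces
    p.partial_face = false;
  } else {
    p.face_r_dim = tile_r_dim;  // e.g. 1, 2, 4, 8 for the tiny shapes
    p.num_faces = c_faces;
    p.partial_face = true;
  }
  return p;
}
```

Under these assumptions a 32x32 tile yields 4 full 16x16 faces, while a 1x32 tile yields 2 partial faces with face_r_dim = 1.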
Similar to the unpacker, those same flags are in: tt_metal/third_party/tt_llk_wormhole_b0/llk_lib/llk_math_matmul.h
in0_tile_r_dim
in0_tile_c_dim
in1_tile_r_dim
in1_tile_c_dim
partial_face
and again would either need to be exposed or set from the arrays that have information about num_faces & face_r_dim per CB id.
If we choose the method of generating arrays in header files with information about num_faces, face_r_dim, etc., then we will also need to discuss what kind of data structure jit_build would need in order to receive that information.
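If the generated-array route is taken, the emitted header could look roughly like the existing chlkc_*pack_data_format arrays: one entry per circular buffer id, which the LLK api layer looks up instead of taking explicit arguments. A hypothetical sketch (all names, sizes, and example values here are assumptions, not the real generated output):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical generated header, analogous to the chlkc_*pack_data_format
// arrays: jit_build would emit one entry per circular buffer id.
constexpr int NUM_CIRCULAR_BUFFERS = 32;

struct CbTileDims {
  uint8_t num_faces;
  uint8_t face_r_dim;
  bool partial_face;
};

// Example values: CB 0 holds 1x32 tiles (2 partial faces, 1 row each),
// CB 1 holds regular 32x32 tiles (4 full 16-row faces).
constexpr CbTileDims cb_tile_dims[NUM_CIRCULAR_BUFFERS] = {
    /* cb 0 */ {2, 1, true},
    /* cb 1 */ {4, 16, false},
    // remaining CBs are value-initialized to {0, 0, false}
};

// The LLK api layer would query the flags by operand CB id.
constexpr CbTileDims get_tile_dims(uint32_t cb_id) {
  return cb_tile_dims[cb_id];
}
```

The open question flagged above is exactly what this struct should contain and how jit_build learns the per-CB tile shapes in order to emit it.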
@yugaoTT we can sync to discuss what the best methodology is here from the op team point of view.
@davorchap @yugaoTT do you know the timeline by which this feature needs to be completed?
Support for shapes 16x32, 32x16, and 16x16 was added in the above PR. Other shapes, such as 8x32, 4x32, 2x32, and 1x32, still need to be worked on.
We'll need tiny tiles support (at least for shapes 8x32, and 1x32) for the TG llama sprint. Target date is end of year, but we'll need to test and also implement higher level changes in the ops using it. So it would be good to have this supported by end of November or early December.
I don't know about priority relative to other work going on, e.g. blackhole work as @amahmudTT mentioned. Maybe @davorchap can help there.
Matmul with smaller shapes for input 0 (SrcB) is needed to enable higher performance for lower-batch LLMs.
There are 2 main tasks:

1. The shapes currently supported for Matmuls in the LLKs (16x32, 32x16, 16x16) need flags to be exposed at the llk_math_matmul_api.h layer and in compute_kernel_api/matmul.h to enable them. This issue is also documented here: https://github.com/tenstorrent/tt-metal/issues/8122
2. Other shapes, such as 8x32, 4x32, 2x32, and 1x32, need to be investigated to check if support can be added for them as well. The FPU function for matrix multiplication, MVMUL, performs Dest[8, 16] = SrcB[8, 16] * SrcA[16, 16], so the FPU's lowest granularity for input 0 is 8 rows. For enabling smaller shapes, SrcB will have to be ZEROSRC, then unpack the smaller shapes (1, 2, 4 x 32). We still need to investigate whether the FPU will be able to use the smaller shapes.

@davorchap For this task, the LLK team will need to be enabled with a testing environment that is single core, with input tensors that can be modified, and that can support tensors with tile dimensions less than 32x32. There will also be runtime support needed to allow circular buffers to have tiles with smaller shapes.
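A back-of-envelope consequence of the stated granularity: since one MVMUL consumes 8 rows of SrcB at a time, any tile with 8 or fewer rows (8x32, 4x32, 2x32, 1x32) still costs one full FPU pass per 16-wide face, which is why the sub-8-row shapes save unpack/pack work but not FPU work. This is my arithmetic from the Dest[8, 16] granularity described above, ignoring column faces and dest banking:

```cpp
#include <cassert>

// MVMUL computes Dest[8,16] = SrcB[8,16] * SrcA[16,16], so input 0 is
// consumed 8 rows per pass. Passes needed for a tile with `tile_r_dim`
// rows, rounded up to the 8-row granularity (per 16-wide face).
unsigned fpu_passes(unsigned tile_r_dim) {
  return (tile_r_dim + 7) / 8;  // 1x32 .. 8x32 all cost one pass
}
```

So 1x32 through 8x32 each take a single pass, while a full 32-row tile takes four.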
@ttmtrajkovic fyi, we will need more investigation for the second task, 8x32 might be easier to support but need the testing environment first.