tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Pack out narrow sticks contiguously using pack_untilize #12076

Closed yugaoTT closed 5 days ago

yugaoTT commented 2 weeks ago

Currently, pack_untilize packs out narrow (non-tile-size) sticks with gaps in between. For example, when packing out 16B sticks (8 datums in bfp16 format), there is a 64B - 16B = 48B gap between consecutive sticks. We need a new pack_untilize that packs out sticks contiguously, so that there are no gaps in between.
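The arithmetic above can be sketched in plain C++. This is only an illustration of the two layouts, not tt-metal code: the 64B stride is inferred from the "64B-16B" expression in the description, and the helper names (`gapped_offset`, `contiguous_offset`, `bytes_saved`) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Sizes taken from the issue text: 8 datums per stick, 16B per stick.
constexpr std::size_t kDatumsPerStick = 8;
constexpr std::size_t kBytesPerDatum = 2;  // 16B / 8 datums
constexpr std::size_t kStickBytes = kDatumsPerStick * kBytesPerDatum;
// Assumption: sticks currently land 64B apart (implied by "64B-16B = 48B").
constexpr std::size_t kGappedStrideBytes = 64;

// Byte offset of stick i in the current, gapped layout.
constexpr std::size_t gapped_offset(std::size_t i) { return i * kGappedStrideBytes; }

// Byte offset of stick i in the requested contiguous layout.
constexpr std::size_t contiguous_offset(std::size_t i) { return i * kStickBytes; }

// Unused bytes between the end of one stick and the start of the next.
constexpr std::size_t gap_bytes() { return kGappedStrideBytes - kStickBytes; }

static_assert(kStickBytes == 16, "stick size quoted in the issue");
static_assert(gap_bytes() == 48, "gap size quoted in the issue");

// How much earlier the last of n sticks starts once the gaps are removed.
inline std::size_t bytes_saved(std::size_t n) {
    assert(n > 0 && "need at least one stick");
    return gapped_offset(n - 1) - contiguous_offset(n - 1);
}
```

Under these assumptions, the fourth stick starts at byte 192 in the gapped layout and at byte 48 in the contiguous one.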

rtawfik01 commented 2 weeks ago

To summarize what was discussed in the meeting: the proposal is to change the MOP to this:

    ckernel::ckernel_template tmp(MOP_OUTER_LOOP, MOP_INNER_LOOP, TT_OP_PACR(ADDR_MOD_1, ZERO_OUTPUT_FLAG, PACK_SEL(PACKCNT), 0, MEGAROW, 0, 0));
    tmp.set_end_op(TT_OP_INCADCZW(p_setadc::PAC, 0, 0, 1, 0)); // w cnt points to the next tile
    tmp.program(instrn_buffer);

and to set x_dim to the correct value in the init:

    uint pack_x_dim = 8;
    TT_SETADCXX(p_setadc::PAC, pack_x_dim-1, 0x0);

@yugaoTT please let me know if this works.

yugaoTT commented 2 weeks ago

This seems to be working now with a single tile. One more change is needed: the L1 write offset has to be updated, otherwise there will be gaps between the 4 packer writes.
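As a rough illustration of the offset requirement, the sketch below computes base offsets so that each packer write starts exactly where the previous one ended. It is plain C++, not the actual LLK change: the helper `contiguous_write_offsets` is hypothetical, and the thread does not state how many bytes each write produces, so that is left as a parameter.

```cpp
#include <array>
#include <cstddef>

// "the 4 packer writes" mentioned in the thread.
constexpr std::size_t kNumPackerWrites = 4;

// Hypothetical helper: returns the L1 byte offset for each packer write so
// that write k begins where write k-1 ended (no gaps in between).
// bytes_per_write is an assumption supplied by the caller.
inline std::array<std::size_t, kNumPackerWrites>
contiguous_write_offsets(std::size_t bytes_per_write) {
    std::array<std::size_t, kNumPackerWrites> offsets{};
    std::size_t next = 0;
    for (std::size_t k = 0; k < kNumPackerWrites; ++k) {
        offsets[k] = next;        // start of write k
        next += bytes_per_write;  // advance by exactly what was written
    }
    return offsets;
}
```

For example, if each write were to emit 8 sticks of 16B (an assumed figure, 128B per write), the four writes would start at bytes 0, 128, 256, and 384.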

yugaoTT commented 2 weeks ago

Next thing to try: merge the outer loops over num_rows (8) into the MOP inner loop, so that there are no loops outside the MOP.
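The idea of folding an outer loop into an inner loop can be shown with a generic C++ analogy. This does not model the MOP hardware or its loop-count limits; `nested_order` and `flattened_order` are hypothetical names, and the sketch only shows that a single loop of `num_rows * inner_count` iterations visits the same (row, index) pairs in the same order as the nested form.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Step = std::pair<std::size_t, std::size_t>;  // (row, inner index)

// Nested form: a software loop over rows wraps the inner loop.
inline std::vector<Step> nested_order(std::size_t num_rows, std::size_t inner_count) {
    std::vector<Step> order;
    for (std::size_t row = 0; row < num_rows; ++row) {
        for (std::size_t i = 0; i < inner_count; ++i) {
            order.emplace_back(row, i);
        }
    }
    return order;
}

// Flattened form: one loop whose trip count absorbs the row loop.
inline std::vector<Step> flattened_order(std::size_t num_rows, std::size_t inner_count) {
    std::vector<Step> order;
    const std::size_t total = num_rows * inner_count;
    for (std::size_t n = 0; n < total; ++n) {
        // Recover the row and inner index from the single counter.
        order.emplace_back(n / inner_count, n % inner_count);
    }
    return order;
}
```

With num_rows = 8, as in the comment above, and any inner count, both functions return the same sequence, which is the property the merged loop needs to preserve.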

rtawfik01 commented 1 week ago

The submodule changes are pushed here: https://github.com/tenstorrent/tt-llk-wh-b0/pull/35 and https://github.com/tenstorrent/tt-llk-gs/pull/20

Please close this issue once the metal PR is also pushed.