tenstorrent / tt-llk-bh

Tenstorrent low-level tensix kernels for Blackhole
Apache License 2.0

Pack Untilize Kernel Perf Optimization #20

Open rtawfik01 opened 1 month ago

rtawfik01 commented 1 month ago

The pack untilize kernel needed to be rewritten for Blackhole due to the following Blackhole packer limitations:

  1. Only one Dest offset register per packer instance (but strided mode can be used to read rows 16 apart).
  2. PACR instructions can only write contiguous output to L1.
  3. The L1 offset registers are now set per PACR context, not per row output from the PACR instruction (i.e. PACK_INTF_SEL):
    THCON_SEC0_REG1_L1_Dest_addr_ADDR32 -> Context 0
    THCON_SEC0_REG8_L1_Dest_addr_ADDR32 -> Context 1
    THCON_SEC1_REG1_L1_Dest_addr_ADDR32 -> Context 2
    THCON_SEC1_REG8_L1_Dest_addr_ADDR32 -> Context 3
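
A minimal C++ sketch of that register-to-context mapping (register names as listed above; write_cfg_reg() is a hypothetical stand-in for the kernel's actual CFG-write path, and the THCON_*_ADDR32 constants are assumed to come from the architecture's generated CFG defines header):

    #include <cstdint>

    // Placeholder declaration only; the real LLK uses its own CFG-write mechanism.
    void write_cfg_reg(uint32_t cfg_addr, uint32_t value);

    // One L1 dest address CFG register per PACR context (names from the list above).
    constexpr uint32_t ctx_l1_dest_addr_reg[4] = {
        THCON_SEC0_REG1_L1_Dest_addr_ADDR32,  // Context 0
        THCON_SEC0_REG8_L1_Dest_addr_ADDR32,  // Context 1
        THCON_SEC1_REG1_L1_Dest_addr_ADDR32,  // Context 2
        THCON_SEC1_REG8_L1_Dest_addr_ADDR32,  // Context 3
    };

    // Hypothetical helper, not the real LLK API; the address units (bytes vs. 16B words)
    // are an assumption and must match whatever the packer hardware expects.
    inline void set_pack_ctx_l1_dest(uint32_t ctx, uint32_t l1_addr) {
        write_cfg_reg(ctx_l1_dest_addr_reg[ctx], l1_addr);
    }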

Here is the current algorithm (see the attached image):

For Wormhole B0, 8x16 rows can be output per tile (before incrementing counters), while for Blackhole, with the above implementation, only 2x16 can. Since the pack_untilize feature needs to be fused with other operations, we cannot change the unpacker to use a different unpacking scheme (the fastest untilize scheme would be T0F0, T0F1, T1F0, T1F1, T0F2, T0F3, T1F2, T1F3). However, we should use that implementation if the use case is doing a pack_untilize for a block of tiles (i.e. not fused).
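
For reference, a small standalone C++ sketch that enumerates that fastest-untilize face order for a block of tiles (pure illustration, not LLK code; block_ct_dim is an assumed parameter name):

    #include <cstdio>

    int main() {
        const int block_ct_dim = 2;  // tiles per block in this example
        // Walk top faces (F0/F1) of every tile left to right, then bottom faces (F2/F3).
        for (int face_pair = 0; face_pair < 2; ++face_pair) {
            for (int t = 0; t < block_ct_dim; ++t) {
                for (int f = 2 * face_pair; f < 2 * face_pair + 2; ++f) {
                    std::printf("T%dF%d ", t, f);
                }
            }
        }
        std::printf("\n");  // prints: T0F0 T0F1 T1F0 T1F1 T0F2 T0F3 T1F2 T1F3
        return 0;
    }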

One major optimization is to enable the use of contexts:

THCON_SEC0_REG1_L1_Dest_addr_ADDR32 = CNTX0 = tile_offset (configured for top faces)
THCON_SEC0_REG8_L1_Dest_addr_ADDR32 = CNTX1 = tile_offset + block_ct_dim * TILE_C_DIM * datum_size (configured for bottom faces)
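
As a worked example of the second context's offset (illustrative values only: Float16b, block_ct_dim = 2, TILE_C_DIM = 32):

    // Worked example of the CNTX1 offset formula above (illustrative values).
    constexpr uint32_t block_ct_dim = 2;
    constexpr uint32_t TILE_C_DIM   = 32;
    constexpr uint32_t datum_size   = 2;  // Float16b
    constexpr uint32_t cntx1_offset = block_ct_dim * TILE_C_DIM * datum_size;  // = 128 bytes past tile_offset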

The algorithm can then do 4x16 rows, and will change to something like this:

CH0 x_stride = number of bytes per datum (2 bytes for Float16b default)
CH0 y_stride = FACE_C_DIM*x_stride
CH0 z_stride = FACE_R_DIM*y_stride (1x16x16 datums default, already done in pack_hw_config)
CH0 w_stride = 4*z_stride (32x32 by default)
PACK_INTF_SELECT_0 = 0b0011 (read 2 rows from dest)
PACK_INTF_SELECT_1 = 0b1100 (read 2 rows from dest)
Dest Mode = DST_STRIDED_MODE (Each row read from dest is 16 apart)

    for face_r_dim (16 by default):
        for block_ct_dim (2 in this example) {
            PACK(CNTX0, PACK_INTF_SELECT_0); (Reads 2 rows, writes them contiguously to L1, dest rows = 0, 16, 1, 17, 2, 18, ...)
            PACK(CNTX1, PACK_INTF_SELECT_1); (Reads 2 rows, writes them contiguously to L1, dest rows = 32, 48, 33, 49, ...)
            W_CNT += 1 (Jumps to next tile)
        }
        Y_CNT+=1
    }
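
As a worked example, for Float16b (2 bytes per datum) with 16x16 faces, the CH0 strides above evaluate to:

    // Worked CH0 stride values for Float16b, FACE_C_DIM = FACE_R_DIM = 16.
    constexpr uint32_t x_stride = 2;              // bytes per datum
    constexpr uint32_t y_stride = 16 * x_stride;  // 32 bytes   (one face row)
    constexpr uint32_t z_stride = 16 * y_stride;  // 512 bytes  (one 16x16 face)
    constexpr uint32_t w_stride = 4  * z_stride;  // 2048 bytes (one 32x32 tile, 4 faces)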

The problem here is that if contexts are used, the initial hardware configuration (data formats, tile sizes, etc.) also needs to be programmed for each context.

@ttmtrajkovic @rdjogoTT fyi

rtawfik01 commented 4 weeks ago

Measured perf for the pack untilize block test with the following args:

block C dim: 4 tiles
block R dim: 1 tile
Num cores: 1

To reproduce:

git checkout rtawfik/pack_untilize_perf
ENABLE_TRACY=1 scripts/build_scripts/build_with_profiler_opt.sh
ninja tests -C build
ENABLE_TRACY=1 TT_METAL_DEVICE_PROFILER=1 TT_METAL_SLOW_DISPATCH_MODE=1 ./build/test/tt_metal/unit_tests --gtest_filter="*ComputePackUntilize*"

The Tracy GUI results can be seen in the attached image. Cycles can be calculated using the machine's AICLK and the GPU execution time reported by Tracy.
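
As a rough back-of-the-envelope conversion (nothing Tracy-specific, just units; the numbers in the comment are illustrative):

    // cycles = duration_ns * AICLK_MHz / 1000 (pure unit conversion).
    constexpr double cycles(double duration_ns, double aiclk_mhz) {
        return duration_ns * aiclk_mhz / 1000.0;
    }
    // e.g. a 155 ns zone at a 1000 MHz AICLK is ~155 cycles (illustrative values only).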

The branch has the blocking calls for circular buffers and math-pack semaphores commented out. The results for the number of cycles the PACK takes are:

WHB0: ~145 cycles (average of 10 runs)
BH: ~155 cycles (average of 10 runs)

The results consistently show around the same cycle-count difference between BH and WHB0 for blocks of fewer than 6 tiles. For c_dim > 6, Wormhole B0 unfortunately has a smaller instruction buffer depth (16 instructions) compared to Blackhole (32 instructions), and that affects the results, since at around 6 tiles math issues 16 instructions and needs to stall on Wormhole B0. Better results can be obtained from waveforms.

@ttmtrajkovic fyi