tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.

Add ramp up/down for matmul #11308

Open TT-BrianLiu opened 1 month ago

TT-BrianLiu commented 1 month ago

Feature

For di/dt testing, we want to be able to configure matmul to ramp up/down at the beginning and end to smooth out current draw. This is an example of what it will look like:

For implementation, we will group matmul cores together, and cores within a group sync with one another (i.e. only one core within the group can run compute at a time). As we increase the number of active cores, we "shrink" the group size, and vice versa. Ideally, we can configure two things: how many cores start active and how quickly the active count ramps (the initial_cores and ramp_multiple settings below).
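To make the schedule concrete, here is a rough Python sketch of how the active-core count could evolve, assuming the initial_cores and ramp_multiple knobs from the config below. This is illustrative only; the actual logic lives in the matmul kernels on the branch.

```python
# Illustrative sketch of a ramp-up schedule, not the actual kernel code.
# Assumes `initial_cores` cores start active and the count grows by
# `ramp_multiple` each step until every core in the grid is running.
def ramp_up_schedule(total_cores: int, initial_cores: int, ramp_multiple: int) -> list[int]:
    schedule = []
    active = initial_cores
    while active < total_cores:
        schedule.append(active)
        active *= ramp_multiple
    schedule.append(total_cores)
    return schedule

# Example: a 64-core grid, starting with 8 active cores and doubling each step.
# The group size shrinks from 8 cores per active core down to 1.
print(ramp_up_schedule(64, 8, 2))        # [8, 16, 32, 64]
print(ramp_up_schedule(64, 8, 2)[::-1])  # ramp down mirrors this
```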

Implementation

Branch: bliu/didt

For sharded 1D or 2D matmuls, there are two orthogonal features that you can try out by setting this environment variable:

```
TT_MATMUL_RAMP_CONFIG='{"enable_dummy_loops": true, "enable_ramp": false, "initial_cores": 8, "ramp_multiple": 2}'
```

NOTE: You can probably run both features together, but I haven't tested that.
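For reference, the config is just a JSON blob in an environment variable; a minimal sketch of reading it on the host side looks like the following (the exact field handling and defaults on the branch may differ):

```python
import json
import os

# Sketch of reading TT_MATMUL_RAMP_CONFIG; field names mirror the example
# above, but the defaults here are assumptions rather than the branch's.
config = json.loads(os.environ.get("TT_MATMUL_RAMP_CONFIG", "{}"))

enable_dummy_loops = config.get("enable_dummy_loops", False)
enable_ramp = config.get("enable_ramp", False)
initial_cores = config.get("initial_cores", 8)
ramp_multiple = config.get("ramp_multiple", 2)
```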

Test Results with PMON Capture

Tests are run on sjc-snva-t3005 with chip 0, since this is the chip that consistently hangs in the 8-chip hang repro.

PMON visualization:

image

PMON fit:

image

Enable ramp tests

I modified the test to run only 5000 loops for captures instead of 10000. The test command is below (initial_cores is varied across experiments):

```
TT_MATMUL_RAMP_CONFIG='{"enable_dummy_loops": false, "enable_ramp": true, "initial_cores": 8, "ramp_multiple": 2}' \
WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest tests/didt/test_sharded_ff1.py::test_specific_chip_reproduce_matmul_2d_hang_t3000[logical_chip0]
```
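To sweep initial_cores without editing the command each run, something like this hypothetical driver works (not part of the repo; the test path is copied from the command above and the sweep values match the experiments below):

```python
import os
import subprocess

# Hypothetical sweep driver: rerun the same pytest target with different
# initial_cores values in TT_MATMUL_RAMP_CONFIG.
TEST = (
    "tests/didt/test_sharded_ff1.py::"
    "test_specific_chip_reproduce_matmul_2d_hang_t3000[logical_chip0]"
)

for initial_cores in (1, 8, 32):
    env = dict(os.environ)
    env["TT_MATMUL_RAMP_CONFIG"] = (
        '{"enable_dummy_loops": false, "enable_ramp": true, '
        f'"initial_cores": {initial_cores}, "ramp_multiple": 2}}'
    )
    env["WH_ARCH_YAML"] = "wormhole_b0_80_arch_eth_dispatch.yaml"
    subprocess.run(["pytest", TEST], env=env, check=True)
```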

Baseline (i.e. don't set anything related to TT_MATMUL_RAMP_CONFIG):

image

Baseline with stagger (TT_ENABLE_MATMUL_STAGGER=1):

image

Initial cores 1:

image

Initial cores 8:

image

Initial cores 32:

image

Enable dummy loops tests

IMPORTANT EDIT: I think I may have messed up my implementation of these dummy loops. Each core acts independently of the others, so apart from some basic arithmetic overhead on the RISCs, the cores may all be starting at roughly the same time, with the "staggered" cores only turning off near the end of the ramp up. Ramp down should be fine as is, since all cores start off doing work and slowly drop off. To properly implement the ramp up, we need some syncing between cores, which we can do with simple mcasting (sketched below):
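To illustrate the intended fix, here is a conceptual sketch of the sync: a coordinator releases groups of cores one ramp step at a time, so a core cannot start its loops until its step is signalled. In the real kernels this signal would be an mcast semaphore update; the Python threading primitives below are only a stand-in for that.

```python
import threading
import time

# Conceptual illustration only, not tt-metal kernel code. RAMP_STEPS gives
# the number of cores allowed to run after each step; a core waits for the
# first step that includes it before doing any work.
NUM_CORES = 16
RAMP_STEPS = [4, 8, 16]  # illustrative; would come from initial_cores/ramp_multiple
step_go = [threading.Event() for _ in RAMP_STEPS]

def core(core_id: int) -> None:
    start_step = next(i for i, n in enumerate(RAMP_STEPS) if core_id < n)
    step_go[start_step].wait()  # wait for the coordinator's "go" (mcast stand-in)
    # ... run this core's share of the matmul loops here ...

def coordinator() -> None:
    for go in step_go:
        go.set()                # "mcast" the go signal for this ramp step
        time.sleep(0.01)        # spacing between ramp steps

workers = [threading.Thread(target=core, args=(i,)) for i in range(NUM_CORES)]
for w in workers:
    w.start()
threading.Thread(target=coordinator).start()
for w in workers:
    w.join()
```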

Single chip tests

Same as before, except enable_dummy_loops is turned on and enable_ramp is turned off:

```
TT_MATMUL_RAMP_CONFIG='{"enable_dummy_loops": true, "enable_ramp": false, "initial_cores": 8, "ramp_multiple": 2}' \
WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest tests/didt/test_sharded_ff1.py::test_specific_chip_reproduce_matmul_2d_hang_t3000[logical_chip0]
```

Baseline:

image

Initial cores 1:

image

Initial cores 8:

image

Multichip tests

The intent here is to see whether the voltage behaviour differs when running multi-chip workloads vs. a single chip. PMON captures from chip 0 by default, which happens to be the chip I am running my single-chip tests on. If you run the single-chip tests on another chip, you need to switch which device PMON captures from.

Same environment setup as the single-chip tests, but switch the pytest target to the 8-chip one:

```
TT_MATMUL_RAMP_CONFIG='{"enable_dummy_loops": true, "enable_ramp": false, "initial_cores": 8, "ramp_multiple": 2}' \
WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest tests/didt/test_sharded_ff1.py::test_reproduce_matmul_2d_hang[8chips-ff1-hang]
```

Baseline:

image

Initial cores 1:

image

Initial cores 8:

image
TT-BrianLiu commented 3 weeks ago

TLDR: