TT-BrianLiu opened 1 month ago
TLDR:
- `enable_ramp`: This feature makes cores alternate compute during the initial and tail-end matmul loops, which may not be ideal for power usage. So I started with a simpler approach of adding dummy loops at the beginning and end.
- `enable_dummy_loops`: This feature was not implemented correctly. We need to sync the dummy loops across cores with mcasting:
Feature
For di/dt testing, we want to be able to configure matmul to ramp up/down at the beginning and end to smooth out current draw. This is an example of what it will look like:
For implementation, we will group matmul cores together, and cores within a group sync with one another (i.e. only one core within the group can run compute). As we increase active cores, we will "shrink" the group size, and vice versa. Ideally, we can configure two things: how many cores to start with, and how quickly cores turn on/off (see `initial_cores` and `ramp_multiple` under Implementation).
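As a rough sketch of the ramp idea above (a standalone simulation of my own; the function name and exact schedule are assumptions, not code from the branch):

```python
def ramp_schedule(total_cores, initial_cores, ramp_multiple):
    """Cumulative number of active cores at each ramp-up step:
    start at initial_cores and multiply by ramp_multiple until all
    cores are on. Ramp-down would walk the same schedule in reverse."""
    active, steps = initial_cores, []
    while active < total_cores:
        steps.append(active)
        active *= ramp_multiple
    steps.append(total_cores)
    return steps

# e.g. 64 cores, starting from 8 and doubling each step
print(ramp_schedule(64, 8, 2))  # [8, 16, 32, 64]
```

The last step is clamped to the total core count, so non-power-of-two grids (e.g. 56 cores) simply finish with a smaller final jump.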
Implementation
Branch:
bliu/didt
For sharded 1D or 2D matmuls, there are two orthogonal features that you can try out by adding these settings in your environment variables:
- `enable_dummy_loops`: If true, insert additional loops of dummy `matmul_block` compute at the beginning and end of the actual matmul loops to gradually turn cores on/off. In the middle, run the actual matmul at full capacity.
- `enable_ramp`: If true, splits the actual matmul loops into start, middle, and end phases, where cores are grouped and gradually turned on/off within each group.
- `initial_cores`: How many cores to start/end with. This is used to determine the group size, e.g. with 64 cores total, initial cores of 8 means a group size of 8; with 56 cores total, initial cores of 8 means a group size of 7.
- `ramp_multiple`: How fast we turn cores on/off. With initial cores of 8 and a ramp multiple of 2, 16 cores will be running in the second loop, etc. This applies to both the `enable_dummy_loops` and `enable_ramp` features, although they are implemented slightly differently.

NOTE: You can probably run both features together, but I haven't tested that.
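The group-size arithmetic above amounts to floor division (a minimal sketch of my own; the branch may compute this differently):

```python
def group_size(total_cores, initial_cores):
    # One active core per group at the start, so the group size is
    # roughly total_cores / initial_cores, rounded down.
    return total_cores // initial_cores

# Examples from the description above:
print(group_size(64, 8))  # 8
print(group_size(56, 8))  # 7
```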
Test Results with PMON Capture
Tests are run on `sjc-snva-t3005` with chip 0, since this is the chip that consistently hangs in the 8-chip hang repro.

PMON visualization:
PMON fit:
Enable ramp tests
I modified the test to run only 5000 loops for captures instead of 10000. The test command is as follows (`initial_cores` is varied between experiments):

Baseline (i.e. don't set anything related to `TT_MATMUL_RAMP_CONFIG`):

Baseline with stagger (`TT_ENABLE_MATMUL_STAGGER=1`):

Initial cores 1:
Initial cores 8:
Initial cores 32:
Enable dummy loops tests
IMPORTANT EDIT: I think I may have messed up my implementation of these dummy loops. Each core acts independently of the others, so apart from some basic overhead of arithmetic on RISC, the cores may all be starting at roughly the same time, with the "staggered" cores turning off near the end of ramp-up. During ramp-down, this should be fine as-is, since all cores start off doing work and slowly drop off. To properly implement the ramp-up, we need to do some syncing between cores. We can do this by using simple mcasting:
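To make the distinction concrete, here is a small standalone simulation (mine, not tt-metal code) contrasting the buggy independent start with a gated start, where each wave of cores waits for a release signal (e.g. a coordinator core mcasting a semaphore increment):

```python
def start_steps_independent(num_cores):
    # Buggy behavior described above: every core begins compute
    # immediately, so current jumps to full draw at step 0.
    return [0] * num_cores

def start_steps_synced(num_cores, initial_cores, ramp_multiple):
    # Intended behavior: wave k is released only after a sync signal,
    # so the cumulative active-core count grows as
    # initial_cores * ramp_multiple**k. Assumes ramp_multiple >= 2.
    steps, released, target, step = {}, 0, initial_cores, 0
    while released < num_cores:
        new = min(target, num_cores) - released
        for core in range(released, released + new):
            steps[core] = step
        released += new
        target *= ramp_multiple
        step += 1
    return [steps[c] for c in range(num_cores)]

# 64 cores, initial cores 8, ramp multiple 2:
# waves of 8, 8, 16, 32 cores join at steps 0..3
sync = start_steps_synced(64, 8, 2)
print([sync.count(s) for s in range(4)])  # [8, 8, 16, 32]
```

In the independent version every core's start step is 0, which matches the symptom described above: the "ramp" only shows up on the way down.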
Single chip tests
Same as before, except I turn on `enable_dummy_loops` and turn off `enable_ramp`:

Baseline:
Initial cores 1:
Initial cores 8:
Multichip tests
The intent here is to see whether voltage differs when running multi-chip workloads vs. a single chip. PMON by default captures from chip 0, which happens to be the chip I am running my single-chip tests on. If you run single-chip tests on another chip, you need to switch which device PMON captures from.
Same environment setup as the single-chip tests, but switch the pytest option to the 8-chip one:
Baseline:
Initial cores 1:
Initial cores 8: