TT-BrianLiu opened 1 month ago
TLDR:
- `enable_ramp`: This feature makes cores alternate compute during the initial and tail-end matmul loops, which may not be ideal for power usage. So I started with a simpler approach of adding dummy loops at the beginning and end.
- `enable_dummy_loops`: This feature was not implemented correctly. We need to sync the dummy loops across cores with mcasting:
Feature
For di/dt testing, we want to be able to configure matmul to ramp up/down at the beginning and end to smooth out current draw. This is an example of what it will look like:
For implementation, we will group matmul cores together, and cores within a group sync with one another (i.e. only one core within the group can run compute). As we increase active cores, we will "shrink" the group size, and vice versa. Ideally, we can configure two things: how many cores to start with, and how quickly cores turn on/off (see `initial_cores` and `ramp_multiple` under Implementation).
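As a rough sketch of the ramp idea above (a standalone simulation of my own; the function name and exact schedule are assumptions, not code from the branch):

```python
def ramp_schedule(total_cores, initial_cores, ramp_multiple):
    """Cumulative number of active cores at each ramp-up step:
    start at initial_cores and multiply by ramp_multiple until all
    cores are on. Ramp-down would walk the same schedule in reverse."""
    active, steps = initial_cores, []
    while active < total_cores:
        steps.append(active)
        active *= ramp_multiple
    steps.append(total_cores)
    return steps

# e.g. 64 cores, starting from 8 and doubling each step
print(ramp_schedule(64, 8, 2))  # [8, 16, 32, 64]
```

The last step is clamped to the total core count, so non-power-of-two grids (e.g. 56 cores) simply finish with a smaller final jump.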
Implementation
Branch:
bliu/didt
For sharded 1D or 2D matmuls, there are two orthogonal features that you can try out by adding these settings in your environment variables:
- `enable_dummy_loops`: If true, insert additional loops of dummy `matmul_block` compute at the beginning and end of the actual matmul loops to gradually turn cores on/off. In the middle, run the actual matmul at full capacity.
- `enable_ramp`: If true, splits the actual matmul loops into start, middle, and end phases, where cores are grouped and gradually turned on/off within each group.
- `initial_cores`: How many cores to start/end with. This is used to determine the group size, e.g. with 64 cores total, initial cores of 8 means a group size of 8; with 56 cores total, initial cores of 8 means a group size of 7.
- `ramp_multiple`: How fast we turn cores on/off. With initial cores of 8 and a ramp multiple of 2, 16 cores will be running in the second loop, etc. This applies to both the `enable_dummy_loops` and `enable_ramp` features, although they are implemented slightly differently.

NOTE: You can probably run both features together, but I haven't tested that.
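The group-size arithmetic above amounts to floor division (a minimal sketch of my own; the branch may compute this differently):

```python
def group_size(total_cores, initial_cores):
    # One active core per group at the start, so the group size is
    # roughly total_cores / initial_cores, rounded down.
    return total_cores // initial_cores

# Examples from the description above:
print(group_size(64, 8))  # 8
print(group_size(56, 8))  # 7
```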
Test Results with PMON Capture
Tests are run on `sjc-snva-t3005` with chip 0, since this is the chip that consistently hangs in the 8-chip hang repro.

PMON visualization:
PMON fit:
Enable ramp tests
I modified the test to run only 5000 loops for captures instead of 10000. The test command is as follows (`initial_cores` is varied between experiments):

Baseline (i.e. don't set anything related to `TT_MATMUL_RAMP_CONFIG`):

Baseline with stagger (`TT_ENABLE_MATMUL_STAGGER=1`):

Initial cores 1:
Initial cores 8:
Initial cores 32:
Enable dummy loops tests
IMPORTANT EDIT: I think I may have messed up my implementation of these dummy loops. Each core acts independently of the others, so apart from some basic overhead of arithmetic on RISC, the cores may all be starting at roughly the same time, with the "staggered" cores turning off near the end of ramp-up. During ramp-down, this should be fine as-is, since all cores start off doing work and slowly drop off. To properly implement the ramp-up, we need to do some syncing between cores. We can do this by using simple mcasting:
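To make the distinction concrete, here is a small standalone simulation (mine, not tt-metal code) contrasting the buggy independent start with a gated start, where each wave of cores waits for a release signal (e.g. a coordinator core mcasting a semaphore increment):

```python
def start_steps_independent(num_cores):
    # Buggy behavior described above: every core begins compute
    # immediately, so current jumps to full draw at step 0.
    return [0] * num_cores

def start_steps_synced(num_cores, initial_cores, ramp_multiple):
    # Intended behavior: wave k is released only after a sync signal,
    # so the cumulative active-core count grows as
    # initial_cores * ramp_multiple**k. Assumes ramp_multiple >= 2.
    steps, released, target, step = {}, 0, initial_cores, 0
    while released < num_cores:
        new = min(target, num_cores) - released
        for core in range(released, released + new):
            steps[core] = step
        released += new
        target *= ramp_multiple
        step += 1
    return [steps[c] for c in range(num_cores)]

# 64 cores, initial cores 8, ramp multiple 2:
# waves of 8, 8, 16, 32 cores join at steps 0..3
sync = start_steps_synced(64, 8, 2)
print([sync.count(s) for s in range(4)])  # [8, 8, 16, 32]
```

In the independent version every core's start step is 0, which matches the symptom described above: the "ramp" only shows up on the way down.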
Single chip tests
Same as before, except I turn on `enable_dummy_loops` and turn off `enable_ramp`:

Baseline:
Initial cores 1:
Initial cores 8:
Multichip tests
The intent here is to see whether voltage differs when running multi-chip workloads vs. a single chip. PMON by default captures from chip 0, which happens to be the chip I am running my single-chip tests on. If you run single-chip tests on another chip, you need to switch which device PMON captures from.
Same environment setup as the single-chip tests, but switch the pytest option to the 8-chip one:
Baseline:
Initial cores 1:
Initial cores 8: