Test matmuls with different compute vs data movement ratio

ttmtrajkovic commented 3 months ago

Analyze failing matmul workloads and gather info on:

length of compute
length of data movement
what is the workload bound with, compute or data movement? Should be compute

Create separate test cases which will:

extend the length of compute to change test compute bound workload
reduce the amount of compute while keeping DM length the same

s-jovic commented 2 months ago

Findings so far for FF1 test on N150:

Sharded FF1 with gelu on bgd-lab-t3005 N150, fw 80.10.3

The workload fails only on HiFi2:

gelu is irellevant to the failure

all other possibilities (LoFi, HiFi3, HiFi4 etc, with or without gelu pass millions of iterations regularly)

Arc always seems dead when the failure occurs. In some percentage of the cases, arc dies, and the workload continues. In these cases, when the workload finishes, cold reboot is needed to bring back the chip. Otherwise, arc going weird corresponds to the hang

s-jovic commented 1 month ago

Branch sjovic/rebased_didt_tests_workload_exp contains a hack to reduce the number of instructions used for matmul. Tiny-tile feature is used to do so, and there are two options:

half-lofi - we do approx half the instructions we need for LoFi matmul; this is enabled with TT_ENABLE_HALF_LOFI env var, and behaves as if we have 32x16 tile on in1
quarter-lofi - we do approx 1/4 of the instructions needed for LoFi matmul; this is enabled with TT_ENABLE_QUARTER_LOFI env var and behaves as if we have in0 tile 16x32 and in1 tile 32x16

These hacks, together with the valid precision modes we have were used to understand how length of the workload and compute vs dm ration impacts the test cases. Testing was performed on bgd-lab-t3002 (board 3), and sjc-snva-t3006 (all boards), since these machines showed best hang repro rates.

LM head:

hangs with 1/2 lofi, lofi, and all hifi precisions; passes only with 1/4 lofi

The conclusion is - bigger workload results in bigger droops; once we get to 1/4 of the workload, the droops get small enough to avoid hangs. Nominal voltage in the plots is always the same, around 920mV, since we are still dm bound.

FF1 without GELU:

hangs with 1/2 lofi, lofi, and all hifi precisions; passes only with 1/4 lofi

The conclusion - bigger workload results in lower voltage minimums. Since we have throttling with this example, nominal voltage also decreases with heavier workload, so the diff between nominal and minimal voltage stays similar among all cases, but the absolute voltage minimum drops in any case.

Next steps: Understand whether lower workload passes due to higher misalignment between cores, since we know from stagger experiments that there is a bigger chance for a hang if cores are all in sync.

s-jovic commented 1 month ago

@pavlepopovic noticed the droops get larger when the pause between compute bursts are bigger. This observation is not aligned with the conclusions from this issue so I reevaluated them.

The plots show cases up to HiFi2 - I recaptured them for all fidelities, and it turns out for both LM head and FF1 the minimal voltage captured is not that different for different fidelities, goes +/- 10 mV without any pattern.

What is also weird is why in both cases we saw 1/4 LoFi passing.

For LM head, 1/2 and 1/4 LoFi impact volatge and there are visibly smaller droops, however since LM head block lasts only ~2200 cycles, reducing the length of compute might impact the window in which all or most cores work at the same time. Both baseline case and 1/4 LoFi have 78% of cores starting blocks within 300 cycles, while the biggest difference between two cores starting specific block is 60k cycles. Shorter compute means less chance for bigger number of cores working at the same time.

For FF1, 1/2 and 1/4 LoFi makes the workload DM bound and throttling is turned off, so minimal voltage is in any case bigger. The only thing that cannot be explained is why we observed 1/4 workload passing, while all the others failed. This was tested before 80.10.4.1 fw, which addressed current thresholds in FF1, so it would need to be repeated.

tenstorrent / tt-metal

Test matmuls with different compute vs data movement ratio #11305