ttmtrajkovic opened 3 months ago
Findings so far for FF1 test on N150:
Sharded FF1 with gelu on bgd-lab-t3005 N150, fw 80.10.3
The workload fails only on HiFi2:
- gelu is irrelevant to the failure
- all other possibilities (LoFi, HiFi3, HiFi4, etc.), with or without gelu, pass millions of iterations regularly
ARC always seems dead when the failure occurs. In some percentage of the cases, ARC dies but the workload continues; in those cases, once the workload finishes, a cold reboot is needed to bring the chip back. Otherwise, ARC going into a bad state coincides with the hang.
Branch `sjovic/rebased_didt_tests_workload_exp` contains a hack to reduce the number of instructions used for matmul. The tiny-tile feature is used to do so, and there are two options:
- `TT_ENABLE_HALF_LOFI` env var: behaves as if in1 has a 32x16 tile
- `TT_ENABLE_QUARTER_LOFI` env var: behaves as if in0 has a 16x32 tile and in1 a 32x16 tile

These hacks, together with the valid precision modes we have, were used to understand how the length of the workload and the compute-vs-DM ratio impact the test cases. Testing was performed on bgd-lab-t3002 (board 3) and sjc-snva-t3006 (all boards), since these machines showed the best hang repro rates.
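The effect of the two env-var hacks on the effective tile shapes can be sketched as follows. This is a hypothetical illustration (the helper `effective_tile_dims` is not from the repo); only the env-var names and the tile shapes come from the description above, and the baseline 32x32 tile is an assumption.

```python
import os

def effective_tile_dims():
    """Hypothetical sketch: map the two env-var hacks to the effective
    (in0, in1) tile shapes described above. Baseline 32x32 is assumed."""
    if os.environ.get("TT_ENABLE_QUARTER_LOFI"):
        # in0 shrunk to 16x32 and in1 to 32x16 -> roughly 1/4 of the matmul work
        return {"in0": (16, 32), "in1": (32, 16)}
    if os.environ.get("TT_ENABLE_HALF_LOFI"):
        # only in1 shrunk to 32x16 -> roughly 1/2 of the matmul work
        return {"in0": (32, 32), "in1": (32, 16)}
    return {"in0": (32, 32), "in1": (32, 32)}
```

Together with LoFi vs HiFi fidelity settings, these two switches give a ladder of matmul instruction counts to sweep against the hang rate.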
LM head:
The conclusion is: a bigger workload results in bigger droops; once we get down to 1/4 of the workload, the droops become small enough to avoid hangs. Nominal voltage in the plots is always the same, around 920 mV, since we are still DM bound.
FF1 without GELU:
The conclusion: a bigger workload results in lower voltage minimums. Since this example throttles, nominal voltage also decreases with a heavier workload, so the difference between nominal and minimal voltage stays similar across all cases, but the absolute voltage minimum drops regardless.
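The way the LM head and FF1 traces are being compared can be made concrete with a small sketch. This is a hypothetical helper (not from the repo), assuming a list of voltage samples in mV where the top rail level is taken as the nominal voltage:

```python
def droop_stats(trace_mv):
    """Hypothetical sketch: summarize a voltage trace (mV samples) the way
    the plots above are read -- nominal level, absolute minimum, and droop."""
    nominal = max(trace_mv)  # assumption: highest sample ~ nominal rail level
    minimum = min(trace_mv)
    return {"nominal_mv": nominal,
            "min_mv": minimum,
            "droop_mv": nominal - minimum}
```

This separates the two failure-relevant quantities: under throttling (FF1), both `nominal_mv` and `min_mv` drop together so `droop_mv` stays similar, while in the DM-bound LM head case `nominal_mv` is pinned and only `droop_mv` grows with workload size.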
Next steps: understand whether the lower workload passes due to higher misalignment between cores, since we know from the stagger experiments that a hang is more likely when cores are all in sync.
@pavlepopovic noticed the droops get larger when the pauses between compute bursts are longer. This observation is not aligned with the conclusions from this issue, so I reevaluated them.
The plots show cases up to HiFi2. I recaptured them for all fidelities, and it turns out that for both LM head and FF1 the minimal voltage captured is not that different across fidelities; it varies by +/- 10 mV without any pattern.
What is also weird is why, in both cases, we saw the 1/4 LoFi variant passing.
For LM head, 1/2 and 1/4 LoFi impact voltage and the droops are visibly smaller; however, since an LM head block lasts only ~2200 cycles, reducing the compute length might shrink the window in which all or most cores work at the same time. Both the baseline case and 1/4 LoFi have 78% of cores starting blocks within 300 cycles, while the biggest difference between two cores starting a specific block is 60k cycles. Shorter compute means less chance of a large number of cores working at the same time.
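The alignment numbers quoted above (fraction of cores starting a block within 300 cycles, worst-case skew between two cores) can be computed with a sketch like this, assuming per-core block-start cycle counts are available from profiling; the helper name is hypothetical:

```python
def block_alignment_stats(start_cycles, window=300):
    """Hypothetical sketch: given the cycle at which each core starts a
    specific block, report the fraction of cores starting within `window`
    cycles of the earliest starter, and the worst-case core-to-core skew."""
    earliest = min(start_cycles)
    in_window = sum(1 for c in start_cycles if c - earliest <= window)
    return {"frac_in_window": in_window / len(start_cycles),
            "max_skew": max(start_cycles) - earliest}
```

Comparing this metric between the baseline and the 1/4 LoFi run would show whether the reduced workload actually changes how many cores overlap, or whether (as the 78% figure suggests) the start alignment is the same and only the overlap duration shrinks.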
For FF1, 1/2 and 1/4 LoFi make the workload DM bound and throttling is turned off, so the minimal voltage is higher in any case. The only thing that cannot be explained is why we observed the 1/4 workload passing while all the others failed. This was tested before the 80.10.4.1 fw, which addressed current thresholds in FF1, so it would need to be repeated.
Analyze failing matmul workloads and gather info on:
Create separate test cases which will: