Open rfurko-tt opened 18 hours ago
@davorchap we are not convinced that it is related to di/dt. I would expect di/dt to occur anyway even with a blocking EnqueueProgram. So this is a pretty annoying issue. @tt-asaigal feel free to ping Roman if you need any help reproducing this issue.
In the shared branch we forced matmul to use subblocks (1, 1) to minimize the possibility of di/dt. It didn't change the frequency of freezes.
I took a look at the branch, and it's fairly out of sync with latest main. A fairly non-deterministic hang was exposed by this commit pushed on Nov 8, for which a workaround went in yesterday. The branch has the hanging commit but not the workaround. Would it be possible to rebase on latest main and try the workload again?
Describe the bug
Running training with GPT-2S results in frequent freezes after ~300k samples. Restarting training also requires running:
tt-smi -r 0
Switching the blocking argument from false to true removes the issue: tt::tt_metal::EnqueueProgram(queue, program, false);
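To make the host-side difference concrete, here is a minimal sketch of what the blocking flag changes. It is a toy model only: MockQueue, enqueue, finish, in_flight and max_in_flight are hypothetical stand-ins written for this illustration, not the tt_metal API, and the model says nothing about the root cause of the freeze. It only shows that with blocking = true the host never has more than one program outstanding, while with blocking = false work accumulates in the queue.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for a device command queue (NOT tt_metal::CommandQueue).
class MockQueue {
public:
    // Same shape as EnqueueProgram(queue, program, blocking): with
    // blocking == true the call returns only after the program has run.
    void enqueue(std::function<void()> program, bool blocking) {
        pending_.push_back(std::move(program));
        if (blocking) {
            finish();
        }
    }

    // Drain the queue in submission order.
    void finish() {
        while (!pending_.empty()) {
            std::function<void()> program = std::move(pending_.front());
            pending_.pop_front();
            program();
        }
    }

    // Number of programs submitted but not yet executed.
    std::size_t in_flight() const { return pending_.size(); }

private:
    std::deque<std::function<void()>> pending_;
};

// Submit `steps` trivial programs and report the largest number that were
// outstanding at any point after a submission returned.
std::size_t max_in_flight(std::size_t steps, bool blocking) {
    MockQueue queue;
    std::size_t completed = 0;
    std::size_t peak = 0;
    for (std::size_t i = 0; i < steps; ++i) {
        queue.enqueue([&completed] { ++completed; }, blocking);
        peak = std::max(peak, queue.in_flight());
    }
    queue.finish();  // drain whatever is still pending
    return peak;
}
```

In this model max_in_flight(4, true) stays at zero outstanding programs and max_in_flight(4, false) lets all four pile up before the final finish(). That is consistent with the freeze depending on how much work is queued ahead of the device, but it does not prove it.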
To Reproduce
Expected behavior
No freezes or crashes.
Screenshots One thread callstack here: All other threads callstack:
Please complete the following environment information: