tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

[Bug Report] Frequent freezes during training #15349

Open rfurko-tt opened 18 hours ago

rfurko-tt commented 18 hours ago

Describe the bug

Running training with GPT-2S results in frequent freezes after ~300k samples. Restarting training also requires running tt-smi -r 0 first.

Switching the blocking argument from false to true removes the issue. The call that currently freezes is:

tt::tt_metal::EnqueueProgram(queue, program, false);
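For anyone unfamiliar with the blocking flag, here is a minimal host-side sketch of what the workaround changes. It is a toy model only: ToyQueue, enqueue, finish and steps_done_on_return are hypothetical names made up for illustration and are not part of tt-metal. It assumes that a blocking enqueue returns only after the submitted work has completed, while a non-blocking enqueue returns immediately and lets the host run ahead.

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical stand-in for a device command queue; NOT the tt-metal API.
class ToyQueue {
public:
    ToyQueue() : worker_([this] { run(); }) {}

    ~ToyQueue() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_one();
        worker_.join();  // the worker drains remaining tasks before exiting
    }

    // Same shape as EnqueueProgram(queue, program, blocking):
    // blocking == true  -> return only after the work has completed;
    // blocking == false -> return immediately, the work completes later.
    void enqueue(std::function<void()> work, bool blocking) {
        std::packaged_task<void()> task(std::move(work));
        std::future<void> done = task.get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
        if (blocking) {
            done.wait();
        }
    }

    // Wait until everything enqueued so far has run (tasks run in FIFO order).
    void finish() { enqueue([] {}, true); }

private:
    void run() {
        for (;;) {
            std::packaged_task<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                if (tasks_.empty()) {
                    return;  // stop requested and nothing left to run
                }
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::packaged_task<void()>> tasks_;
    bool stop_ = false;
    std::thread worker_;  // declared last so it starts after the other members exist
};

// Returns how many of the n steps had completed at the moment the last
// enqueue call returned to the host.
int steps_done_on_return(bool blocking, int n) {
    std::atomic<int> done{0};
    int seen = 0;
    {
        ToyQueue queue;
        for (int i = 0; i < n; ++i) {
            queue.enqueue([&done] { ++done; }, blocking);
        }
        seen = done.load();
        queue.finish();
    }
    return seen;
}
```

With blocking set to true, steps_done_on_return always reports all n steps, because the host and the queue run in lockstep. With blocking set to false, the host can be arbitrarily far ahead of the queue until finish() is called. That changes timing and the amount of in-flight work, which may be why the switch hides the freeze without necessarily fixing its cause.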

To Reproduce

  1. Branch: rfurko/kahan_summation
  2. Run tt-train/nano-gpt example
  3. Wait for 500-2000 steps
  4. Freeze

Expected behavior

No freezes or crashes.

Screenshots

Call stack of one thread: [image]

Call stacks of all other threads: [image]


dmakoviichuk-tt commented 18 hours ago

@davorchap we are not convinced that this is related to di/dt. I would expect di/dt problems to occur anyway, even with a blocking EnqueueProgram. So this is a pretty annoying issue. @tt-asaigal feel free to ping Roman if you need any help reproducing it.

rfurko-tt commented 18 hours ago

In the shared branch we forced matmul to use subblocks (1, 1) to minimize the possibility of di/dt issues. It didn't change the frequency of freezes.

tt-asaigal commented 6 minutes ago

I took a look at the branch, and it's fairly out of sync with the latest main. A fairly non-deterministic hang was exposed by this commit pushed on Nov 8, and a workaround for it went in yesterday. The branch has the hanging commit but not the workaround. Would it be possible to rebase on the latest main and try the workload again?