tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Random pauses in loop of Falcon7B MLP on galaxy #11165

Open skhorasganiTT opened 1 month ago

skhorasganiTT commented 1 month ago

When running the Falcon7B MLP (input host-to-device + forward pass + output device-to-host) on galaxy in a loop of 10000 iterations, the loop occasionally pauses at random points: it freezes for several seconds to a couple of minutes and then resumes, rather than hanging permanently. This happened roughly 6-7 times over the loop, which took ~10 min in total.

Branch: skhorasgani/falcon7b_ttnn_multidev_testglx
Command: pytest models/demos/falcon7b_common/tests/test_falcon_mlp.py::test_FalconMLP_inference[wormhole_b0-True-BFLOAT16-L1-decode_batch32-True-32chipTG]
Machine: aus-glx-06
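
To pin down when the pauses occur, one option is to time every iteration and flag the outliers. The following is only a minimal sketch, not part of the test: `run_iteration` is a hypothetical stand-in for one host-to-device + forward + device-to-host step, and `time_loop`, `flag_pauses`, and the 10x-median threshold are illustrative choices.

```python
import statistics
import time


def time_loop(run_iteration, num_iters):
    """Call run_iteration num_iters times and return per-iteration wall times in seconds."""
    durations = []
    for _ in range(num_iters):
        start = time.perf_counter()
        run_iteration()  # hypothetical: one MLP loop iteration
        durations.append(time.perf_counter() - start)
    return durations


def flag_pauses(durations, factor=10.0):
    """Return (index, seconds) for iterations slower than factor * median."""
    median = statistics.median(durations)
    return [(i, d) for i, d in enumerate(durations) if d > factor * median]
```

Logging the flagged indices together with timestamps would show whether the pauses are periodic or cluster around particular iterations.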

skhorasganiTT commented 1 month ago

aus-glx-06 (n150 fw bundle version 80.10.1.0) appears to be particularly slow and pauses more frequently than aus-glx-09 (n150 fw bundle version 80.10.0.0). I'm not sure whether the difference in n150 fw is relevant.

SeanNijjar commented 1 month ago

Out of curiosity (to rule out EDM as the cause), can you try changing SWITCH_INTERVAL in ttnn/cpp/ttnn/operations/ccl/kernels/edm/erisc_datamover.cpp to a much smaller number (maybe a couple million)? This would rule out the following scenario: one end starts link training on a downed link while the other end is stuck in kernel space; the initiator side times out and context switches back; the ethernet core on the other end of the link then context switches to link training; and both ends keep alternating until they both happen to be in the training window.

skhorasganiTT commented 1 month ago

Falcon7B is purely data parallel (no CCL ops), so we can rule that out as the problem.

uaydonat commented 4 weeks ago

@skhorasganiTT does the full demo also show this?

skhorasganiTT commented 4 weeks ago

@uaydonat I have observed this on the full demo as well on aus-glx-06, while on aus-glx-09 it is not very noticeable (consistent with my earlier comment).

uaydonat commented 3 weeks ago

Next: try with trace to figure out whether it is caused by the CPU.
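
A host-side check that could complement the trace run is to compare wall-clock time with process CPU time for each iteration. This is a rough sketch under assumptions: `measure`, `classify`, and the 0.5 ratio are hypothetical, and a "host-waiting" result does not by itself distinguish blocking on the device from the process being descheduled by the OS.

```python
import time


def measure(fn):
    """Return (wall_seconds, cpu_seconds) spent in one call to fn."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + system CPU time of this process
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def classify(wall, cpu, busy_ratio=0.5):
    """Label an iteration by how much of its wall time was spent on the CPU."""
    if wall <= 0:
        return "host-busy"
    return "host-busy" if cpu / wall >= busy_ratio else "host-waiting"
```

If a paused iteration shows a large wall time but little CPU time, the host process was mostly waiting rather than computing during the pause.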