skhorasganiTT opened 1 month ago
aus-glx-06 (n150 fw bundle version 80.10.1.0) appears to be particularly slow and pauses more frequently compared to aus-glx-09 (n150 fw bundle version 80.10.0.0). I am not sure whether the difference in n150 firmware is relevant.
Out of curiosity (to rule out the EDM causing the problem), can you try changing SWITCH_INTERVAL in ttnn/cpp/ttnn/operations/ccl/kernels/edm/erisc_datamover.cpp to a much smaller number (maybe a couple million)? This would rule out the following scenario: one end starts link training on a downed link while the other end is stuck in kernel space; the initiator times out and context-switches back; the other end then context-switches into link training; and the two ends keep alternating until both happen to land in the training window at the same time.
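To make the role of SWITCH_INTERVAL concrete, here is a toy Python model of the poll-then-yield pattern described above. This is purely illustrative: the real constant and loop live in erisc_datamover.cpp and run on the erisc core, and all names below (`poll_until_switch`, `has_work`) are hypothetical.

```python
# Toy model of the EDM context-switch pattern (illustrative only; the real
# SWITCH_INTERVAL is a C++ constant in erisc_datamover.cpp and the real
# loop runs on the ethernet RISC core, not on the host).

def poll_until_switch(switch_interval, has_work):
    """Spin on has_work(); return how many idle polls ran before either
    work arrived or the core yielded (context-switched) to base firmware."""
    idle_polls = 0
    while idle_polls < switch_interval:
        if has_work():
            return idle_polls  # work arrived: keep servicing, no switch
        idle_polls += 1
    return idle_polls  # hit the interval: yield so link training can run

# With no work arriving, the worst-case time spent stuck in kernel space
# before yielding is proportional to the interval, which is why shrinking
# it bounds how long each end can miss the other's training window.
stuck_large = poll_until_switch(1_000_000, lambda: False)
stuck_small = poll_until_switch(1_000, lambda: False)
```

A smaller interval trades a little polling efficiency for a much tighter bound on how long either end can stay unresponsive to link training.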
Falcon7b is purely data parallel (no CCL ops), so we can rule that out as a problem.
@skhorasganiTT does the full demo also show this?
@uaydonat I have observed this on the full demo as well for aus-glx-06, while on aus-glx-09 it is not very noticeable (similar to the previous comment).
Next: try with trace to figure out whether it is caused by the CPU.
When running the Falcon7B MLP (input-host-to-device + forward-pass + output-device-to-host) on galaxy in a loop of 10000 iterations, the loop occasionally pauses at random points (i.e. it freezes but does not hang, and resumes after several seconds to a couple of minutes). This happened roughly 6-7 times over the loop, which totalled ~10 min.
Branch: skhorasgani/falcon7b_ttnn_multidev_testglx
Command:
pytest models/demos/falcon7b_common/tests/test_falcon_mlp.py::test_FalconMLP_inference[wormhole_b0-True-BFLOAT16-L1-decode_batch32-True-32chipTG]
Machine: aus-glx-06
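To localize where the pauses occur, a small timing wrapper around the loop could record which iterations stall and for how long. This is a hypothetical sketch, not part of the test above: `run_iteration` stands in for one host-to-device + forward-pass + device-to-host step, and the names and threshold are assumptions.

```python
# Hypothetical pause detector for the 10000-iteration MLP loop.
# Times each iteration and reports outliers slower than a threshold,
# so pauses can be correlated with iteration indices and wall-clock time.
import time

def find_pauses(run_iteration, iterations=10_000, threshold_s=1.0):
    """Run the loop and return (iteration_index, duration_s) for every
    iteration that took longer than threshold_s."""
    pauses = []
    for i in range(iterations):
        start = time.perf_counter()
        run_iteration()
        elapsed = time.perf_counter() - start
        if elapsed > threshold_s:
            pauses.append((i, elapsed))
    return pauses
```

Dumping the pause timestamps alongside host-side profiling (or the trace run suggested above) would help distinguish a CPU-side stall from a link-level one.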