tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Tracy segmentation faults when profiling UNet with `TT_METAL_DEVICE_PROFILER` #12768

Open esmalTT opened 1 month ago

esmalTT commented 1 month ago

Summary

The Tracy GUI segfaults when `TT_METAL_DEVICE_PROFILER=1` is set.

Running Tracy with the `-r` option instead also crashes the script:

AssertionError: Device data mismatch: Expected 174 but received 430 ops on device 0. Device is showing op ID 170 when host is showing op ID 86

Steps to Reproduce

Check out `esmal/enable-trace-2cq` and run the following:

TT_METAL_DEVICE_PROFILER=1 TT_METAL_DEVICE_PROFILER_DISPATCH=1 TT_METAL_PROFILER_SYNC=1 \
python -m tracy -l -m pytest models/experimental/functional_unet/tests/test_unet_trace.py::test_unet_trace_2cq

esmalTT commented 1 month ago

Some additional debugging: it is possible to get things working by adding some additional `ttnn.DumpDeviceProfile` calls and running:

TT_METAL_DEVICE_PROFILER=1 TT_METAL_PROFILER_SYNC=1 python -m tracy -l -m pytest models/experimental/functional_unet/tests/test_unet_trace.py::test_unet_trace

However, adding `TT_METAL_DEVICE_PROFILER_DISPATCH` causes the Tracy GUI to crash. The crash is preceded by this warning:

WARNING  | Profiler DRAM buffers were full, markers were dropped! device 0, worker core 1, 11, Risc NCRISC,  bufferEndIndex = 15872
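
For anyone trying the workaround, here is a rough sketch of what "adding some additional dump calls" could look like in a test loop. It is only an illustration: `run_with_periodic_flush`, `steps`, `flush`, and `every` are hypothetical names, and the real dump call is passed in as a callable rather than written out, since its exact arguments are not shown in this thread. Whether more frequent dumps are what keeps the profiler buffers from filling is an assumption, not something confirmed here.

```python
from typing import Callable, Iterable


def run_with_periodic_flush(
    steps: Iterable[Callable[[], None]],
    flush: Callable[[], None],
    every: int = 1,
) -> int:
    """Run each step, calling `flush` after every `every` steps.

    `flush` is a placeholder for whatever drains the device profiler
    (in the test, presumably a wrapper around the dump call mentioned above).
    Returns the number of times `flush` was called.
    """
    if every < 1:
        raise ValueError("every must be >= 1")

    flushes = 0
    pending = 0  # steps executed since the last flush
    for step in steps:
        step()
        pending += 1
        if pending == every:
            flush()
            flushes += 1
            pending = 0

    # Drain the final partial batch so its markers are not left behind.
    if pending:
        flush()
        flushes += 1
    return flushes
```

Lowering `every` trades extra host-side dump overhead for less data accumulating between dumps, so it may be worth starting at 1 and increasing it only once the warning stops appearing.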