tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Profiler mailbox buffers are getting corrupted in idle_eth dispatch kernels #15330

Closed mo-tenstorrent closed 55 minutes ago

mo-tenstorrent commented 22 hours ago

The models team observed that profiling eth dispatch cores caused segfaults.

The segfaults are caused by reading a bad DRAM buffer index from the profiler L1 control buffer.

It turns out that the profiler control buffer in the mailbox is getting corrupted in cq_prefetch and vc_eth_tunneler.

This was verified by turning device-side profiling fully off, initializing the buffer from the host, and reading it back at the end of an eth dispatch run.

On core 2,6 of device 3, we get the following readback:

device id:3, x:2, y:6, i:0, d:160
device id:3, x:2, y:6, i:1, d:0
device id:3, x:2, y:6, i:2, d:0
device id:3, x:2, y:6, i:3, d:0
device id:3, x:2, y:6, i:4, d:0
device id:3, x:2, y:6, i:5, d:0
device id:3, x:2, y:6, i:6, d:0
device id:3, x:2, y:6, i:7, d:0
device id:3, x:2, y:6, i:8, d:4
device id:3, x:2, y:6, i:9, d:16
device id:3, x:2, y:6, i:10, d:32
device id:3, x:2, y:6, i:11, d:32
device id:3, x:2, y:6, i:12, d:65799
device id:3, x:2, y:6, i:13, d:1
device id:3, x:2, y:6, i:14, d:98800
device id:3, x:2, y:6, i:15, d:0
device id:3, x:2, y:6, i:16, d:8
device id:3, x:2, y:6, i:17, d:0
device id:3, x:2, y:6, i:18, d:0
device id:3, x:2, y:6, i:19, d:0
device id:3, x:2, y:6, i:20, d:1001045187
device id:3, x:2, y:6, i:21, d:3171794203
device id:3, x:2, y:6, i:22, d:3137157588
device id:3, x:2, y:6, i:23, d:3169631398
device id:3, x:2, y:6, i:24, d:5
device id:3, x:2, y:6, i:25, d:16
device id:3, x:2, y:6, i:26, d:32
device id:3, x:2, y:6, i:27, d:32
device id:3, x:2, y:6, i:28, d:3
device id:3, x:2, y:6, i:29, d:0
device id:3, x:2, y:6, i:30, d:26880016
device id:3, x:2, y:6, i:31, d:26880016

According to watcher, that core is running cq_prefetch.

However, the buffer should read as follows:

i:0, d:0
i:1, d:0
i:2, d:0
i:3, d:0
i:4, d:0
i:5, d:0
i:6, d:0
i:7, d:0
i:8, d:0
i:9, d:0
i:10, d:0
i:11, d:0
i:12, d:32
i:13, d:0
i:14, d:0
i:15, d:0
i:16, d:25
i:17, d:7
i:18, d:0
i:19, d:0
i:20, d:0
i:21, d:0
i:22, d:0
i:23, d:0
i:24, d:0
i:25, d:0
i:26, d:0
i:27, d:0
i:28, d:0
i:29, d:0
i:30, d:0
i:31, d:0
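
The comparison between the two listings above can be sketched in a few lines of Python. This is only an illustrative helper, not part of tt-metal: the function name `diff_control_buffer` is made up for this example, and the values are copied by hand from the readback and the expected contents shown above.

```python
# Values copied from the readback above (device 3, core x=2, y=6), indices 0..31.
observed = [
    160, 0, 0, 0, 0, 0, 0, 0,
    4, 16, 32, 32, 65799, 1, 98800, 0,
    8, 0, 0, 0, 1001045187, 3171794203, 3137157588, 3169631398,
    5, 16, 32, 32, 3, 0, 26880016, 26880016,
]

# Expected contents from the listing above: all zeros except three entries.
expected = [0] * 32
expected[12] = 32
expected[16] = 25
expected[17] = 7


def diff_control_buffer(observed, expected):
    """Return (index, observed, expected) for every word that differs.

    Hypothetical helper for this example only.
    """
    if len(observed) != len(expected):
        raise ValueError("buffers must have the same length")
    return [
        (i, o, e)
        for i, (o, e) in enumerate(zip(observed, expected))
        if o != e
    ]


mismatches = diff_control_buffer(observed, expected)
for i, o, e in mismatches:
    print(f"i:{i}, observed:{o}, expected:{e}")
print(f"{len(mismatches)} of {len(observed)} words differ")
```

For the data above, 21 of the 32 words differ, including all three entries that were initialized to non-zero values (i:12, i:16 and i:17).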

repro steps:

  1. check out 15530_profiler_buffer_corruption
  2. ./build_metal.sh -p
  3. export TT_METAL_DEVICE_PROFILER=1
     export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml
     export TT_METAL_DEVICE_PROFILER_DISPATCH=1
  4. pytest tests/ttnn/tracy/test_profiler_sync.py::test_all_devices

mo-tenstorrent commented 22 hours ago

I have tried moving both the mailbox and the FW start, each by 4096, and the issue followed.

jbaumanTT commented 19 hours ago

Any chance this could be fixed by https://github.com/tenstorrent/tt-metal/pull/15335? Maybe we were accidentally grabbing data from the wrong cores.

mo-tenstorrent commented 18 hours ago

That, together with inconsistent use of the hal and device versions of get_dev_addr<profiler_msg_t *>, was the root cause. Cleaning all that up and using the device version everywhere fixed the issue.

Essentially, some parts of the profiler code were looking at the active eths' profiler buffer address for an idle eth.
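
To illustrate the failure mode in isolation, here is a toy Python model, not the actual tt-metal host code. Everything in it is an assumption made for the example: the addresses in `PROFILER_ADDR`, the helper names, and the dict used to simulate L1. It only shows why resolving the profiler buffer address with the wrong core type returns unrelated data, and why routing every caller through one lookup avoids the mismatch.

```python
# Hypothetical per-core-type profiler buffer addresses (illustrative values only,
# not the real tt-metal memory map).
PROFILER_ADDR = {
    "active_eth": 0x1000,
    "idle_eth": 0x2000,
}

WORD_BYTES = 4  # control buffer entries are modelled as 32-bit words


def profiler_addr(core_type):
    """Single lookup meant to be shared by every caller (hypothetical helper)."""
    return PROFILER_ADDR[core_type]


def read_control_word(l1, core_type, index):
    """Read one control word from a simulated L1, modelled as {address: value}."""
    return l1.get(profiler_addr(core_type) + WORD_BYTES * index, 0)


# Simulated L1 of an idle eth core: the profiler control buffer lives at the
# idle_eth address, while the active_eth address holds unrelated kernel data.
idle_eth_l1 = {
    PROFILER_ADDR["idle_eth"] + WORD_BYTES * 12: 32,       # valid control word
    PROFILER_ADDR["active_eth"] + WORD_BYTES * 12: 65799,  # unrelated data
}

good = read_control_word(idle_eth_l1, "idle_eth", 12)
bad = read_control_word(idle_eth_l1, "active_eth", 12)  # wrong core type
print(f"correct lookup: {good}, wrong lookup: {bad}")
```

In this model the wrong lookup returns whatever happens to sit at the active eth address, which is the kind of value that would then be misread as a DRAM buffer index.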

Moving this to the profiler board.

mo-tenstorrent commented 55 minutes ago

This was an issue with host profiler code and how it dealt with idle_eth. The fix for this will come as part of https://github.com/tenstorrent/tt-metal/issues/10234