mo-tenstorrent closed this issue 55 minutes ago.
I have tried moving both the mailbox and the FW start address by 4096 each, and the issue followed.
Any chance this could be fixed by https://github.com/tenstorrent/tt-metal/pull/15335 ? Maybe we were accidentally grabbing data from the wrong cores.
That, together with inconsistent use of the HAL and device versions of `get_dev_addr<profiler_msg_t *>`,
was the root cause. Cleaning all of that up and using the device version everywhere fixed the issue.
Essentially, some parts of the profiler code were looking at the active eth cores' profiler buffer address for an idle eth core.
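A minimal sketch of that failure mode. The address constants and the `profiler_addr` helper are hypothetical, not the real tt-metal HAL API; the point is only that active and idle eth cores keep the profiler buffer at different L1 offsets, so the lookup must be keyed on the core type actually being read:

```python
# Illustrative offsets only -- not the real L1 layout.
ACTIVE_ETH_PROFILER_ADDR = 0x2000
IDLE_ETH_PROFILER_ADDR = 0x3000

def profiler_addr(core_type: str) -> int:
    """Return the profiler buffer address for the given eth core type."""
    return {
        "active_eth": ACTIVE_ETH_PROFILER_ADDR,
        "idle_eth": IDLE_ETH_PROFILER_ADDR,
    }[core_type]

# The bug: host code asked for the active-eth address while reading an
# idle-eth core, so it dereferenced a different (wrong) L1 region.
assert profiler_addr("active_eth") != profiler_addr("idle_eth")
```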
Moving this to the profiler board.
This was an issue with host profiler code and how it dealt with idle_eth. The fix for this will come as part of https://github.com/tenstorrent/tt-metal/issues/10234
The models team observed that profiling eth dispatch cores caused segfaults.
The segfaults are caused by reading a bad DRAM buffer index from the profiler L1 control buffer.
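A hedged sketch of why a corrupted index crashes the host, with the bounds check that would turn the crash into a diagnosable error. The function and field names are illustrative, not the actual tt-metal code:

```python
def read_profiler_dram(dram_buffers, l1_control_buffer):
    """Fetch the DRAM profiler buffer selected by the device.

    The index comes straight from the device's L1 control buffer; if
    that buffer is corrupted, the index can be garbage.
    """
    idx = l1_control_buffer["dram_buffer_index"]
    # Without this check, a corrupted idx indexes past the end of
    # dram_buffers -- the out-of-bounds access behind the segfault.
    if not 0 <= idx < len(dram_buffers):
        raise RuntimeError(f"corrupted profiler control buffer: index {idx}")
    return dram_buffers[idx]
```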
It turns out the profiler control buffer in the mailbox is getting corrupted in `cq_prefetch` and `vc_eth_tunneler`. This was verified by turning device-side profiling fully off, initializing the buffer from host, and reading it back at the end of an eth dispatch run.
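The verification step above can be sketched as a write-pattern/read-back diff. This is a simulation of the idea, not the actual host code; `find_corruption` and the fake dispatch step are hypothetical:

```python
def find_corruption(expected, readback):
    """Return word indices where the read-back control buffer differs
    from the pattern the host wrote before the run."""
    return [i for i, (a, b) in enumerate(zip(expected, readback)) if a != b]

# Simulated flow: host seeds the buffer with a known pattern, the run
# (here faked) scribbles over part of it, and the diff pinpoints
# exactly which words were corrupted.
pattern = list(range(8))
after_run = pattern.copy()
after_run[3] = 0xDEAD  # pretend a dispatch kernel overwrote word 3
assert find_corruption(pattern, after_run) == [3]
```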
On core (2,6) of device 3 we get the following read-back:
According to watcher, that core is running `cq_prefetch`.
However, the buffer should read the following:
Repro steps (branch `15530_profiler_buffer_corruption`):

```sh
./build_metal.sh -p
export TT_METAL_DEVICE_PROFILER=1
export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml
export TT_METAL_DEVICE_PROFILER_DISPATCH=1
pytest tests/ttnn/tracy/test_profiler_sync.py::test_all_devices
```