tenstorrent / tt-umd

User-Mode Driver for Tenstorrent hardware
Apache License 2.0
Array overrun in tt_SiliconDevice::wait_for_non_mmio_flush() #20

Closed pjanevskiTT closed 2 months ago

pjanevskiTT commented 2 months ago

Describe the bug

The call read_device_memory(erisc_q_ptrs.data(), remote_transfer_ethernet_cores.at(chip_id)[i], ... indexes past the end of remote_transfer_ethernet_cores.at(chip_id).

To Reproduce

Change the [i] to .at(i) to enable range checking. Alternatively, enable ASan, which catches these types of errors.

Steps to reproduce the behavior:

./build_metal.sh && ./build/test/tt_metal/unit_tests_fast_dispatch --gtest_filter="*TestWatcher*"

Additional context

Reproduced on nebula_x1 (bgd-lab-20, single card reservation - board id 0).

With some logging I added: when remote_transfer_ethernet_cores is populated, we add two cores:

2024-07-12 09:37:23.209 | WARNING  | SiliconDriver   - remote_transfer_ethernet_cores[0][0] = (9,6)
2024-07-12 09:37:23.209 | WARNING  | SiliconDriver   - remote_transfer_ethernet_cores[0][1] = (1,6)

When it is accessed, we ask for a third one (index 2), even though only two were added:

2024-07-12 09:37:23.209 | WARNING  | SiliconDriver   - NUM_ETH_CORES_FOR_NON_MMIO_TRANSFERS = 6
2024-07-12 09:37:23.210 | WARNING  | SiliconDriver   - Got chip 0 and getting core 0
2024-07-12 09:37:23.210 | WARNING  | SiliconDriver   - Got chip 0 and getting core 1
2024-07-12 09:37:23.210 | WARNING  | SiliconDriver   - Got chip 0 and getting core 2
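
The failure mode in the logs above can be sketched in isolation. This is a minimal illustration, not the driver code: Core, kNumEthCores, overruns_with_fixed_count and visit_bounded are hypothetical names, and bounding the loop by the vector's size is only one possible guard, which may differ from the actual fix.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the driver's core coordinate type.
using Core = std::pair<int, int>;

// Hypothetical constant mirroring the logged NUM_ETH_CORES_FOR_NON_MMIO_TRANSFERS = 6.
constexpr std::size_t kNumEthCores = 6;

using CoreMap = std::unordered_map<int, std::vector<Core>>;

// Loops up to the fixed constant, as the logs suggest the driver does.
// With .at(i) the overrun surfaces as std::out_of_range; with operator[]
// the same access would be undefined behaviour.
bool overruns_with_fixed_count(const CoreMap& cores, int chip_id) {
    const std::vector<Core>& chip_cores = cores.at(chip_id);
    try {
        for (std::size_t i = 0; i < kNumEthCores; ++i) {
            (void)chip_cores.at(i);
        }
    } catch (const std::out_of_range&) {
        return true;  // asked for a core that was never added
    }
    return false;
}

// One possible guard (an assumption, not the confirmed fix): bound the loop
// by the number of cores actually stored for this chip.
std::size_t visit_bounded(const CoreMap& cores, int chip_id) {
    const std::vector<Core>& chip_cores = cores.at(chip_id);
    std::size_t visited = 0;
    for (std::size_t i = 0; i < chip_cores.size(); ++i) {
        (void)chip_cores.at(i);
        ++visited;
    }
    assert(visited == chip_cores.size());
    return visited;
}
```

With two cores stored for chip 0, as in the logs, the fixed-count loop throws on index 2, while the size-bounded loop visits exactly two entries.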

Note: this is a copy of https://github.com/tenstorrent/tt-metal/issues/10200, because Ivan couldn't file the issue here since he doesn't have permissions at the moment.

pjanevskiTT commented 2 months ago

Not assigning anyone. @abhullar-tt, maybe you can help with this?

abhullar-tt commented 2 months ago

The Metal issue has been assigned

tt-dma commented 2 months ago

Should be fixed with https://github.com/tenstorrent/tt-umd/pull/22