Open nsmithtt opened 3 months ago
This is the 2nd request for this feature in 24 hours. @sankarmanoj-tt noted that translation currently happens on the host, and for convolution mcasts this info is passed through runtime args.
This shouldn't be hard to do. The table will be loaded/built w/ the firmware and accessible from the kernel. Implementation will vary across GS/WH/BH. There will be some (minor) performance penalty for the lookup.
Seems this issue should be tracked on the metal runtime board?
Is there a priority/timeframe for this? We currently have no resources available.
This isn't super urgent, just filed the issue for tracking. Our main use case is to be used in tandem with https://github.com/tenstorrent/tt-metal/issues/10702, i.e. to make traces more portable. I think we will be more interested in this over the course of 2-3 months down the road. Added to the metal runtime board, marked as P2 for us, @sankarmanoj-tt might have different prioritization.
Virtualizing traces/portability is a great feature.
We can also consider using "virtual NOC coords" (i.e. translated coords), since we have HW support for these. The kernel wouldn't have to do logical -> physical translation at run-time. Passing virtual coords as compile-time args can also lead to more compile-time optimizations of the kernel.
Is there any reason/preference for using run-time lookup vs. virtual NOC coords (these were used in BUDA)?
We also need to take DRAM and ETH into account, in addition to the Tensix mesh, as well as potential harvesting of those in BH.
> Virtualizing traces/portability is a great feature.
> We can also consider using "virtual NOC coords" (i.e. translated coords), since we have HW support for these. The kernel wouldn't have to do logical -> physical translation at run-time. Passing virtual coords as compile-time args can also lead to more compile-time optimizations of the kernel.
If we have HW support for translating the core coords, that would be great. So I could write `get_noc_addr(0, 0, l1_offset);` and under the hood the HW can translate this to physical coord 1-2?
> Is there any reason/preference to using run-time look up vs. virtual NOC coords (these were used in BUDA)?
I think the preference for runtime lookup was just for trace portability since the riscv binaries are already compiled and embedded in trace. If we compile/trace on one n300 then we want to be able to move it to another n300 with different harvested rows and have the same trace run.
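To make the n300 example concrete, here is a rough sketch of how a per-device row mapping could be derived from a harvesting mask. Everything in it is an assumption for illustration: `kPhysicalRows`, `RowMap`, `build_row_map`, and the mask encoding are hypothetical names, not metal APIs, and the model ignores non-Tensix rows entirely.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical number of physical rows; chosen only for this sketch.
constexpr std::size_t kPhysicalRows = 12;

// Hypothetical per-device table: logical row index -> physical row index.
struct RowMap {
    std::array<std::uint8_t, kPhysicalRows> logical_to_physical{};
    std::size_t num_logical_rows = 0;
};

// Assumed encoding: bit i of harvest_mask set means physical row i is harvested.
inline RowMap build_row_map(std::uint32_t harvest_mask) {
    assert((harvest_mask >> kPhysicalRows) == 0 && "mask has bits beyond the grid");
    RowMap map;
    for (std::size_t phys = 0; phys < kPhysicalRows; ++phys) {
        if ((harvest_mask >> phys) & 1u) {
            continue;  // skip harvested rows; logical rows stay contiguous
        }
        map.logical_to_physical[map.num_logical_rows++] =
            static_cast<std::uint8_t>(phys);
    }
    return map;
}
```

With this model, `build_row_map(1u << 2)` and `build_row_map(1u << 5)` both yield 11 logical rows, but logical row 2 lands on physical row 3 on the first device and physical row 2 on the second. A trace that only ever names logical rows would be identical for both.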
Just had a discussion on this: the plan at the moment is for metal to move to virtual (translated) coordinates and not expose physical coordinates through the API. Dispatch would use virtual coordinates. If a program is compiled differently across devices for any reason (imagine passing the device id as a compile-time argument), the trace could still be uniform but the kernel binary load would be different (and runtime would have to use a max_size at dispatch time to share the trace).
@tt-asaigal @cfjchu
@pgkeller, are we going to use HW feature "NOC coordinate translation"?
yes
In order to have at least some portability of serialized traces #10702, it would be great to have metal runtime supply virtualized core coordinates that get loaded during device runtime. This would enable a trace captured on a harvested n300 to be replayable on another n300 which has the same logical core grid, but a different set of harvested rows.
One way this could work would be to have a special reserved section of local memory which holds a mapping that the `get_noc_address` dataflow API could index under the hood. This would, for example, translate logical core coord `[3, 0]` to physical coord `4-1`.
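A minimal sketch of that lookup, under stated assumptions: `CoordTable`, `PhysicalCoord`, `translate_logical`, and `example_table` are hypothetical names, and the table is passed by reference here rather than read from a reserved memory address, since the actual layout and location would be up to the firmware.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical upper bounds on the logical grid, for illustration only.
constexpr std::size_t kMaxCols = 8;
constexpr std::size_t kMaxRows = 8;

// Hypothetical table layout: one physical x per logical column,
// one physical y per logical row.
struct CoordTable {
    std::uint8_t num_cols;
    std::uint8_t num_rows;
    std::uint8_t physical_x[kMaxCols];
    std::uint8_t physical_y[kMaxRows];
};

struct PhysicalCoord {
    std::uint8_t x;
    std::uint8_t y;
};

// The lookup the dataflow API could do under the hood: two array reads.
inline PhysicalCoord translate_logical(const CoordTable& table,
                                       std::uint32_t logical_x,
                                       std::uint32_t logical_y) {
    assert(logical_x < table.num_cols && logical_y < table.num_rows);
    return PhysicalCoord{table.physical_x[logical_x],
                         table.physical_y[logical_y]};
}

// Made-up table contents, picked so that logical [3, 0] maps to physical 4-1.
inline CoordTable example_table() {
    return CoordTable{4, 2, {1, 2, 3, 4}, {1, 2}};
}
```

Calling `translate_logical(example_table(), 3, 0)` returns x = 4, y = 1, matching the example above. The per-call cost is two indexed loads, which is the lookup overhead a kernel would pay compared with hardware-translated coordinates.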