tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Track and report number of retrains for each Eth Link during a test #10453

Open davorchap opened 3 months ago

davorchap commented 3 months ago

fyi @ubcheema @aliuTT @SeanNijjar

mo-tenstorrent commented 3 months ago

As per our conversation with Bing Li, 0x1ec0 + (7 * 0x4) is the L1 location for the counter.
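
For reference, the address arithmetic above works out as in the sketch below. The base `0x1ec0`, the index `7`, and the `0x4` stride come from the comment; the constant and function names are hypothetical, and the 32-bit counter width is an assumption inferred from the 4-byte stride. Actually reading L1 on the device is not shown.

```python
# Values quoted in the comment above; the names are illustrative only.
ETH_COUNTER_BASE = 0x1EC0   # base L1 address
WORD_SIZE = 0x4             # stride between entries, in bytes
RETRAIN_COUNT_INDEX = 7     # index of the retrain counter entry


def retrain_counter_addr(base=ETH_COUNTER_BASE,
                         index=RETRAIN_COUNT_INDEX,
                         word_size=WORD_SIZE):
    """Return the L1 byte address of the retrain counter."""
    return base + index * word_size


def retrains_during_test(before, after, width_bits=32):
    """Counter delta across a test, tolerating a single wraparound.

    Assumes a free-running unsigned counter of `width_bits` bits
    (32 is an assumption, not confirmed in the thread).
    """
    return (after - before) % (1 << width_bits)


print(hex(retrain_counter_addr()))  # 0x1edc
```

Sampling the counter once before and once after a test and passing both values to `retrains_during_test` would give a per-link retrain count for that test.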

@davorchap how are you imagining this? Do you want this in the op report, collected per op?

SeanNijjar commented 3 months ago

It's outside the scope of this issue, but I think we also need to track this outside of programs. This information should be exported at the system level as well, so that a system monitor like tt-smi can present stats in a link health view (time since last retrain, average downtime, % uptime, etc.).
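
A minimal sketch of how the link health stats mentioned above could be derived, assuming the system records each retrain as a (start, end) timestamp pair per link. The `RetrainEvent` type and `link_health` function are hypothetical and are not part of tt-smi or tt-metal.

```python
from dataclasses import dataclass


@dataclass
class RetrainEvent:
    """One retrain on a link; timestamps in seconds (hypothetical record format)."""
    start_s: float
    end_s: float


def link_health(events, window_start_s, now_s):
    """Summarize link health over the window [window_start_s, now_s]."""
    window = now_s - window_start_s
    # Total time the link spent retraining (treated here as downtime).
    total_down = sum(e.end_s - e.start_s for e in events)
    return {
        "retrain_count": len(events),
        "time_since_last_retrain_s":
            now_s - max(e.end_s for e in events) if events else None,
        "avg_downtime_s": total_down / len(events) if events else 0.0,
        "uptime_pct":
            100.0 * (window - total_down) / window if window > 0 else None,
    }
```

For example, two retrains lasting 2 s and 4 s within a 100 s window give an average downtime of 3 s and 94% uptime.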