tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

TTNN sweep infrastructure revamp planning #8957

Closed: sjameelTT closed this issue 2 months ago

sjameelTT commented 4 months ago

Some of the desired features (thanks Brian for the suggestions):

Major features:

Minor features:

ntarafdar commented 4 months ago

I was thinking a bit more about hangs. Many apparent hangs turn out not to be plain hangs when watcher is running alongside the test. Do you think we could have our framework interact with watcher and record the watcher crash (e.g., watcher says this kernel on this core read/wrote the wrong memory at this address)?

I personally think this would be a nice thing to track. Each test result would be one of (possibly non-exhaustive list):

I'd like finer granularity than just "hang", since that would make debugging a lot easier.
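
To make the idea concrete, here is a minimal sketch of what finer-grained result classification could look like. Everything in it is an assumption for illustration: the names `SweepStatus`, `SweepResult`, `run_and_classify`, and `read_diagnostic` are hypothetical, the status values are not the list from this comment, and the sketch does not use any real watcher API. The caller would supply `read_diagnostic` to fetch whatever message watcher produced, if any.

```python
import subprocess
import sys
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Sequence


class SweepStatus(Enum):
    # Hypothetical status names, chosen only for illustration.
    PASS = "pass"
    FAIL = "fail"                    # ran to completion with a nonzero exit code
    CRASH = "crash"                  # process was killed by a signal (POSIX only)
    WATCHER_ERROR = "watcher_error"  # timed out and a diagnostic message was found
    TIMEOUT = "timeout"              # timed out with no diagnostic: a plain "hang"


@dataclass
class SweepResult:
    status: SweepStatus
    detail: Optional[str] = None


def run_and_classify(
    cmd: Sequence[str],
    timeout_s: float,
    read_diagnostic: Callable[[], Optional[str]] = lambda: None,
) -> SweepResult:
    """Run one test command in a subprocess and classify the outcome.

    read_diagnostic is a caller-supplied hook (an assumption of this sketch)
    that returns a diagnostic message, or None if nothing was recorded.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        message = read_diagnostic()
        if message:
            return SweepResult(SweepStatus.WATCHER_ERROR, message)
        return SweepResult(SweepStatus.TIMEOUT, f"no result after {timeout_s}s")

    if proc.returncode == 0:
        return SweepResult(SweepStatus.PASS)
    if proc.returncode < 0:
        # On POSIX, a negative return code means termination by that signal.
        return SweepResult(SweepStatus.CRASH, f"killed by signal {-proc.returncode}")
    return SweepResult(SweepStatus.FAIL, proc.stderr.strip() or None)


if __name__ == "__main__":
    result = run_and_classify([sys.executable, "-c", "import time; time.sleep(30)"], 1)
    print(result.status.value, result.detail)
```

The point of the design is that a timeout alone is recorded differently from a timeout that comes with a diagnostic, so the stored result already says where to start debugging.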

sjameelTT commented 4 months ago

@jvasilje this is the issue tracking the sweep infra changes.

sjameelTT commented 4 months ago

Also want to add:

jdesousa-TT commented 2 months ago

Most of the items in this issue were covered in PR #10706. The remaining items are broken out into issues #10855 and #10856.

We are working with @kyma-tt and team to set up a production database for our test vectors and results.

All the details about what has been included in the framework so far are available in the tests/sweep_framework/README.md file and in these slides: https://docs.google.com/presentation/d/1zo1pS40IZAcVFivAUSJIYljjVerGWJccpVJsP5KL00E/edit?usp=sharing