tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.

Apache License 2.0

#15243: Add retries to cpp tests #15270

Open tt-rkim opened 2 days ago

tt-rkim commented 2 days ago

Ticket

15243

Problem description

CPP tests are unstable and often fail non-deterministically (ND) on main.

What's changed

Add retries to the cpp tests to improve stability. We still need to be able to reproduce the failures so we can raise them with the runtime team.
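
For illustration only, the general idea of retrying a flaky test command can be sketched as follows. This is not the change made in this PR: `run_with_retries`, its parameters, and the example binary path are hypothetical names, and the injectable `runner` is an assumption added so the logic can be exercised without a device.

```python
import subprocess
from typing import Callable, Optional, Sequence, Tuple


def run_with_retries(
    cmd: Sequence[str],
    max_attempts: int = 3,
    runner: Optional[Callable[[Sequence[str]], int]] = None,
) -> Tuple[bool, int]:
    """Run cmd until it exits with code 0 or max_attempts is reached.

    Returns (succeeded, attempts_used). The name and signature are
    hypothetical; they are not taken from the tt-metal repository.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    if runner is None:
        # Default: execute the command as a subprocess and use its exit code.
        runner = lambda c: subprocess.run(list(c)).returncode

    for attempt in range(1, max_attempts + 1):
        if runner(cmd) == 0:
            return True, attempt
        # Log every failed attempt so a pass on retry is still visible in CI.
        print(f"attempt {attempt}/{max_attempts} failed: {' '.join(cmd)}")
    return False, max_attempts


# Example call (hypothetical test binary path):
# ok, attempts = run_with_retries(["./path/to/some_cpp_test"], max_attempts=3)
```

Logging each failed attempt matters here: a retry that silently turns red into green would hide exactly the failures that need to be reproduced for the runtime team.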

Checklist

[ ] Post commit CI passes
[ ] Blackhole Post commit (if applicable)
[ ] Model regression CI testing passes (if applicable)
[ ] Device performance regression CI testing passes (if applicable)
[ ] New/Existing tests provide coverage for changes

tt-rkim commented 2 days ago

passing cpp tests: https://github.com/tenstorrent/tt-metal/actions/runs/11940411942/job/33284197881

TT-billteng commented 1 day ago

Does retrying tt-smi actually work?

ttmchiou commented 1 day ago

If retries work, maybe we can reproduce it by repeating the cpp tests in a custom dispatch, and maybe also try running them on the same machine. If this fixes the issue, doesn't this mean it's an issue caused by the tests?
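
A rough sketch of the "repeat the tests to reproduce" idea, under stated assumptions: `repeat_and_collect_failures` and the example binary path are hypothetical, and the `runner` hook exists only so the loop can be tested without hardware. It runs the same command a fixed number of times and records which iterations failed.

```python
import subprocess
from typing import Callable, List, Optional, Sequence


def repeat_and_collect_failures(
    cmd: Sequence[str],
    repetitions: int,
    runner: Optional[Callable[[Sequence[str]], int]] = None,
) -> List[int]:
    """Run cmd `repetitions` times; return the 1-based iterations that failed.

    Hypothetical helper for illustration; not part of tt-metal.
    """
    if runner is None:
        # Default: execute the command as a subprocess and use its exit code.
        runner = lambda c: subprocess.run(list(c)).returncode

    failed = []
    for iteration in range(1, repetitions + 1):
        # Unlike a retry loop, keep going after a failure to see the pattern.
        if runner(cmd) != 0:
            failed.append(iteration)
    return failed


# Example call (hypothetical test binary path):
# failed = repeat_and_collect_failures(["./path/to/some_cpp_test"], 20)
# print(f"{len(failed)} of 20 runs failed, iterations: {failed}")
```

Running this on a single machine and comparing the failed iterations across machines would help separate a test-caused problem from a machine-specific one.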