Create stress tests for galaxy

tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.

Apache License 2.0


Create stress tests for galaxy #10624

Open tt-rkim opened 1 month ago

tt-rkim commented 1 month ago

@jvasilje asked for some 24h stress tests for galaxy like we do for single card.

We should start with TG.

cc: @ttmchiou @TT-billteng

tt-rkim commented 1 month ago

We can help with making the workflow, but galaxy runtime devs should choose what tests to put in there.

aliuTT commented 1 month ago

How often do you want to run the 24h stress tests? It's straightforward enough to add. To start, we can extend the TG unit tests to all chips, and maybe loop that for 24 hours? I think models are still a bit unstable for now and will likely not survive 24 hours 🧟
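A rough sketch of what "loop that for 24 hours" could look like as a driver script. Everything here is illustrative: `run_stress_loop` is a made-up helper name, the command is a placeholder for whichever test binary gets picked, and the per-run timeout is just one assumed way to tell a stuck run apart from a failing one.

```python
import subprocess
import sys
import time


def run_stress_loop(cmd, duration_s, per_run_timeout_s, clock=time.monotonic):
    """Re-run `cmd` until `duration_s` seconds have elapsed, or a run fails or times out.

    Returns (completed_iterations, status) where status is "ok", "failed" or "hang".
    `clock` is injectable so the loop can be exercised without waiting in real time.
    """
    deadline = clock() + duration_s
    iterations = 0
    while clock() < deadline:
        try:
            # subprocess.run kills the child and raises if the timeout expires.
            result = subprocess.run(cmd, timeout=per_run_timeout_s)
        except subprocess.TimeoutExpired:
            return iterations, "hang"
        if result.returncode != 0:
            return iterations, "failed"
        iterations += 1
    return iterations, "ok"


if __name__ == "__main__":
    # Placeholder command (hypothetical): swap in the real unit test invocation.
    placeholder_cmd = [sys.executable, "-c", "print('unit test placeholder')"]
    # A real run would use duration_s=24 * 60 * 60; 1 second keeps the demo short.
    print(run_stress_loop(placeholder_cmd, duration_s=1, per_run_timeout_s=60))
```

Stopping at the first failure or timeout is a design choice for the sketch; a real workflow might prefer to log the failure and keep going so that one bad iteration doesn't hide later ones.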

tt-rkim commented 1 month ago

That's fine with me. When you say unstable, do you mean hangs?

aliuTT commented 1 month ago

Yes.

tt-rkim commented 2 weeks ago

@aliuTT Any updates on the breadth of galaxy unit tests that are stable enough to add to a stress test, and comprehensive enough that they touch all modules?

aliuTT commented 2 weeks ago

We have a CommandQueueSingleCardFixture test suite that could work well here (take a look at tests/tt_metal/tt_metal/unit_tests_fast_dispatch/common/command_queue_fixture.hpp). Today it does two things:

No env flag: only runs on chip 0
TT_METAL_ENABLE_REMOTE_CHIP=1: sweeps test over remote chips

I can add an env flag to sweep over all chips in the system?
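To make the three modes concrete, here is a minimal Python sketch of the selection logic. It is not the fixture's actual code (that lives in the C++ header above). `TT_METAL_ENABLE_REMOTE_CHIP` is the existing flag; `STRESS_SWEEP_ALL_CHIPS` is a hypothetical name for the proposed one, and the sketch assumes chip 0 is the only local chip and every other chip ID counts as remote, which may not match how a real system enumerates devices.

```python
import os


def chips_to_test(num_chips, env=None):
    """Return the chip IDs a test would sweep over for the given environment."""
    env = os.environ if env is None else env
    # Hypothetical flag name for the proposed "all chips" mode.
    if env.get("STRESS_SWEEP_ALL_CHIPS") == "1":
        return list(range(num_chips))
    # Existing flag: sweep only the remote chips (assumed here to be IDs 1..n-1).
    if env.get("TT_METAL_ENABLE_REMOTE_CHIP") == "1":
        return list(range(1, num_chips))
    # No flag set: chip 0 only.
    return [0]


if __name__ == "__main__":
    # 4 chips is an arbitrary count chosen for the demo.
    print(chips_to_test(4, {}))
    print(chips_to_test(4, {"TT_METAL_ENABLE_REMOTE_CHIP": "1"}))
    print(chips_to_test(4, {"STRESS_SWEEP_ALL_CHIPS": "1"}))
```

In this sketch the all-chips flag takes precedence when both are set; that ordering is an arbitrary choice and open to change.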

tt-rkim commented 2 weeks ago

That would be great!

tt-rkim commented 1 week ago

@SeanNijjar Since Allan is OOO right now, are there any important unit tests on TG that we could add to this?

tt-rkim commented 2 days ago

@SeanNijjar ping on this again

Anything we could add?

SeanNijjar commented 2 days ago

Hey @tt-rkim - Sorry I missed this. I was out last week and this must have slipped through the cracks when I came back.

Basically, we've got the TG frequent tests and they're structured such that we can take the same tests and start adding the regular t3k allgather nightly tests to TG as a starting point. I don't think that's something you should be concerned with unless you have spare cycles.

Probably the best starting place would be to take @cfjchu's data parallel llama70B, where he instantiated 4 copies of the llama 70B model on a galaxy, and run this in a loop.