tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

[Blackhole xfunc] Get faster BH machines #10976

Closed abhullar-tt closed 1 month ago

abhullar-tt commented 2 months ago

Two parts:

  1. Need faster (and more) hosts so that CI doesn't become a bottleneck as more op owners start to do dev for BH. We currently have 2 machines for CI.
  2. Need cards with a faster AI CLK to unblock perf testing.

FYI @davorchap @pgkeller

abhullar-tt commented 2 months ago

@TT-billteng do you know if there are any BHs marked for Cloud/Metal CI that will match the existing CI machine specs? If not, do you know how many we may need if we eventually want BH to be tested as part of post-commit?

TT-billteng commented 2 months ago

> @TT-billteng do you know if there are any BHs marked for Cloud/Metal CI that will match the existing CI machine specs. If not, do you know how many we may need if we eventually want BH to be tested as part of post commit?

So the AMD 7950X3D we just ordered should be much faster than any individual cloud VM instance we currently have. If this is still too slow, we need to investigate host perf on BH. Cloud is in the process of densifying the machines with more cards (upgrading each machine from 4 cards to 8). I'm in the process of qualifying perf on 8-vCPU VMs (we've been running CI on 14 vCPUs). If BH actually needs far more host-side resources for whatever reason, this will upend cloud's roadmap and we need to let them know ASAP.

As for putting Blackhole CI into regular post-commit, it'll depend on how many tests we activate vs. how many BH runners we have. As a point of reference, we currently have 30-35 CI runners of each card type (E150/N150/N300).

abhullar-tt commented 2 months ago

We have faster BH machines, but it seems like post-commit hasn't been running any faster due to https://github.com/tenstorrent/tt-metal/issues/11717

pgkeller commented 1 month ago

@abhullar-tt can we close this?

abhullar-tt commented 1 month ago

yes