tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

BH prep #5174

Open pgkeller opened 7 months ago

pgkeller commented 7 months ago

This is a list of TODOs for BH. Required to run anything:

Phase1 for BH bring up - target 5/2

@abhullar-tt

Reem

Almeet/David


OLD NOTES BELOW:

Infra Flow - Versim scramble

  1. bring up versim in gitlab
  2. get things building on public github
  3. fork to gitlab and test on versim

Development flow (MVP: metal running slow dispatch with a few ops) -

  1. Do the build stuff (make blackhole arch available and ability to build), fork into gitlab, validate versim submodule build.
  2. bring up runtime with slow dispatch and do metal "hello-world" in the github
  3. bring up simple ops with simple config (single core, native numeric) in the github
  4. bring up less simple ops in the github
  5. bring up key features/tool (watcher) + bring in key fixes (64 byte alignment) in the github

note: if we can staff (2) - (5) pretty soon, you should try to test on versim; if not, we should do it on the cards.
note: this flow gives us versim as a backup platform in case things don't work on the cards, but developing and testing on both sides (github/gitlab) is very cumbersome.

TODO:

Phase2 for BH 30-day milestone & Open Source BH SW - target 5/17
Metal goal - Single Tensix OP
[ ] Versim on CI
[ ] MatMul workload to stress-test single Tensix core

Phase3 for BH 60-day milestone - target 6/17
Metal goal - Multi-Tensix OP
[ ] MatMul workload to stress-test Multi Tensix cores
[ ]

To be prioritized -->
[ ] ? NOC/tensix shared access (need to enumerate)
[ ] Eth IRAM? TBD if BH has IRAM on ethernet. If true - need changes for Eth support

Performance changes/new features (not required to run):
[ ] NOC has a RISC-NOC command fifo which allows more non-blocking transactions in flight (legacy interface still works)

SFPU/I Optimizations / New Features:

Debug/analysis:

aliuTT commented 7 months ago

For performance changes:

mo-tenstorrent commented 7 months ago

At some point I overheard BH has interrupts support, at least on BRISC, if I remember correctly. Should confirm its functionality.

DrJessop commented 7 months ago

"64B-alignment for reads" Just wanted to make sure not a typo, since in GS/WH it's 32B-alignment.

DrJessop commented 7 months ago

Also, I recall there being some issues with there being a relatively small NOC max packet size of 8KB. Wondering if BH will have the same limitation.
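To make the packet-size concern concrete, here is a minimal Python sketch of how a large transfer has to be broken up when the NOC caps each packet. The 8KB figure is the one quoted in this comment; whether BH keeps the same limit is still an open question, and `split_transfer` is a hypothetical helper, not a tt-metal API.

```python
from typing import Iterator, Tuple

# 8KB limit as quoted in the comment above; the BH value is unconfirmed.
NOC_MAX_PACKET_BYTES = 8 * 1024


def split_transfer(
    src_addr: int, size_bytes: int, max_packet: int = NOC_MAX_PACKET_BYTES
) -> Iterator[Tuple[int, int]]:
    """Yield (address, length) pairs, each no larger than max_packet bytes."""
    offset = 0
    while offset < size_bytes:
        chunk = min(max_packet, size_bytes - offset)
        yield src_addr + offset, chunk
        offset += chunk
```

A 20KB read would go out as two full 8KB packets followed by one 4KB packet, so a smaller limit directly increases the number of transactions per transfer.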

pgkeller commented 7 months ago

"64B-alignment for reads" Just wanted to make sure not a typo, since in GS/WH it's 32B-alignment.

not a typo, needs to change
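A rough sketch of what the change means for address calculations, in Python for brevity. The 32B (GS/WH) and 64B (BH) read alignments are the values discussed here; the table and function names are hypothetical and only illustrate the idea of making alignment a per-arch value instead of a hard-coded 32.

```python
# Read alignment in bytes per architecture. The 32/64 values come from this
# thread; the dictionary and function names are illustrative assumptions.
READ_ALIGNMENT_BYTES = {"grayskull": 32, "wormhole": 32, "blackhole": 64}


def align_up(addr: int, arch: str) -> int:
    """Round addr up to the next read-aligned address for the given arch."""
    alignment = READ_ALIGNMENT_BYTES[arch]
    return (addr + alignment - 1) // alignment * alignment


def is_read_aligned(addr: int, arch: str) -> bool:
    """Return True if addr satisfies the read alignment of the given arch."""
    return addr % READ_ALIGNMENT_BYTES[arch] == 0
```

An address such as 96 is valid for reads on GS/WH but has to be bumped to 128 on BH, which is why any code that assumes 32B alignment needs to change.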

tt-rkim commented 5 months ago

Some risks I see from infra / CI side:

TT-billteng commented 5 months ago

perhaps we need to fork TTMetal on gitlab side and make necessary modifications

jliangTT commented 5 months ago

Versim cannot run on cloud network / metal CI unfortunately... we need some serious rethinking around CI / dev for this

my 2c: versim is a stop-gap that lets us get a head start on software development, to build and bring up our sw stack and some unit tests until hardware comes back. Once the hardware is back, testing and development will move to hardware and versim should be less relevant. As such, CI on versim may be throwaway work.

jliangTT commented 5 months ago

per @rtawfik01 earlier on the current coverage on versim for Buda -

  1. Unary datacopy -> start with single tile, single core, bfloat16, then test more combinations, i.e. more tiles, different data formats, etc.
  2. Unary SFPU ops
  3. Eltwise binary ops
  4. Reduce ops
  5. Matrix multiply/Convs
  6. Here we can start testing more combination ops/graphs -> layernorms, softmaxes, feedforwards, etc.

With buda, we tested all kernels, but we started with the list above. We always try every op with the simplest scenario first (bfloat16, single core, single tile), then add combinations as the tests pass.
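One possible way to encode that ordering is sketched below in Python. It is only an illustration: the op names follow the list above, `bfloat16` comes from this comment, and the other axis values (`float16`, 2 cores, 4 tiles) and the `build_test_plan` helper are placeholders, not the actual Buda or metal test matrix.

```python
from itertools import product

# Ops in the bring-up order listed above.
OPS = ["unary_datacopy", "unary_sfpu", "eltwise_binary", "reduce", "matmul"]

# Each axis lists its simplest value first. Only bfloat16 / single core /
# single tile come from the thread; the second values are placeholders.
DATA_FORMATS = ["bfloat16", "float16"]
CORE_COUNTS = [1, 2]
TILE_COUNTS = [1, 4]


def build_test_plan():
    """Return (op, data_format, cores, tiles) tuples, simplest scenario first."""
    configs = list(product(DATA_FORMATS, CORE_COUNTS, TILE_COUNTS))
    simplest = configs[0]  # (bfloat16, 1 core, 1 tile)
    # Pass 1: every op in its simplest scenario.
    plan = [(op, *simplest) for op in OPS]
    # Pass 2: the remaining combinations, op by op.
    plan += [(op, *cfg) for op in OPS for cfg in configs[1:]]
    return plan
```

Running the plan in order means a failure in the first pass points at basic op bring-up, while a failure in the second pass points at a specific format, core count, or tile count.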

Also, Reem is getting the llk submodule ready, and we should let her know once metal builds with the current blackhole arch and compiles through the metal stack.

abhullar-tt commented 1 month ago

TODOs as of 31/07/2024

FYI @davorchap