tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

New FW drop breaks ability to target eth cores on BH P100s #14613

Open abhullar-tt opened 2 weeks ago

abhullar-tt commented 2 weeks ago

The new eth FW and the FW for tt-smi reset have broken our ability to target / load FW on eth cores. This is seen on P100s, which do not have any active eth cores.

This problem is showing up on both P100s and P150s.

CI machines are running this new FW and currently no tests can be run.

Workaround in main: push a patch that makes it look like BH has no eth cores.

bingliTT commented 2 weeks ago

Could you walk me through the sequence being used to load fw on the eth cores?

abhullar-tt commented 2 weeks ago

> Could you walk me through the sequence being used to load fw on the eth cores?

  1. risc reset is asserted on all eth cores
  2. host writes the erisc FW binary to L1
  3. host programs a jump to the start address of the FW
  4. host issues an L1 barrier to make sure steps 2 and 3 have completed
  5. risc reset is deasserted on the eth cores
  6. host polls the eth cores for a done signal (supplied by the eth FW)

This sequence has been working for the P100s, and nothing has changed here with the new FW.
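
For readers who want the ordering spelled out, the six steps above can be modelled as a small host-side simulation. This is only a sketch: `FakeEthCore`, `load_eth_fw`, `FW_BASE`, and the `responsive` flag are hypothetical names made up for illustration and are not tt-metal or UMD APIs. Posted writes are queued and only committed at the barrier, which is the assumption that makes step 4 matter.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

FW_BASE = 0x1000  # assumed L1 load address, for illustration only


@dataclass
class FakeEthCore:
    """Hypothetical stand-in for one eth core; not a tt-metal type."""
    in_reset: bool = False
    l1: Dict[int, bytes] = field(default_factory=dict)
    jump_target: Optional[int] = None
    pending: List[Tuple[str, int, bytes]] = field(default_factory=list)
    responsive: bool = True  # set False to model a core that never reports done
    done: bool = False

    def flush(self) -> None:
        """Commit posted host writes (what the L1 barrier waits for)."""
        for kind, addr, data in self.pending:
            if kind == "l1":
                self.l1[addr] = data
            else:
                self.jump_target = addr
        self.pending.clear()

    def tick(self) -> None:
        """Simulated FW: report done once out of reset with FW at the jump target."""
        if self.responsive and not self.in_reset and self.jump_target in self.l1:
            self.done = True


def load_eth_fw(cores: List[FakeEthCore], fw: bytes, max_polls: int = 5) -> bool:
    for c in cores:                  # 1. assert risc reset on all eth cores
        c.in_reset = True
    for c in cores:                  # 2. post the FW binary write to L1
        c.pending.append(("l1", FW_BASE, fw))
    for c in cores:                  # 3. post the jump to the FW start address
        c.pending.append(("jump", FW_BASE, b""))
    for c in cores:                  # 4. L1 barrier: steps 2 and 3 are committed
        c.flush()
    for c in cores:                  # 5. deassert risc reset
        c.in_reset = False
    for _ in range(max_polls):       # 6. poll for the done signal, with a bound
        for c in cores:
            c.tick()
        if all(c.done for c in cores):
            return True
    return False                     # at least one core never reported done
```

Calling `load_eth_fw([FakeEthCore(), FakeEthCore()], b"\x13")` walks both simulated cores through the sequence. The bounded poll in step 6 is a design choice of this sketch so that a core that never signals done shows up as a `False` return instead of a hang.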

abhullar-tt commented 2 weeks ago

Adding @TTDRosen for visibility

abhullar-tt commented 1 week ago

This issue was expected to be hit on ethernet cores with active links on P150s but is not expected on:

It looks like we are running into this in the unexpected cases because the FW on these link-less cores is running an init sequence. The proposed solution from @bingliTT is for the FW on these cores to skip straight to a heartbeat counter, which shouldn't cause any issues if Metal is using eth risc0.
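
As a rough illustration of the proposal, here is a toy model of the boot flow. `eth_fw_boot` and its parameters are hypothetical, the real FW is not written in Python, and the assumption that cores with a link still run init before reaching the heartbeat is mine rather than something stated in this thread.

```python
from typing import List, Tuple


def eth_fw_boot(has_active_link: bool, ticks: int = 3) -> Tuple[List[str], int]:
    """Toy model of the proposed flow: link-less cores skip the init sequence."""
    steps: List[str] = []
    if has_active_link:
        # Assumed: cores with an active link still run the init sequence.
        steps.append("init_sequence")
    # Proposed: link-less cores go straight here.
    heartbeat = 0
    for _ in range(ticks):
        heartbeat += 1  # a counter the host could watch advancing
    steps.append("heartbeat")
    return steps, heartbeat
```

With this model, `eth_fw_boot(False)` never enters the init path, which is the behaviour being proposed for the link-less cores on P100s.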

abhullar-tt commented 1 week ago

FYI @ttmchiou