tenstorrent / tt-umd

User-Mode Driver for Tenstorrent hardware
Apache License 2.0

Cluster doesn't support multiple unconnected clusters #226

Open broskoTT opened 1 week ago

broskoTT commented 1 week ago

A follow-up from #165. Because of how get_closest_mmio_capable_chip and create-ethernet-map work, Cluster does not handle multiple unconnected clusters correctly. Copying some details from the chat:

get_closest_mmio_capable_chip uses only the eth_coord_t data of each chip to pick the closest MMIO-capable chip. It looks to me like this code assumes that all the chips are connected in some way, and it doesn't handle separate clusters of chips (like the three N300 cards that I have). As a result, the code currently returns the same chip for any remote chip, so it will sometimes return a local chip that isn't connected to that remote chip.
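To make the failure mode concrete, here is a minimal sketch in Python. It is not the tt-umd code. The topology, the two-field coordinates, and every function name in it are hypothetical, and it assumes that each unconnected card assigns its coordinates independently, so the same coordinates repeat across cards. It compares a coordinates-only lookup with one that first restricts the candidates to chips reachable over ethernet links.

```python
from collections import deque

# Hypothetical topology: three N300-style cards, each with one MMIO-capable
# (local) chip and one remote chip that is reachable only over ethernet.
# Assumption: each card assigns coordinates on its own, so they repeat.
coords = {0: (0, 0), 1: (0, 0), 2: (0, 0), 3: (1, 0), 4: (1, 0), 5: (1, 0)}
mmio_chips = [0, 1, 2]
eth_links = [(0, 3), (1, 4), (2, 5)]  # one link per card, none between cards


def distance(a, b):
    """Manhattan distance between two (x, y) coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def closest_by_coords_only(chip):
    """Pick an MMIO chip from coordinates alone, ignoring connectivity."""
    return min(mmio_chips, key=lambda m: distance(coords[m], coords[chip]))


def reachable_from(chip):
    """Return every chip connected to `chip` through ethernet links (BFS)."""
    adjacency = {c: set() for c in coords}
    for a, b in eth_links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = {chip}
    queue = deque([chip])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def closest_connected(chip):
    """Pick the closest MMIO chip among those actually connected to `chip`."""
    component = reachable_from(chip)
    candidates = [m for m in mmio_chips if m in component]
    if not candidates:
        raise ValueError(f"chip {chip} has no connected MMIO-capable chip")
    return min(candidates, key=lambda m: distance(coords[m], coords[chip]))
```

In this toy setup the coordinates-only lookup resolves every remote chip (3, 4, 5) to chip 0, while the connectivity-aware lookup resolves them to chips 0, 1 and 2 respectively.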

Daniel Rosen: But imo we don’t need to worry too much about these corner cases, I think that tt-fabric is looking to be ready in December-ish (if not sooner) and that’ll completely change the cem and coordinate requirements

pjanevskiTT commented 1 day ago

@broskoTT label and assign priority for this please, not sure about it

broskoTT commented 9 hours ago

@abhullar-tt raised that they need this for clusters with multiple Blackholes, so I'm bumping the priority.