
CCL Op Support: Multi-Device Send/Recv Op #10411

Open cfjchu opened 4 months ago

cfjchu commented 4 months ago

This issue tracks op support for sending data from a sender chip to a receiver chip. This op is gating device pipelining and model-parallel strategies for multi-device.

Work identified during discussion between @SeanNijjar and @cfjchu:

SeanNijjar commented 4 months ago

Design discussion imported from offline discussion:

Send/Receive Op

shape(Mesh_tensor_a) = [128,64] # DeviceView{1,2,7,8}
shape(Mesh_tensor_b) = [64,64] # DeviceView{6,12}

# assignment of a mesh_tensor_a view to another mesh_tensor_b view
# (remote devices)
# mesh_tensor{5,6,11,12}
mesh_tensor_b[:, 32:64] = mesh_tensor_a[:64, :64]

send/receive takes as input the mesh_tensor views above plus (open question) a target device grid, and returns a set of programs mapped to devices (not a 1:1 mapping with tensor locations). This needs to be worked out with TTNN: in order to send from "A" to "B" we need to launch programs on chips between them that hold none of the tensors used by the op. This is a departure from current TTNN APIs, which assume input/output tensors always reside on the chips where the programs using those tensors are launched.
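To make that interface shape concrete, here is a minimal Python sketch. The names (`MeshTensorView`, `Program`, `multi_device_send_recv`) are illustrative assumptions, not the TT-NN API: the op takes source/destination views plus an optional target device grid and returns a per-device program map, where route-only devices still get a program and uninvolved devices may map to null.

```python
# Illustrative sketch only -- MeshTensorView, Program and multi_device_send_recv
# are hypothetical names, not the TT-NN API.
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class MeshTensorView:
    shape: Tuple[int, ...]                        # logical shape of the view
    device_shards: Dict[int, Tuple[slice, ...]]   # device id -> slice of the view held there

@dataclass
class Program:
    device_id: int
    role: str                                     # "sender", "forwarder", or "receiver"

def multi_device_send_recv(
    src: MeshTensorView,
    dst: MeshTensorView,
    target_device_grid: Optional[List[int]] = None,
) -> Dict[int, Optional[Program]]:
    """Return a program (or None) for every device in the target grid."""
    grid = target_device_grid or sorted(set(src.device_shards) | set(dst.device_shards))
    programs: Dict[int, Optional[Program]] = {}
    for dev in grid:
        if dev in src.device_shards:
            programs[dev] = Program(dev, "sender")
        elif dev in dst.device_shards:
            programs[dev] = Program(dev, "receiver")
        else:
            # Routing is omitted in this sketch, so every other device in the grid is
            # treated as a forwarder that launches a program despite holding no tensor;
            # devices off the chosen route would map to None instead.
            programs[dev] = Program(dev, "forwarder")
    return programs
```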

Below is a diagram showing a hypothetical example of a send/receive CCL op sending from a slice of a logical (mesh) tensor "A" to a slice of a logical (mesh) tensor "B". In this case the send/receive targets partial chunks/views of A and B. Because of how A and B are allocated, the source/destination slices/views don't align: if you send "a" and "c" in one packet, you would need to split that logical packet into two separate packets on the destination side, because "a" and "c" don't live on the same chip in mesh tensor "B".

But at the same time, we don't want to force "a" and "c" to be sent as discrete messages for the entire source->destination route. The send/receive "op" should be able - if it thinks it's better - to send {"a", "c"} in a single message all the way to device 6 and only split them there. For this reason, the mesh tensor should not decide how to split the view; instead it should provide information to the op, such as where each part of the mesh tensor lives, so the op can decide how to do the splitting/routing.

[diagram: send/receive between slices of mesh tensors A and B]
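As a rough sketch of the routing decision described above (the helper and the route below are hypothetical, not anything in tt-metal): given a candidate chip route and the destination chip of each chunk, the op can keep chunks in a single packet up to the last hop whose remaining path still reaches every chunk's destination, and split only there.

```python
# Hypothetical sketch: given a linear route of chips and the destination chip for
# each chunk, find the last chip on the route at which all chunks can still travel
# in a single packet before they must be split.
from typing import Dict, List

def last_common_hop(route: List[int], chunk_dest: Dict[str, int]) -> int:
    """Return the index into `route` of the last hop shared by all chunks."""
    dests = set(chunk_dest.values())
    split_idx = 0
    for i, _dev in enumerate(route):
        remaining = set(route[i:])
        if dests <= remaining:   # every chunk's destination is still ahead (or here)
            split_idx = i
        else:
            break
    return split_idx

# Example matching the discussion: "a" and "c" share the route until device 6,
# where "a" stops and "c" continues to device 12. The route itself is made up.
route = [1, 2, 5, 6, 12]
chunk_dest = {"a": 6, "c": 12}
print(route[last_common_hop(route, chunk_dest)])   # -> 6: split the packet only at device 6
```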

Unanswered Questions/To Be Designed:

1) Does the op need to be aware of the parts of the mesh tensor that are outside of the view? Does this matter for the layout of things like sharded tensors?

2) How do we keep this forward compatible such that we can enable dynamic routing, and op implementations can be expressed as sets of tensor slices and partial views?

3) For this case, send/receive requires information about the sandbox/range route from mesh_tensor_a to mesh_tensor_b. The way we figure out which workers/devices are involved in the operation... We need to decouple which devices may get a program launch vs. the devices on the output tensor.

  • How do we even know about this concept of "constraining" a device grid? Sometimes it's necessary for performance / to prevent serialization between independent send/receives.
    • Could iterate on these grids and, based on what the op returns, try to constrain further (as an example).

4) TTNN ask -> accept nulls as program returns

5) TTNN/Mesh tensor??? ask -> invoke send/receive program

  • Need to reconcile that a mesh tensor whose rows/cols are being all-gathered will not want to span the full device grid (otherwise the op will think we want a full 32-chip all-gather on Galaxy), but for send/receive we want to potentially be allowed to span the full grid, because the mesh tensor doesn't a priori know the route taken.
    • I think it makes sense to have the CCL op accept the full grid in both cases but look to the input tensor views to understand the op: in the all-gather case it would be given tensor views that span only a single row or column of devices, and so wouldn't go and invoke a 32-chip ring; for send/receive, it knows it needs to route, so it will find a path somewhere through the grid (see the sketch after this list).
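A tiny sketch of that last point (illustrative names only, not TT-NN code): the op is handed the full grid in both cases but derives the actual participants from the input tensor views, so an all-gather over a single row never escalates to the whole 32-chip Galaxy grid, while a send/receive may consider the whole grid when picking a route.

```python
# Hypothetical sketch of "accept the full grid, but look at the tensor views".
from typing import List, Set

def participating_devices(
    full_grid: List[List[int]],     # 2D device grid the op is always handed
    tensor_view_devices: Set[int],  # devices that hold the input tensor view(s)
    op: str,                        # "all_gather" or "send_recv"
) -> Set[int]:
    if op == "all_gather":
        # All-gather only rings over the devices holding the view
        # (e.g. a single row or column), never the whole grid.
        return set(tensor_view_devices)
    if op == "send_recv":
        # Send/receive may route through any grid device between source and
        # destination; a real implementation would pick a concrete path, here
        # we just return the whole grid as the candidate set.
        return {d for row in full_grid for d in row}
    raise ValueError(f"unknown op {op!r}")

grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
print(participating_devices(grid, {1, 2, 3, 4}, "all_gather"))   # just the row holding the view
print(participating_devices(grid, {1, 2, 6, 12}, "send_recv"))   # whole grid is a routing candidate
```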