tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Multi Device Object Spec #5395

Closed ntarafdar closed 4 months ago

ntarafdar commented 7 months ago

Currently our description of massive parallelism comes from sharding, but that is within a single device. Our implementation of multi-device sharding is approximated using a slice op on host followed by sharded buffers on the respective devices. This is because the allocator currently operates at the single-device level. Changing this is a big task, which is why for now we wish to keep a multi-device sharded tensor across N devices as N separately allocated tensors.
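The current host-side approximation can be sketched as follows. This is a minimal, hypothetical illustration of the slicing logic only (the `even_slices` helper and row-based slicing are assumptions, not real tt-lib API):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch: evenly slice a host tensor of `num_rows` rows across
// `num_devices` devices, returning a (start, end) row range per device.
// Remainder rows are spread over the leading devices.
std::vector<std::pair<std::size_t, std::size_t>>
even_slices(std::size_t num_rows, std::size_t num_devices) {
    std::vector<std::pair<std::size_t, std::size_t>> slices;
    std::size_t base = num_rows / num_devices;
    std::size_t rem = num_rows % num_devices;
    std::size_t start = 0;
    for (std::size_t d = 0; d < num_devices; ++d) {
        std::size_t len = base + (d < rem ? 1 : 0);
        slices.emplace_back(start, start + len);
        start += len;
    }
    return slices;
}
```

Each of the N resulting slices would then be allocated and written to its device as an independent tensor, which is exactly the bookkeeping the proposal below aims to hide.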

This, however, increases verbosity for the user and the mental burden of tracking the slice and gather ops as well as N separate tensors. Furthermore, ops that ingest a single tensor need to be replicated for each individual device.

We propose adding a MultiDeviceStorage struct which can be a variant under DeviceStorage. The tensor object requires minimal changes, as the storage type only matters when moving to and from device.

struct MultiDeviceStorage {
  // one (start, end) pair per available device
  array<pair<start, end>, num_devices_available> tensor_start_end;
};

For each available device, the MultiDeviceStorage will hold an associated pair describing the slice of the original tensor that will exist on that device. This flexibility allows us to describe operations that might replicate portions of a tensor or select subsets of a tensor. Ops ingesting a multi-device tensor can manipulate this field as appropriate for the op. Currently, other than the all-gather op, most ops are single-device ops; extending them to multi-device would mean adding evenly divided slices to tensor_start_end on both the input and output.
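The two main patterns described above (replication and even sharding) can be sketched with the proposed struct. This is a speculative illustration assuming 4 devices and flat element offsets; the constructor helpers and `kNumDevices` are made up for the example, not real tt-lib types:

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Assumed device count for this sketch.
constexpr std::size_t kNumDevices = 4;

// Hypothetical version of the proposed storage variant; names mirror
// the issue text, not the actual implementation.
struct MultiDeviceStorage {
    // Per device: [start, end) slice of the original host tensor.
    std::array<std::pair<std::size_t, std::size_t>, kNumDevices> tensor_start_end;
};

// Replication: every device holds the whole tensor.
MultiDeviceStorage replicated(std::size_t num_elems) {
    MultiDeviceStorage s;
    s.tensor_start_end.fill({0, num_elems});
    return s;
}

// Even sharding: device d holds its contiguous 1/N chunk
// (assumes num_elems divides evenly for brevity).
MultiDeviceStorage sharded(std::size_t num_elems) {
    MultiDeviceStorage s;
    std::size_t chunk = num_elems / kNumDevices;
    for (std::size_t d = 0; d < kNumDevices; ++d)
        s.tensor_start_end[d] = {d * chunk, (d + 1) * chunk};
    return s;
}
```

An op extending to multi-device would produce the `sharded`-style ranges on both its input and output storage; an op that broadcasts weights would use the `replicated` pattern.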

The slice information will be used by the to and cpu ops of a tensor, which are responsible for moving the tensor between host and device.
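For the cpu direction, the gather step could look like the sketch below. The `gather_to_host` helper and flat float buffers are assumptions for illustration; the real op would issue per-device reads rather than copy from in-memory vectors:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch: reassemble a host tensor from per-device buffers
// using the stored (start, end) slices from MultiDeviceStorage.
std::vector<float> gather_to_host(
        std::size_t num_elems,
        const std::vector<std::pair<std::size_t, std::size_t>>& slices,
        const std::vector<std::vector<float>>& device_buffers) {
    std::vector<float> host(num_elems, 0.0f);
    for (std::size_t d = 0; d < slices.size(); ++d) {
        auto [start, end] = slices[d];
        for (std::size_t i = start; i < end; ++i)
            host[i] = device_buffers[d][i - start];  // copy device d's slice into place
    }
    return host;
}
```

The to op is the mirror image: it would consult the same slices to decide which region of the host tensor to write to each device.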

Since this is all done at the tt-lib level, there should not be any changes needed at the TTNN level. The tensor object remains the same, and TTNN wraps the to and cpu operations, so everything else should work.

ntarafdar commented 7 months ago

Enumerate Multi-Device Ops to drive this

jliangTT commented 7 months ago

Not sure which board to triage this to. Putting it in ttnn-infra for now; let me know if this is more of a kernel thing.