Currently, our description of massive parallelism comes from sharding, but that is confined to a single core. Our implementation of multi-device sharding is approximated by a slice op on host, followed by sharded buffers on the respective devices.
This is because the allocator currently operates at the single-device level. Changing this is a large task, which is why, for now, we wish to represent a multi-device sharded tensor across N devices as N separately allocated tensors.
This, however, increases verbosity for the user and adds the mental burden of tracking the slice and gather ops as well as N separate tensors.
Furthermore, ops that ingest a single tensor need to be replicated for each individual device.
We propose adding a MultiDeviceStorage struct which can be a variant under DeviceStorage.
The tensor object requires minimal changes, as the storage type only matters when moving to and from device.
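As a rough sketch of what the variant-based storage could look like (all names and fields here are assumptions for illustration, not the actual tt-lib definitions), the tensor's storage type becomes a variant, and only the host/device movement paths need to branch on it:

```cpp
#include <cstdint>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical stand-in for the existing single-device storage.
struct DeviceStorage {
    int device_id;
    // ... buffer handle, memory config, etc.
};

// Proposed multi-device storage: one (start, end) element range of the
// original tensor per device.
struct MultiDeviceStorage {
    std::vector<std::pair<uint32_t, uint32_t>> tensor_start_end;
};

// The tensor holds a variant; code paths other than host<->device
// movement can treat the tensor object exactly as before.
using Storage = std::variant<DeviceStorage, MultiDeviceStorage>;
```

This keeps the tensor object itself unchanged; only `to` and `cpu` need to dispatch on which alternative the storage holds.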
MultiDeviceStorage:
    array<num_devices_available, pair<start, end>> tensor_start_end;
For each available device, the MultiDeviceStorage holds an associated pair corresponding to the slice of the original tensor that will exist on that device. This flexibility allows us to describe operations that replicate portions of a tensor or select subsets of a tensor.
Ops ingesting a multi-device tensor can manipulate this field depending on the op.
Currently, other than the all-gather op, most ops are single-device ops. Extending them to multi-device amounts to adding evenly divided slices to tensor_start_end on both the input and the output.
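A minimal sketch of computing those evenly divided slices (the helper name `even_slices` is hypothetical, and it assumes the element count divides evenly across the devices):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: divide num_elements evenly across num_devices,
// producing the per-device (start, end) ranges stored in tensor_start_end.
// Assumes num_elements is divisible by num_devices.
std::vector<std::pair<uint32_t, uint32_t>> even_slices(
    uint32_t num_elements, uint32_t num_devices) {
    std::vector<std::pair<uint32_t, uint32_t>> slices;
    const uint32_t per_device = num_elements / num_devices;
    for (uint32_t d = 0; d < num_devices; ++d) {
        // Device d owns the half-open range [d * per_device, (d + 1) * per_device).
        slices.emplace_back(d * per_device, (d + 1) * per_device);
    }
    return slices;
}
```

Replication, by contrast, would simply be every device's pair covering the whole tensor, i.e. (0, num_elements) for each entry.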
The slice information will be used by the to and cpu ops of a tensor, which are responsible for moving the tensor between host and device.
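As an illustration of how the host-to-device path might consume the slice ranges, here is a simplified sketch (the function `to_devices` and the use of plain vectors as device buffers are assumptions; the real op would write into device-allocated buffers):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch of host -> multi-device movement: each device
// receives only its (start, end) slice of the flattened host tensor.
std::vector<std::vector<float>> to_devices(
    const std::vector<float>& host_data,
    const std::vector<std::pair<uint32_t, uint32_t>>& tensor_start_end) {
    std::vector<std::vector<float>> device_buffers;
    for (const auto& [start, end] : tensor_start_end) {
        // Copy the half-open range [start, end) for this device.
        device_buffers.emplace_back(host_data.begin() + start,
                                    host_data.begin() + end);
    }
    return device_buffers;
}
```

The cpu (gather) direction would be the inverse: concatenating or scattering each device's slice back into the host tensor according to the same ranges.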
Since this is all done at the tt-lib level, no changes should be required in TTNN to make this possible. The tensor object remains the same, and TTNN wraps the to and cpu operations, so everything else should continue to work.