tenstorrent / tt-mlir

Tenstorrent MLIR compiler
https://tenstorrent.github.io/tt-mlir/
Apache License 2.0

Support `toLayout` API #103

Open nsmithtt opened 2 months ago

nsmithtt commented 2 months ago

Per the new Runtime Stitching Spec, we want to implement the new toLayout runtime API.

https://tenstorrent.github.io/tt-mlir/specs/runtime-stitching.html

Hopefully we can write a simple test using gtest #102 to demonstrate its functionality.

nsmithtt commented 1 month ago

@jnie-TT, @pilkicTT, @kmabeeTT I was thinking a bit more about this API, and I think we should tweak it slightly so that the runtime doesn't have to use the flatbuffer API, and so that we capture more advanced use cases:

Current Proposal

Tensor toLayout(Tensor tensor, ::tt::target::TensorDesc* tensorDesc);

What I think we should change it to:

Event toLayout(Tensor dst, Tensor src, std::vector<Event> dependencies = {});

We also need an additional API for creating device tensors. Currently we only have an API for creating host tensors:

Tensor createTensor(std::shared_ptr<void> data,
                    std::vector<std::uint32_t> const &shape,
                    std::vector<std::uint32_t> const &stride,
                    std::uint32_t itemsize, ::tt::target::DataType dataType);

Device tensor APIs (allocation can potentially fail if there isn't room on the device):

Tensor allocateInputTensor(Binary executable, int inputIndex);
Tensor allocateOutputTensor(Binary executable, int outputIndex);

Example usage:
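
A rough sketch of how the pieces could fit together. Note that submit, wait, device, and loadHostData are assumptions on my part (following the runtime stitching spec direction), not settled API:

// Sketch only: `submit`, `wait`, and `loadHostData` are illustrative names,
// and the Float32 enum value is assumed.
std::shared_ptr<void> data = loadHostData();  // hypothetical host-side loader
std::vector<std::uint32_t> shape = {32, 32};
std::vector<std::uint32_t> stride = {32, 1};

// Wrap the user's buffer in a host tensor.
Tensor host = createTensor(data, shape, stride, /*itemsize=*/4,
                           ::tt::target::DataType::Float32);

// Device tensors whose layouts are derived from the executable.
Tensor input = allocateInputTensor(executable, /*inputIndex=*/0);
Tensor output = allocateOutputTensor(executable, /*outputIndex=*/0);

// Copy/convert the host tensor into the device input layout.
Event staged = toLayout(input, host);

// Launch the program once staging has completed, then wait for the result.
Event done = submit(device, executable, /*programIndex=*/0,
                    {input}, {output}, /*dependencies=*/{staged});
wait(done);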

One use case we want to consider is scheduling work on the device, then having the host prepare inputs for the next iteration, potentially even loading the next set of inputs into device DRAM before the current workload has finished.
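
Concretely, that overlap could look like the following double-buffered loop (same illustrative names as above; hostTensorForIteration is a hypothetical producer of host tensors):

// Sketch of the pipelined use case: while iteration i runs on the device,
// the host stages the input for iteration i+1 into device DRAM.
const int numIterations = 8;
Tensor inputs[2] = {allocateInputTensor(executable, 0),
                    allocateInputTensor(executable, 0)};
Tensor output = allocateOutputTensor(executable, 0);

Event staged = toLayout(inputs[0], hostTensorForIteration(0));
for (int i = 0; i < numIterations; ++i) {
  // Run iteration i as soon as its input upload finishes.
  Event run = submit(device, executable, 0,
                     {inputs[i % 2]}, {output}, {staged});
  if (i + 1 < numIterations) {
    // Overlap: stage the next iteration's input while the device is busy.
    staged = toLayout(inputs[(i + 1) % 2], hostTensorForIteration(i + 1));
  }
  wait(run);
}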

Let me know what you all think.

jnie-TT commented 1 month ago

Hey @nsmithtt, this looks great! So I guess the flatbuffer won't have any tensor location info associated with the input/output tensors, and it's up to the user to call the subsequent APIs to allocate these tensors as desired. To stitch programs together, they would just need to create an input tensor and call toLayout from the output tensor of the previous program to the input tensor just created. Is that correct?
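
If I'm reading the proposal right, the stitching step would look roughly like this (a sketch; submit and the program/tensor names are illustrative, with inA staged as in the example above):

// Sketch: relayout program A's output into the input layout program B expects.
Tensor outA = allocateOutputTensor(programA, 0);
Event ranA = submit(device, programA, 0, {inA}, {outA}, {});

// Create an input tensor for program B and stitch A's output into it.
Tensor inB = allocateInputTensor(programB, 0);
Event stitched = toLayout(inB, outA, {ranA});  // waits on A finishing

Tensor outB = allocateOutputTensor(programB, 0);
Event ranB = submit(device, programB, 0, {inB}, {outB}, {stitched});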

jnie-TT commented 1 month ago

@nsmithtt, do you think we could modify createTensor and add a parameter to specify the location:

Tensor createTensor(std::shared_ptr<void> data,
                    std::vector<std::uint32_t> const &shape,
                    std::vector<std::uint32_t> const &stride,
                    std::uint32_t itemsize, ::tt::target::DataType dataType,
                    Location location);

Where the Location enum can specify host or device L1/DRAM. This way the APIs are consistent; otherwise it seems weird that for createTensor the user needs to parse the executable and pass in the details, whereas for allocateInput/Output the user passes in the executable directly.
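
For illustration, the enum could look something like this (values are just a guess on my part):

// Hypothetical sketch of the proposed Location enum.
enum class Location : std::uint8_t {
  Host,        // tensor backed by host memory
  DeviceDram,  // tensor resident in device DRAM
  DeviceL1,    // tensor resident in device L1
};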

nsmithtt commented 1 month ago

Where the Location enum can specify host or device L1/DRAM. This way the APIs are consistent; otherwise it seems weird that for createTensor the user needs to parse the executable and pass in the details, whereas for allocateInput/Output the user passes in the executable directly.

Hey @jnie-TT, I agree, it is a bit weird and it'd be good to make these APIs cleaner. The reason this gets tricky is that creating a device tensor needs a lot of extra layout info, which is encoded in the flatbuffer binary, but we don't want the runtimes to be in the business of decoding a flatbuffer to get at this information. To the runtime, the binary is just a blob.
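
In other words, the decoding would live behind the runtime API boundary. Something like this sketch, where getInputDesc and allocateDeviceTensor are hypothetical internal helpers:

// Sketch only: a flatbuffer-aware layer (not the backend runtimes) decodes
// the binary so callers never have to. Both helpers are hypothetical.
Tensor allocateInputTensor(Binary executable, int inputIndex) {
  ::tt::target::TensorDesc const *desc =
      getInputDesc(executable, inputIndex);  // decodes layout info from the blob
  return allocateDeviceTensor(*desc);        // shape, grid, memory space, etc.
}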