Runtime Stitching #103

First Iteration/Prototype:

Added a toLayout API that takes a device, binary, program index, input index, and input tensor, and returns a tensor whose layout has been converted to match the memory descriptor for that input in the binary.
This gives the user the ability to hold a tensor handle that can be reused across multiple program runs, for example the model weights when running inference. The user can call toLayout once to obtain a handle to the weights on device, then pass that handle to all subsequent forward runs.
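To make the intended usage concrete, here is a minimal sketch of the weights-reuse pattern. All of the types and signatures below (`Device`, `Binary`, `Tensor`, and the exact parameters of `toLayout`) are hypothetical stand-ins, since the real runtime signatures aren't shown here; the point is only the call pattern of converting once and reusing the handle.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the runtime types; real names/signatures differ.
struct Device { int id; };
struct Binary { int handle; };
struct Tensor {
  std::vector<float> data;
  bool onDevice = false;
};

// Sketch of the proposed toLayout API: converts `input` to the layout that the
// binary's memory descriptor specifies for (programIndex, inputIndex).
Tensor toLayout(Device &, Binary &, std::uint32_t /*programIndex*/,
                std::uint32_t /*inputIndex*/, Tensor const &input) {
  Tensor out = input;  // placeholder: a real impl would tilize/move to device
  out.onDevice = true;
  return out;
}
```

Usage would then look like: convert the weights once with `toLayout`, and pass the returned handle to every subsequent forward run instead of re-converting on each call.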
Updated runProgram to implicitly convert inputs/outputs to the layouts described in the binary.
Submit now returns a vector of tensors instead of writing into output containers supplied by the user. Each returned tensor carries an event that can be waited on (not yet implemented).
With this, the user is now responsible for deallocating these tensors. I added a basic deallocate API; we could potentially move this into the destructor of the Tensor class.
The returned tensors could reside anywhere (host, device DRAM, device L1) and could have different layouts/memory configs, all according to the MemoryDesc in the flatbuffer.
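A rough sketch of the new submit shape and the caller-side deallocation responsibility. Again, every type and signature here (`Device`, `Binary`, `Tensor`, `Event`, `OutputTensor`, `submit`, `deallocate`) is a hypothetical stand-in for illustration; the stub simply echoes its inputs rather than running a program.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins; the real runtime types differ.
struct Device {};
struct Binary {};
struct Tensor {
  std::vector<float> data;
  bool allocated = true;
};
struct Event { bool done = false; };  // placeholder until events exist

// Outputs are returned to the caller (paired with an event) rather than
// written into caller-provided containers.
struct OutputTensor {
  Tensor tensor;
  Event event;  // could be waited on once events are implemented
};

std::vector<OutputTensor> submit(Device &, Binary &,
                                 std::uint32_t /*programIndex*/,
                                 std::vector<Tensor *> const &inputs) {
  std::vector<OutputTensor> outputs;
  // Placeholder: a real implementation would execute the program; this stub
  // just echoes the inputs so the ownership flow is visible.
  for (Tensor *in : inputs)
    outputs.push_back(OutputTensor{*in, Event{true}});
  return outputs;
}

// The caller now owns the returned tensors and must release them.
void deallocate(Tensor &t) {
  t.data.clear();
  t.allocated = false;
}
```

The design trade-off: returning outputs keeps the runtime free to place each tensor wherever the flatbuffer's MemoryDesc dictates, at the cost of shifting lifetime management onto the caller.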
TODOs:
Add metal support. Currently I've only added support in ttnn as a prototype.
Polish tensor lifetimes, events, and allocate/deallocate. I'm not very familiar with how tensors are deallocated in metal and can look into it further, but I expect this will get complicated as we introduce async execution/events and eventually multi-device, so it would be great to settle on a clean routine from the start.
Testing. I haven't run any tests with this routine yet; I first want input on whether the overall structure/implementation makes sense to stakeholders. Once we settle on a finalized prototype I'll update the runtime tests/ttrt and test against existing flatbuffers.
Please let me know what you think, any suggestions are appreciated!