Runtime Stitching #103

First Iteration/Prototype:

Added a toLayout API that takes a device, binary, program index, input index, and input tensor, and returns a tensor whose layout has been converted to match the memory descriptor for that input in the binary.
This gives the user the ability to hold a tensor handle that can be reused across multiple program runs, for example the model weights when running inference. The user can call toLayout once to obtain a handle to the weights on device, then pass that handle to all subsequent forward runs.
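To make the intended usage concrete, here is a minimal sketch of the weights-reuse pattern. All of the types and signatures below (`Device`, `Binary`, `Tensor`, and the exact parameters of `toLayout`) are hypothetical stand-ins, since the real runtime signatures aren't shown here; the point is only the call pattern of converting once and reusing the handle.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the runtime types; real names/signatures differ.
struct Device { int id; };
struct Binary { int handle; };
struct Tensor {
  std::vector<float> data;
  bool onDevice = false;
};

// Sketch of the proposed toLayout API: converts `input` to the layout that the
// binary's memory descriptor specifies for (programIndex, inputIndex).
Tensor toLayout(Device &, Binary &, std::uint32_t /*programIndex*/,
                std::uint32_t /*inputIndex*/, Tensor const &input) {
  Tensor out = input;  // placeholder: a real impl would tilize/move to device
  out.onDevice = true;
  return out;
}
```

Usage would then look like: convert the weights once with `toLayout`, and pass the returned handle to every subsequent forward run instead of re-converting on each call.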
Updated runProgram to implicitly convert inputs/outputs to the layouts described in the binary.
Submit now returns a vector of tensors instead of writing into output containers supplied by the user. Each returned tensor carries an event that can be waited on (not yet implemented).
With this, the user is now responsible for deallocating these tensors. I added a basic deallocate API; we could potentially move this into the destructor of the Tensor class.
The returned tensors could reside anywhere (host, device DRAM, device L1) and could have different layouts/memory configs, all according to the MemoryDesc in the flatbuffer.
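A rough sketch of the new submit shape and the caller-side deallocation responsibility. Again, every type and signature here (`Device`, `Binary`, `Tensor`, `Event`, `OutputTensor`, `submit`, `deallocate`) is a hypothetical stand-in for illustration; the stub simply echoes its inputs rather than running a program.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins; the real runtime types differ.
struct Device {};
struct Binary {};
struct Tensor {
  std::vector<float> data;
  bool allocated = true;
};
struct Event { bool done = false; };  // placeholder until events exist

// Outputs are returned to the caller (paired with an event) rather than
// written into caller-provided containers.
struct OutputTensor {
  Tensor tensor;
  Event event;  // could be waited on once events are implemented
};

std::vector<OutputTensor> submit(Device &, Binary &,
                                 std::uint32_t /*programIndex*/,
                                 std::vector<Tensor *> const &inputs) {
  std::vector<OutputTensor> outputs;
  // Placeholder: a real implementation would execute the program; this stub
  // just echoes the inputs so the ownership flow is visible.
  for (Tensor *in : inputs)
    outputs.push_back(OutputTensor{*in, Event{true}});
  return outputs;
}

// The caller now owns the returned tensors and must release them.
void deallocate(Tensor &t) {
  t.data.clear();
  t.allocated = false;
}
```

The design trade-off: returning outputs keeps the runtime free to place each tensor wherever the flatbuffer's MemoryDesc dictates, at the cost of shifting lifetime management onto the caller.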
TODOs:
Add metal support. Currently I've only added support in ttnn as a prototype.
Polish tensor lifetimes, events, and allocate/deallocate. I'm not very familiar with how tensors are deallocated in metal and can look into it further, but I expect this will get complicated as we introduce async execution/events and eventually multi-device, so it would be great to settle on a clean routine from the start.
Testing. I haven't run any tests with this routine yet; I first want input on whether the overall structure/implementation makes sense to stakeholders. Once we settle on a finalized prototype I'll update the runtime tests/ttrt and test against existing flatbuffers.
Please let me know what you think, any suggestions are appreciated!