tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Create an intermediate tensor on the stack #10325

Open eyonland opened 1 month ago

eyonland commented 1 month ago

Problem: To handle allocations of intermediate tensors within composite operations without causing fragmentation, we should use a stack for these allocations. Allocating a memory pool from the heap does not compose well when composite functions are built from other composite functions.

When a compiler invokes an operation, it often needs to preallocate the output tensors before calling it, which lets the compiler manage allocations on the L1 heap efficiently. However, if the operation then makes new allocations on both the heap and the stack (due to circular buffers), the compiler cannot assume a single contiguous memory requirement for the operation given a specific set of arguments. Allocating intermediates on a stack makes the memory requirement predictable and contiguous, which allows composite functions to be composed without fragmentation.

Proposed Solution:

To simplify the requirements for intermediate tensors, we propose that these allocations happen the way one would expect for a traditional function that needs memory on the stack. Specifically, intermediate tensors belong to the stack allocation region, alongside circular buffers. This design decision would let a compiler request the memory requirement of an operation, for a given set of arguments, as a single number of bytes needed from L1. It would also let us manage intermediate tensors more gracefully, independently from the standard tensors used during model execution. Visualized here: https://excalidraw.com/#json=iNMQYSMrab3bxza0QFNp-,qwn7esoMB4rNjiXCJFke9g

The function create_scalar in ttnn/cpp/ttnn/operations/creation.hpp currently allocates memory from the heap. The intent here is to update the code to allocate from the stack instead, and to update the offsets of the circular buffers to account for this tensor allocation when an operation is called.

TOP OF L1
| heap allocated tensors  |
| circular buffers        |
| stack allocated tensors |
L1 UNRESERVED BASE
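To make the "single number of bytes" idea concrete, here is a minimal sketch of how a composite op could report its L1 stack requirement compositionally. All names, the CB accounting, and the softplus decomposition are hypothetical illustrations, not existing TT-NN/tt-metal APIs:

```cpp
#include <algorithm>
#include <cstdint>

struct UnaryArgs {
    std::uint32_t tile_bytes;          // bytes per tile page
    std::uint32_t num_tiles_per_core;  // tiles each core must hold
};

// Leaf ops: only circular buffers, no intermediates.
std::uint32_t exp_l1_stack_bytes(const UnaryArgs& a) {
    return 2 * a.tile_bytes;  // one input CB + one output CB, one tile each
}
std::uint32_t log_l1_stack_bytes(const UnaryArgs& a) {
    return 2 * a.tile_bytes;
}

// Composite op, e.g. softplus(x) = log(1 + exp(x)): its intermediate tensor
// (holding exp(x)) lives on the stack while the child ops run above it, so
// the total requirement is still expressible as a single byte count.
std::uint32_t softplus_l1_stack_bytes(const UnaryArgs& a) {
    std::uint32_t intermediate = a.num_tiles_per_core * a.tile_bytes;
    return intermediate + std::max(exp_l1_stack_bytes(a), log_l1_stack_bytes(a));
}
```

Because stack allocations are contiguous and LIFO, a parent composite's requirement is its own live intermediates plus the peak over its children, which is what makes nesting composable.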

abhullar-tt commented 1 month ago

Changes needed to allow allocations on the stack and have top of L1 be contiguous:

yan-zaretskiy commented 1 month ago

How would you know the CB offset before running the program? If we have, say, an interleaved Tensor in L1, then one needs to know the number of pages and the core grid it's on to figure out its size on the "stack". The core grid is static, I assume, for a given device. But the number of pages as a function of shape is a runtime value, no?
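For illustration, the arithmetic behind this question might look like the following sketch; the tile size, shape, and bank count are assumed values for a worked example, not numbers queried from a real device:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const std::uint32_t tile_bytes = 32 * 32 * 2;          // one bfloat16 tile page = 2048 B

    const std::uint32_t H = 256, W = 256;                  // runtime shape
    const std::uint32_t num_pages = (H / 32) * (W / 32);   // 64 tile pages
    const std::uint32_t num_l1_banks = 64;                 // static for a given device

    // Interleaving round-robins pages across banks; per-bank usage rounds up.
    const std::uint32_t pages_per_bank = (num_pages + num_l1_banks - 1) / num_l1_banks;
    std::printf("per-bank footprint: %u bytes\n", pages_per_bank * tile_bytes);  // 2048
    return 0;
}
```

The bank count is static, but num_pages depends on the shape, so the per-bank "stack" footprint is only known at runtime, which is the crux of the question.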

abhullar-tt commented 1 month ago

> How would you know the CB offset before running the program? If we have, say, an interleaved Tensor in L1, then one needs to know the number of pages and the core grid it's on to figure out its size on the "stack". The core grid is static, I assume, for a given device. But the number of pages as a function of shape is a runtime value, no?

The stack-allocated buffers would be tracked by the allocator, so we could query the allocator to figure out how much space has been allocated from L1_UNRESERVED_BASE to the top of the stack-allocated chunk of memory.
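A minimal sketch of the kind of bump ("stack") allocator this implies, assuming it grows upward from L1_UNRESERVED_BASE; this is hypothetical code, not the actual tt-metal allocator:

```cpp
#include <cassert>
#include <cstdint>

class L1StackAllocator {
    std::uint32_t base_;  // L1_UNRESERVED_BASE
    std::uint32_t top_;   // one past the highest stack-allocated byte

public:
    explicit L1StackAllocator(std::uint32_t l1_unreserved_base)
        : base_(l1_unreserved_base), top_(l1_unreserved_base) {}

    // Bump upward; intermediates are freed in LIFO order.
    std::uint32_t allocate(std::uint32_t bytes, std::uint32_t align = 32) {
        top_ = (top_ + align - 1) & ~(align - 1);
        std::uint32_t addr = top_;
        top_ += bytes;
        return addr;
    }

    void deallocate_to(std::uint32_t addr) {
        assert(addr >= base_ && addr <= top_);
        top_ = addr;
    }

    // "how much space has been allocated from L1_UNRESERVED_BASE
    //  to the top of the stack-allocated chunk of memory"
    std::uint32_t stack_bytes_in_use() const { return top_ - base_; }
};
```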

davorchap commented 1 month ago

This would introduce a classification of tensors into heap-allocated tensors and stack-allocated (intermediate) tensors, and would also introduce distinct memory regions: heap-allocated tensors, circular buffers, and stack-allocated tensors (per the layout above).

This is a less general feature than the memory pool. What breaks nested composites when they are combined with the memory pool?

yan-zaretskiy commented 1 month ago

> The stack-allocated buffers would be tracked by the allocator, so we could query the allocator to figure out how much space has been allocated from L1_UNRESERVED_BASE to the top of the stack-allocated chunk of memory.

That part is clear; I am still missing how this can be done at compile time of the program. We don't allocate tensors at compile time, or is my understanding wrong?

abhullar-tt commented 1 month ago

> > The stack-allocated buffers would be tracked by the allocator, so we could query the allocator to figure out how much space has been allocated from L1_UNRESERVED_BASE to the top of the stack-allocated chunk of memory.
>
> That part is clear; I am still missing how this can be done at compile time of the program. We don't allocate tensors at compile time, or is my understanding wrong?

It would have to be done at runtime.

yan-zaretskiy commented 1 month ago

> It would have to be done at runtime.

Right, doesn't that contradict your post above, specifically this?

> These addresses would need to be offset by the amount of mem taken by stack-allocated buffers before a program is run.

abhullar-tt commented 1 month ago

> > It would have to be done at runtime.
>
> Right, doesn't that contradict your post above, specifically this?
>
> > These addresses would need to be offset by the amount of mem taken by stack-allocated buffers before a program is run.

It would have to be done post-compilation but before enqueuing the program (at the same time that we validate that CBs and top-down-allocated L1 buffers don't collide with each other).
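A sketch of that post-compile, pre-enqueue step under the proposal's assumptions; the names are hypothetical, not tt-metal's actual validation code:

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <vector>

struct CircularBufferDesc {
    std::uint32_t address;  // compile-time address, relative to the CB region base
    std::uint32_t size;
};

void finalize_cbs_before_enqueue(std::vector<CircularBufferDesc>& cbs,
                                 std::uint32_t stack_bytes_in_use,        // queried from the allocator
                                 std::uint32_t lowest_heap_buffer_addr) { // lowest top-down L1 allocation
    std::uint32_t cb_end = 0;
    for (auto& cb : cbs) {
        cb.address += stack_bytes_in_use;  // shift CBs past the stack-allocated tensors
        cb_end = std::max(cb_end, cb.address + cb.size);
    }
    if (cb_end > lowest_heap_buffer_addr) {
        throw std::runtime_error("CBs collide with top-down allocated L1 buffers");
    }
}
```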

davorchap commented 1 month ago

I think this is referring to the L1 special case; the memory model is generic to DRAM as well.

abhullar-tt commented 1 month ago

> I think this is referring to the L1 special case; the memory model is generic to DRAM as well.

Allocating on the stack seems to be solving a different problem than the one Moreh is looking to solve.

eyonland commented 1 month ago

My argument is that these would only be used within a composite op, instead of using a predetermined memory pool that has to be allocated up front before a model is run. Using the intermediate tensors will require some rewriting of the composite ops, but this will end up being better for us, as they could reuse the intermediate tensors as optional output tensors in the composed op.

yan-zaretskiy commented 4 weeks ago

My hunch is that this is a somewhat wrong way to think about the problem. Placing tensors should be the work of an allocator, so we'd need to be able to provide an allocator instance to a tensor (HeapAllocator/StackAllocator) to control its memory placement. I don't think the placement should depend on the kind of op we run (composite vs atomic).
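A minimal sketch of that suggestion, with hypothetical interfaces rather than tt-metal's actual Tensor/allocator API:

```cpp
#include <cstdint>

// Common interface: a tensor's buffer is placed wherever its allocator says.
struct L1Allocator {
    virtual ~L1Allocator() = default;
    virtual std::uint32_t allocate(std::uint32_t bytes) = 0;
    virtual void deallocate(std::uint32_t addr) = 0;
};

// Top-down placement for long-lived tensors (roughly the existing behavior).
struct HeapAllocator final : L1Allocator {
    std::uint32_t top;
    explicit HeapAllocator(std::uint32_t l1_top) : top(l1_top) {}
    std::uint32_t allocate(std::uint32_t bytes) override { top -= bytes; return top; }
    void deallocate(std::uint32_t) override { /* free-list bookkeeping elided */ }
};

// Bottom-up LIFO placement for short-lived intermediates.
struct StackAllocator final : L1Allocator {
    std::uint32_t top;
    explicit StackAllocator(std::uint32_t base) : top(base) {}
    std::uint32_t allocate(std::uint32_t bytes) override {
        std::uint32_t addr = top;
        top += bytes;
        return addr;
    }
    void deallocate(std::uint32_t addr) override { top = addr; }
};

// A tensor factory would then take the allocator as a parameter, e.g.:
//   Tensor create_tensor(const Shape& shape, L1Allocator& alloc);
```

With this shape, whether an intermediate lives in the heap region or the stack region is the caller's choice per tensor, not a property of composite vs atomic ops.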

More broadly, I think this issue needs more experimental investigation: we should track the allocations of some real-world composite-of-composite ops and see what sort of fragmentation we actually observe before committing to work on additional allocators.