tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Create an intermediate tensor on the stack #10325

Open eyonland opened 1 month ago

eyonland commented 1 month ago

Problem: To handle allocations of intermediate tensors within composite operations without causing fragmentation, we should use a stack for these allocations. Allocating a memory pool from the heap does not compose well when composite functions are built from other composite functions.

When a compiler invokes an operation, it often needs to preallocate the output tensors before calling it, which lets the compiler manage allocations on the L1 heap efficiently. However, if the operation then makes new allocations on both the heap and the stack (due to circular buffers), the compiler cannot assume a single contiguous memory requirement for the operation given a specific set of arguments. Allocating intermediates on a stack makes the memory requirement predictable and contiguous, which allows composite functions to be composed without fragmentation.

Proposed Solution:

To simplify the requirements for intermediate tensors, we propose that these allocations happen the way one would expect for a traditional function that needs memory on the stack. Specifically, intermediate tensors belong to the stack allocation region, alongside circular buffers. This design decision would let a compiler request the memory requirement of an operation, for a given set of arguments, as a single number of bytes needed from L1. It would also let us manage intermediate tensors more gracefully, independently from the standard tensors used during model execution. Visualized here: https://excalidraw.com/#json=iNMQYSMrab3bxza0QFNp-,qwn7esoMB4rNjiXCJFke9g

The function create_scalar in ttnn/cpp/ttnn/operations/creation.hpp currently allocates memory from the heap. The intent here is to update the code to allocate from the stack instead, and to update the offsets of the circular buffers to account for this tensor allocation when an operation is called.

TOP OF L1
| heap allocated tensors  |
| circular buffers        |
| stack allocated tensors |
L1 UNRESERVED BASE
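To make the "single number of bytes" idea concrete, here is a minimal sketch of how a composite op could report its L1 stack requirement compositionally. All names, the CB accounting, and the softplus decomposition are hypothetical illustrations, not existing TT-NN/tt-metal APIs:

```cpp
#include <algorithm>
#include <cstdint>

struct UnaryArgs {
    std::uint32_t tile_bytes;          // bytes per tile page
    std::uint32_t num_tiles_per_core;  // tiles each core must hold
};

// Leaf ops: only circular buffers, no intermediates.
std::uint32_t exp_l1_stack_bytes(const UnaryArgs& a) {
    return 2 * a.tile_bytes;  // one input CB + one output CB, one tile each
}
std::uint32_t log_l1_stack_bytes(const UnaryArgs& a) {
    return 2 * a.tile_bytes;
}

// Composite op, e.g. softplus(x) = log(1 + exp(x)): its intermediate tensor
// (holding exp(x)) lives on the stack while the child ops run above it, so
// the total requirement is still expressible as a single byte count.
std::uint32_t softplus_l1_stack_bytes(const UnaryArgs& a) {
    std::uint32_t intermediate = a.num_tiles_per_core * a.tile_bytes;
    return intermediate + std::max(exp_l1_stack_bytes(a), log_l1_stack_bytes(a));
}
```

Because stack allocations are contiguous and LIFO, a parent composite's requirement is its own live intermediates plus the peak over its children, which is what makes nesting composable.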

abhullar-tt commented 1 month ago

Changes needed to allow allocations on the stack and have top of L1 be contiguous:

yan-zaretskiy commented 1 month ago

How would you know the CB offset before running the program? If we have, say, an interleaved Tensor in L1, then one needs to know the number of pages and the core grid it's on to figure out its size on the "stack". The core grid is static, I assume, for a given device. But the number of pages as a function of shape is a runtime value, no?
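For illustration, the arithmetic behind this question might look like the following sketch; the tile size, shape, and bank count are assumed values for a worked example, not numbers queried from a real device:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const std::uint32_t tile_bytes = 32 * 32 * 2;          // one bfloat16 tile page = 2048 B

    const std::uint32_t H = 256, W = 256;                  // runtime shape
    const std::uint32_t num_pages = (H / 32) * (W / 32);   // 64 tile pages
    const std::uint32_t num_l1_banks = 64;                 // static for a given device

    // Interleaving round-robins pages across banks; per-bank usage rounds up.
    const std::uint32_t pages_per_bank = (num_pages + num_l1_banks - 1) / num_l1_banks;
    std::printf("per-bank footprint: %u bytes\n", pages_per_bank * tile_bytes);  // 2048
    return 0;
}
```

The bank count is static, but num_pages depends on the shape, so the per-bank "stack" footprint is only known at runtime, which is the crux of the question.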

abhullar-tt commented 1 month ago

> How would you know the CB offset before running the program? If we have, say, an interleaved Tensor in L1, then one needs to know the number of pages and the core grid it's on to figure out its size on the "stack". The core grid is static, I assume, for a given device. But the number of pages as a function of shape is a runtime value, no?

The stack-allocated buffers would be tracked by the allocator, so we could query the allocator to figure out how much space has been allocated from L1_UNRESERVED_BASE to the top of the stack-allocated chunk of memory.
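A minimal sketch of the kind of bump ("stack") allocator this implies, assuming it grows upward from L1_UNRESERVED_BASE; this is hypothetical code, not the actual tt-metal allocator:

```cpp
#include <cassert>
#include <cstdint>

class L1StackAllocator {
    std::uint32_t base_;  // L1_UNRESERVED_BASE
    std::uint32_t top_;   // one past the highest stack-allocated byte

public:
    explicit L1StackAllocator(std::uint32_t l1_unreserved_base)
        : base_(l1_unreserved_base), top_(l1_unreserved_base) {}

    // Bump upward; intermediates are freed in LIFO order.
    std::uint32_t allocate(std::uint32_t bytes, std::uint32_t align = 32) {
        top_ = (top_ + align - 1) & ~(align - 1);
        std::uint32_t addr = top_;
        top_ += bytes;
        return addr;
    }

    void deallocate_to(std::uint32_t addr) {
        assert(addr >= base_ && addr <= top_);
        top_ = addr;
    }

    // "how much space has been allocated from L1_UNRESERVED_BASE
    //  to the top of the stack-allocated chunk of memory"
    std::uint32_t stack_bytes_in_use() const { return top_ - base_; }
};
```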

davorchap commented 1 month ago

This would introduce a classification of tensors into heap-allocated tensors and stack-allocated (intermediate) tensors, and would also introduce distinct memory regions: heap-allocated tensors, circular buffers, and stack-allocated tensors (per the layout above).

This is a less general feature than the memory pool. What breaks nested composites when they are combined with the memory pool?

yan-zaretskiy commented 1 month ago

> The stack-allocated buffers would be tracked by the allocator, so we could query the allocator to figure out how much space has been allocated from L1_UNRESERVED_BASE to the top of the stack-allocated chunk of memory.

That part is clear; I am still missing how this can be done at compile time of the program. We don't allocate tensors at compile time, or is my understanding wrong?

abhullar-tt commented 1 month ago

> > The stack-allocated buffers would be tracked by the allocator, so we could query the allocator to figure out how much space has been allocated from L1_UNRESERVED_BASE to the top of the stack-allocated chunk of memory.
>
> That part is clear; I am still missing how this can be done at compile time of the program. We don't allocate tensors at compile time, or is my understanding wrong?

It would have to be done at runtime.

yan-zaretskiy commented 1 month ago

> It would have to be done at runtime.

Right, doesn't that contradict your post above, specifically this?

> These addresses would need to be offset by the amount of mem taken by stack-allocated buffers before a program is run.

abhullar-tt commented 1 month ago

> > It would have to be done at runtime.
>
> Right, doesn't that contradict your post above, specifically this?
>
> > These addresses would need to be offset by the amount of mem taken by stack-allocated buffers before a program is run.

It would have to be done post-compilation but before enqueuing the program (at the same time that we validate that CBs and top-down-allocated L1 buffers don't collide with each other).
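A sketch of that post-compile, pre-enqueue step under the proposal's assumptions; the names are hypothetical, not tt-metal's actual validation code:

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <vector>

struct CircularBufferDesc {
    std::uint32_t address;  // compile-time address, relative to the CB region base
    std::uint32_t size;
};

void finalize_cbs_before_enqueue(std::vector<CircularBufferDesc>& cbs,
                                 std::uint32_t stack_bytes_in_use,        // queried from the allocator
                                 std::uint32_t lowest_heap_buffer_addr) { // lowest top-down L1 allocation
    std::uint32_t cb_end = 0;
    for (auto& cb : cbs) {
        cb.address += stack_bytes_in_use;  // shift CBs past the stack-allocated tensors
        cb_end = std::max(cb_end, cb.address + cb.size);
    }
    if (cb_end > lowest_heap_buffer_addr) {
        throw std::runtime_error("CBs collide with top-down allocated L1 buffers");
    }
}
```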

davorchap commented 1 month ago

I think this is referring to the L1 special case; the memory model is generic to DRAM as well.

abhullar-tt commented 1 month ago

> I think this is referring to the L1 special case; the memory model is generic to DRAM as well.

Allocating on the stack seems to be solving a different problem than the one Moreh is looking to solve.

eyonland commented 1 month ago

My argument is that these would only be used within a composite op, instead of using a predetermined memory pool that has to be allocated up front before a model is run. Using the intermediate tensors will require some rewriting of the composite ops, but this will end up being better for us, as they could reuse the intermediate tensors as optional output tensors in the composed op.

yan-zaretskiy commented 4 weeks ago

My hunch is that this is a somewhat wrong way to think about the problem. Placing tensors should be the work of an allocator, so we'd need to be able to provide an allocator instance to a tensor (HeapAllocator/StackAllocator) to control its memory placement. I don't think the placement should depend on the kind of op we run (composite vs atomic).
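A minimal sketch of that suggestion, with hypothetical interfaces rather than tt-metal's actual Tensor/allocator API:

```cpp
#include <cstdint>

// Common interface: a tensor's buffer is placed wherever its allocator says.
struct L1Allocator {
    virtual ~L1Allocator() = default;
    virtual std::uint32_t allocate(std::uint32_t bytes) = 0;
    virtual void deallocate(std::uint32_t addr) = 0;
};

// Top-down placement for long-lived tensors (roughly the existing behavior).
struct HeapAllocator final : L1Allocator {
    std::uint32_t top;
    explicit HeapAllocator(std::uint32_t l1_top) : top(l1_top) {}
    std::uint32_t allocate(std::uint32_t bytes) override { top -= bytes; return top; }
    void deallocate(std::uint32_t) override { /* free-list bookkeeping elided */ }
};

// Bottom-up LIFO placement for short-lived intermediates.
struct StackAllocator final : L1Allocator {
    std::uint32_t top;
    explicit StackAllocator(std::uint32_t base) : top(base) {}
    std::uint32_t allocate(std::uint32_t bytes) override {
        std::uint32_t addr = top;
        top += bytes;
        return addr;
    }
    void deallocate(std::uint32_t addr) override { top = addr; }
};

// A tensor factory would then take the allocator as a parameter, e.g.:
//   Tensor create_tensor(const Shape& shape, L1Allocator& alloc);
```

With this shape, whether an intermediate lives in the heap region or the stack region is the caller's choice per tensor, not a property of composite vs atomic ops.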

More broadly, I think this issue needs more experimental investigation: we should track the allocations of some real-world composite-of-composite ops and see what sort of fragmentation we actually observe before committing to work on additional allocators.