Open eyonland opened 4 months ago
Changes needed to allow allocations on the stack and have top of L1 be contiguous:
How would you know the CB offset before running the program? If we have, say, an interleaved Tensor in L1, then one needs to know the number of pages and the core grid it's on to figure out its size on the "stack". The core grid is static, I assume, for a given device. But the number of pages as a function of shape is a runtime value, no?
> How would you know the CB offset before running the program? If we have, say, an interleaved Tensor in L1, then one needs to know the number of pages and the core grid it's on to figure out its size on the "stack". The core grid is static, I assume, for a given device. But the number of pages as a function of shape is a runtime value, no?
The stack-allocated buffers would be tracked by the allocator, so we could query the allocator to figure out how much space has been allocated from L1_UNRESERVED_BASE up to the top of the stack-allocated chunk of memory.
This would introduce a classification of tensors into:
And it would also introduce memory regions of:
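A minimal sketch of the allocator-side bookkeeping described above, assuming a bump allocator that grows upward from L1_UNRESERVED_BASE and can be queried for its current top (all names here, `L1StackAllocator`, `kL1UnreservedBase`, `kAlignment`, are hypothetical and not tt-metal API):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Made-up constants for illustration only.
constexpr std::uint32_t kL1UnreservedBase = 0x1000;
constexpr std::uint32_t kAlignment = 32;

class L1StackAllocator {
public:
    // Returns the address of the new allocation and bumps the top.
    std::uint32_t allocate(std::uint32_t size_bytes) {
        std::uint32_t addr = top_;
        top_ += align_up(size_bytes);
        sizes_.push_back(size_bytes);
        return addr;
    }

    // Pops the most recent allocation (LIFO, like a call stack).
    void deallocate_top() {
        top_ -= align_up(sizes_.back());
        sizes_.pop_back();
    }

    // The query described in the comment: how much L1 is consumed from
    // L1_UNRESERVED_BASE up to the top of the stack-allocated region.
    std::uint32_t stack_top() const { return top_; }
    std::uint32_t bytes_in_use() const { return top_ - kL1UnreservedBase; }

private:
    static std::uint32_t align_up(std::uint32_t n) {
        return (n + kAlignment - 1) / kAlignment * kAlignment;
    }
    std::uint32_t top_ = kL1UnreservedBase;
    std::vector<std::uint32_t> sizes_;
};
```

Circular buffer base offsets for a program would then start at whatever `stack_top()` reports at enqueue time.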
This is a less general feature than the memory pool. What makes nested composites not work in combination with the memory pool?
> The stack-allocated buffers would be tracked by the allocator, so we could query the allocator to figure out how much space has been allocated from L1_UNRESERVED_BASE up to the top of the stack-allocated chunk of memory.
That part is clear; I am still missing how this can be done at compile time of the program. We don't allocate tensors at compile time, or is my understanding wrong?
It would have to be done at runtime.
> It would have to be done at runtime.
Right, doesn't that contradict your post above, specifically this?
> These addresses would need to be offset by the amount of mem taken by stack allocated buffers before a program is run
It would have to be done post-compilation but before enqueuing the program (at the same time we validate that CBs and top-down allocated L1 buffers don't crash into each other).
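A sketch of what that pre-enqueue step could look like, assuming stack allocations grow up from the bottom of L1 and heap (buffer) allocations grow down from the top (the `L1Snapshot` struct and function name are hypothetical, not the actual tt-metal validation code):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative snapshot of L1 state at enqueue time.
struct L1Snapshot {
    std::uint32_t stack_top;         // top of stack-allocated buffers (grows up)
    std::uint32_t cb_region_size;    // total circular-buffer bytes the program needs
    std::uint32_t lowest_heap_addr;  // lowest address among top-down allocated buffers
};

// Returns the CB base address to patch into the compiled program,
// or 0 if the CB region would crash into the top-down allocated buffers.
std::uint32_t offset_and_validate_cbs(const L1Snapshot& s) {
    std::uint32_t cb_base = s.stack_top;                // CBs sit above the stack
    std::uint32_t cb_end = cb_base + s.cb_region_size;  // one past the CB region
    if (cb_end > s.lowest_heap_addr) {
        return 0;  // collision: not enough room between stack and heap regions
    }
    return cb_base;
}
```

The point is that the CB addresses baked in at compile time only need a single runtime offset (the current stack top) plus one bounds check before enqueue.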
I think this is referring to the L1 special case; the memory model is generic to DRAM as well.
Allocating on the stack seems to be solving a different problem than what Moreh is looking for.
My argument is that these would only be used within a composite op, instead of using a predetermined memory pool that has to be allocated upfront before a model is run. Using the intermediate tensors will require some rewriting of the composite ops, but this will end up being better for us, as they could reuse the intermediate tensors as optional output tensors in the composed op.
My hunch is that this is a somewhat wrong way to think about the problem. Placing tensors should be the work of an allocator, so we'd need to be able to provide an allocator instance to a tensor (HeapAllocator/StackAllocator) to control its memory placement. I don't think the placement should depend on the kind of op we run (composite vs. atomic).
More broadly, I think this issue needs more experimental investigation, in the sense that we should track the allocations of some real-world composite-of-composite ops and see what sort of fragmentation we observe, before we commit to working on additional allocators.
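For the experimental investigation suggested above, one purely illustrative way to score a captured allocation trace is the classic "1 minus largest free block over total free bytes" metric (the trace format and function below are assumptions, not an existing tt-metal tool):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Fragmentation of a fixed memory region given live allocations as
// sorted-or-unsorted [begin, end) ranges. 0.0 means all free space is
// one contiguous block; values near 1.0 mean free space is badly split.
double fragmentation(std::uint32_t region_begin, std::uint32_t region_end,
                     std::vector<std::pair<std::uint32_t, std::uint32_t>> allocs) {
    std::sort(allocs.begin(), allocs.end());
    std::uint32_t total_free = 0, largest_free = 0, cursor = region_begin;
    for (const auto& [b, e] : allocs) {
        std::uint32_t gap = b - cursor;  // free gap before this allocation
        total_free += gap;
        largest_free = std::max(largest_free, gap);
        cursor = e;
    }
    std::uint32_t tail = region_end - cursor;  // free space after the last allocation
    total_free += tail;
    largest_free = std::max(largest_free, tail);
    if (total_free == 0) return 0.0;
    return 1.0 - static_cast<double>(largest_free) / total_free;
}
```

Running this over traces of real composite-of-composite ops would give concrete numbers to argue from before committing to new allocators.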
Problem: To handle the allocations of intermediate tensors within composite operations without causing fragmentation, it's crucial to emphasize the use of a stack for memory allocations. Allocating a memory pool from the heap doesn't easily allow for composing composite functions from other composite functions. When a compiler leverages an operation, it often needs to preallocate the output tensors before calling the operation. This allows the compiler to manage memory allocations on the L1 heap efficiently. However, if the operation makes new allocations on both the heap and the stack (due to circular buffers), it complicates the process as the compiler cannot assume a contiguous memory requirement for the operation given a specific set of arguments. Using a stack for these allocations simplifies the memory management by ensuring a predictable and contiguous memory requirement, thus facilitating the composition of composite functions without fragmentation.
Proposed Solution:
To simplify the requirements for intermediate tensors, we propose that allocations happen the same way one would expect for a traditional function that requires memory on the stack. Specifically, intermediate tensors would belong to the stack allocation alongside circular buffers. This design decision would allow us to provide a way for compilers to request the memory requirement as a single number of bytes needed from L1 for a given set of arguments. It would also allow us to manage the intermediate tensors more gracefully, independently from the standard tensors used during model execution. Visualized here: https://excalidraw.com/#json=iNMQYSMrab3bxza0QFNp-,qwn7esoMB4rNjiXCJFke9g
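To illustrate why a stack discipline makes the requirement composable into that single byte count, here is a hedged sketch: with LIFO allocation, a composite's peak L1 need is just the running depth of live intermediates plus each sub-op's own requirement (the `OpStep` struct and function are hypothetical, not proposed API):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One sequential step of a composite op, in bytes.
struct OpStep {
    std::uint32_t l1_requirement;      // bytes the sub-op needs while it runs
    std::uint32_t intermediate_bytes;  // bytes it leaves on the stack afterward
};

// Peak L1 bytes a composite needs; a nested composite just reports
// this number as its own l1_requirement, so composition nests freely.
std::uint32_t composite_l1_requirement(const std::vector<OpStep>& steps) {
    std::uint32_t held = 0;  // intermediates currently live on the stack
    std::uint32_t peak = 0;
    for (const auto& s : steps) {
        peak = std::max(peak, held + s.l1_requirement);
        held += s.intermediate_bytes;  // intermediate stays for later sub-ops
    }
    peak = std::max(peak, held);
    return peak;
}
```

A heap-based pool does not compose this way, because the caller cannot assume the callee's allocations are contiguous with its own.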
The function create_scalar in ttnn/cpp/ttnn/operations/creation.hpp currently allocates memory from the heap. The intent here is to update the code to allocate from the stack instead, and to update the offset of the circular buffers to account for this tensor allocation when an operation is called.
TOP OF L1
| heap allocated tensors  |
| circular buffers        |
| stack allocated tensors |
L1 UNRESERVED BASE