@dvartaniansTT @punithsekar initially I thought it was a sharding issue, but it looks like the op might just be using too much L1. It's currently NOT sharded and is trying to allocate all of it to L1 as a single buffer. Is that intentional?
Hey @bbradelTT, I think @dvartaniansTT @punithsekar intended to use sharded L1. At the ttnn API level they used ttnn.L1_MEMORY_CONFIG, which is an interleaved L1 memory config. I am not sure whether, under the hood, linear uses that to shard (because they also passed a core grid separately).
Could we clean up the API to use a sharded memory config if the intention is to shard, and then fix this OOM issue, assuming it's related to the buffer not being sharded properly?
@tarafdarTT if the user tells us to use a specific memory config for linear, we are not going to go behind their back and use a different one.
The user needs to specify the memory config that they want or not specify one and let linear base the decision on the inputs, although in that case, the inputs need to be set up properly to have the right memory config.
core_grid just overrides the device core grid and can be used to restrict which cores are used. Because of backwards compatibility, it also changes the algorithm for automatically choosing a program config.
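To make that concrete, here is a minimal sketch (the shapes, the 8x8 grid, and the stand-in tensors are assumptions for illustration, not the model's real values): the memory_config passed to ttnn.linear is honored as-is for the output, and core_grid only restricts which cores the op may use.

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# Stand-in input and weight tensors (illustrative shapes only).
input_tensor = ttnn.from_torch(
    torch.randn(1, 32, 64, dtype=torch.bfloat16),
    dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device,
)
weight = ttnn.from_torch(
    torch.randn(64, 128, dtype=torch.bfloat16),
    dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device,
)

# The interleaved L1 config passed here is used for the output as-is;
# providing a core_grid does not silently turn it into a sharded config.
output = ttnn.linear(
    input_tensor,
    weight,
    memory_config=ttnn.L1_MEMORY_CONFIG,
    core_grid=ttnn.CoreGrid(y=8, x=8),
)

ttnn.close_device(device)
```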
I see, so the issue is the incorrect memory config that is allocating too large of an L1 buffer. Sorry about that, I was confused by the interleaved memory config plus the core grid.
@dvartaniansTT @punithsekar please update your memory config to a sharded memory config; refer to tests/ttnn/unit_tests/operations/test_core.py for how to do that.
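Something along these lines (the shape and grid below are placeholders for your actual tensor; check test_core.py for the exact patterns used in-tree):

```python
import ttnn

# Height-shard the tensor across an 8x8 core grid in L1 instead of using the
# interleaved ttnn.L1_MEMORY_CONFIG (placeholder shape: 16384 rows / 64 cores = 256 rows per shard).
sharded_memory_config = ttnn.create_sharded_memory_config(
    shape=(1, 1, 16384, 32),
    core_grid=ttnn.CoreGrid(y=8, x=8),
    strategy=ttnn.ShardStrategy.HEIGHT,
    orientation=ttnn.ShardOrientation.ROW_MAJOR,
)

# Then pass it explicitly, e.g.
# output = ttnn.linear(input_tensor, weight, memory_config=sharded_memory_config)
```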
Also, I don't think it matters whether the tensor is sharded or not.
`33554432 B L1 buffer across 64 banks, where each bank needs to store 524288 B`
indicates that it just won't fit across 64 cores.
You probably need to store the tensor(s) in DRAM.
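For reference, those numbers line up with the padded tile shape shown in the traceback later in this thread (logical [1, 16384, 1, 32] bfloat16, padded to [1, 16384, 32, 32]):

```python
# Back-of-the-envelope check of the allocation in the OOM message.
elements = 1 * 16384 * 32 * 32   # padded tensor volume in tile layout
total_bytes = elements * 2       # bfloat16 = 2 bytes -> 33554432 B (32 MiB)
per_bank = total_bytes // 64     # interleaved across 64 L1 banks -> 524288 B (512 KiB) each
print(total_bytes, per_bank)     # 33554432 524288
```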
@punithsekar @dvartaniansTT
Why is the memory usage so large?
And where is the failure? Is it in the permute?
@punithsekar @dvartaniansTT
I checked, and this does occur in permute.
> output = ttnn.permute(hidden_states, (0, 2, 1, 3))
a.py:28:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = FastOperation(python_fully_qualified_name='ttnn.permute', function=<ttnn._ttnn.operations.data_movement.permute_t obje...<function default_postprocess_golden_function_outputs at 0x7f19b60c3af0>, is_cpp_operation=True, is_experimental=False)
function_args = (ttnn.Tensor([[[[-6.50000, 1.86719, ..., 0.00000, 0.00000],
[ 0.00000, 0.00000, ..., 0.00000, 0....00000, 0.00000]]]], shape=Shape([1, 16384, 1[32], 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE), (0, 2, 1, 3))
function_kwargs = {}
def __call__(self, *function_args, **function_kwargs):
> return self.function(*function_args, **function_kwargs)
E RuntimeError: TT_THROW @ ../tt_metal/impl/allocator/allocator.cpp:141: tt::exception
E info:
E Out of Memory: Not enough space to allocate 33554432 B L1 buffer across 64 banks, where each bank needs to store 524288 B
E backtrace:
You'll need to use DRAM. You might be able to just move the tensor over to DRAM before the permute; I think that's what the commented-out lines do.
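A minimal sketch of that idea (the stand-in tensor mimics the padded shape from the traceback and is an assumption, not the actual test code):

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# Stand-in for the activation from the failing test, placed in interleaved L1.
hidden_states = ttnn.from_torch(
    torch.randn(1, 16384, 1, 32, dtype=torch.bfloat16),
    dtype=ttnn.bfloat16,
    layout=ttnn.TILE_LAYOUT,
    device=device,
    memory_config=ttnn.L1_MEMORY_CONFIG,
)

# Move the tensor to DRAM before the permute so the 32 MiB buffer is not allocated in L1.
hidden_states = ttnn.to_memory_config(hidden_states, ttnn.DRAM_MEMORY_CONFIG)
output = ttnn.permute(hidden_states, (0, 2, 1, 3))

ttnn.close_device(device)
```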
Describe the bug
Facing an OOM issue from permute in the Segformer efficient self-attention pipeline for certain shapes. I have taken a snippet of the sub-module to reproduce the issue.

To Reproduce
Steps to reproduce the behavior:
pytest models/experimental/functional_segformer/segformer_unit_test_1.py

Expected behavior
The test should pass without an OOM issue.
Additional context
If we move the tensor off the device and then put it back on the device, we do not face this issue. The test passes without any issue if we uncomment lines 22 & 26 in segformer_unit_test_1.py (a sketch of that workaround is below).
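A hedged sketch of the workaround (the actual contents of lines 22 & 26 of segformer_unit_test_1.py are not reproduced here; this only illustrates the off-device round trip described above, with a stand-in tensor):

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# Stand-in for the activation from the sub-module snippet.
hidden_states = ttnn.from_torch(
    torch.randn(1, 16384, 1, 32, dtype=torch.bfloat16),
    dtype=ttnn.bfloat16,
    layout=ttnn.TILE_LAYOUT,
    device=device,
)

hidden_states = ttnn.from_device(hidden_states)        # take the tensor off the device
hidden_states = ttnn.to_device(hidden_states, device)  # put it back on the device (DRAM interleaved by default)
output = ttnn.permute(hidden_states, (0, 2, 1, 3))

ttnn.close_device(device)
```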
CC: @saichandax