tenstorrent / tt-metal

Runtime Error: OOM issue in segformer efficient self-attention pipeline #11695

Closed: punithsekar closed this issue 2 months ago

punithsekar commented 2 months ago

Describe the bug: We are facing an OOM issue from permute in the segformer efficient self-attention pipeline for certain shapes. I have taken a snippet of the sub-module to reproduce the issue.

To Reproduce: Steps to reproduce the behavior:

  1. Check out the branch punith/segformer_issues
  2. Run the command: pytest models/experimental/functional_segformer/segformer_unit_test_1.py

Expected behavior: The test should pass without an OOM issue.

Screenshots

E       RuntimeError: TT_THROW @ ../tt_metal/impl/allocator/allocator.cpp:141: tt::exception
E       info:
E       Out of Memory: Not enough space to allocate 33554432 B L1 buffer across 64 banks, where each bank needs to store 524288 B

Please complete the following environment information:

Additional context: If we move the tensor off the device and then put it back on the device, we do not face this issue.

The test passes without any issue if we uncomment lines 22 & 26 in segformer_unit_test_1.py.
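The described round trip would look roughly like the sketch below (illustrative only; the actual lines 22 & 26 of segformer_unit_test_1.py are not reproduced in this issue, and if I recall correctly ttnn.to_device places tensors in DRAM by default):

```python
# Illustrative sketch, not the real test code: round-trip the tensor through
# host before the failing op so its device buffer is re-allocated (DRAM by
# default, assuming ttnn.to_device's usual default memory config).
hidden_states = ttnn.from_device(hidden_states)        # move the tensor to host
hidden_states = ttnn.to_device(hidden_states, device)  # place it back on device
```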

CC: @saichandax

ntarafdar commented 2 months ago

@dvartaniansTT @punithsekar Initially I thought it was a sharding issue, but it looks like the op might just be using too much L1. It's currently NOT sharded and is trying to allocate all of it as a single L1 buffer. Is that intentional?

ntarafdar commented 2 months ago

Hey @bbradelTT, I think @dvartaniansTT @punithsekar intended to use sharded L1. At the ttnn API level they used ttnn.L1_MEMORY_CONFIG, which is an interleaved L1 memory config. I am not sure if, under the hood, linear uses that to shard (because they also gave it a core grid separately).

Could we clean up the API to use a sharded memory config if the intention is to shard? And then fix this OOM issue, assuming it's related to the buffer not being sharded properly.

bbradelTT commented 2 months ago

@tarafdarTT if the user tells us to use a specific memory config for linear, we are not going to go behind their back and use a different one.

The user needs to either specify the memory config that they want, or not specify one and let linear base the decision on the inputs; in that case, though, the inputs need to be set up properly to have the right memory config.

core_grid just overrides the device core grid and can be used to restrict which cores are used. For backwards compatibility, it also changes the algorithm for automatically choosing a program config.
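To make the two knobs concrete, a rough sketch (placeholder tensors, not code from the issue):

```python
# memory_config is taken as-is for the linear output; ttnn.L1_MEMORY_CONFIG is
# an interleaved L1 config, so nothing here is sharded. core_grid only
# restricts which cores are used and influences the auto-chosen program config.
output = ttnn.linear(
    activations,                          # placeholder input tensor
    weights,                              # placeholder weight tensor
    memory_config=ttnn.L1_MEMORY_CONFIG,  # interleaved L1, not sharded
    core_grid=ttnn.CoreGrid(y=8, x=8),    # restrict to an 8x8 grid
)
```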

ntarafdar commented 2 months ago

> @tarafdarTT if the user tells us to use a specific memory config for linear, we are not going to go behind their back and use a different one.
>
> The user needs to either specify the memory config that they want, or not specify one and let linear base the decision on the inputs; in that case, though, the inputs need to be set up properly to have the right memory config.
>
> core_grid just overrides the device core grid and can be used to restrict which cores are used. For backwards compatibility, it also changes the algorithm for automatically choosing a program config.

I see, so the issue is an incorrect memory config that is allocating too large an L1 buffer. Sorry about that, I was confused by the interleaved memory config plus core grid.

@dvartaniansTT @punithsekar please update your memory config to a sharded memory config; refer to tests/ttnn/unit_tests/operations/test_core.py for how to do that.
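For illustration, a sharded config along the lines of what that test exercises might look like this (the shard grid and shape below are assumptions, not values tuned for segformer):

```python
# Height-shard the activation across an 8x8 core grid in L1 instead of using
# one interleaved L1 buffer. All values are illustrative.
sharded_mem_config = ttnn.create_sharded_memory_config(
    shape=(1, 1, 16384, 32),                     # assumed tensor shape to shard
    core_grid=ttnn.CoreGrid(y=8, x=8),           # 64 cores
    strategy=ttnn.ShardStrategy.HEIGHT,
    orientation=ttnn.ShardOrientation.ROW_MAJOR,
)
hidden_states = ttnn.to_memory_config(hidden_states, sharded_mem_config)
```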

bbradelTT commented 2 months ago

Also, I don't think it matters if the tensor is sharded or not.

> 33554432 B L1 buffer across 64 banks, where each bank needs to store 524288 B

indicates that it just won't fit in 64 cores.

You probably need to store the tensor(s) in DRAM.

ntarafdar commented 2 months ago

> Also, I don't think it matters if the tensor is sharded or not.
>
> 33554432 B L1 buffer across 64 banks, where each bank needs to store 524288 B
>
> indicates that it just won't fit in 64 cores.
>
> You probably need to store the tensor(s) in DRAM.

@punithsekar @dvartaniansTT

bbradelTT commented 2 months ago

@punithsekar @dvartaniansTT

Why is the memory usage so large?

And where is the failure? Is it in the permute?

bbradelTT commented 2 months ago

@punithsekar @dvartaniansTT

I checked, and this does occur in permute.

>       output = ttnn.permute(hidden_states, (0, 2, 1, 3))

a.py:28: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = FastOperation(python_fully_qualified_name='ttnn.permute', function=<ttnn._ttnn.operations.data_movement.permute_t obje...<function default_postprocess_golden_function_outputs at 0x7f19b60c3af0>, is_cpp_operation=True, is_experimental=False)
function_args = (ttnn.Tensor([[[[-6.50000,  1.86719,  ...,  0.00000,  0.00000],
               [ 0.00000,  0.00000,  ...,  0.00000,  0....00000,  0.00000]]]], shape=Shape([1, 16384, 1[32], 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE), (0, 2, 1, 3))
function_kwargs = {}

    def __call__(self, *function_args, **function_kwargs):
>       return self.function(*function_args, **function_kwargs)
E       RuntimeError: TT_THROW @ ../tt_metal/impl/allocator/allocator.cpp:141: tt::exception
E       info:
E       Out of Memory: Not enough space to allocate 33554432 B L1 buffer across 64 banks, where each bank needs to store 524288 B
E       backtrace:

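For reference, the numbers line up with the padded shape in the traceback, assuming 2 bytes per BFLOAT16 element:

```python
# Quick size check for the padded shape (1, 16384, 32, 32) in BFLOAT16.
elements = 1 * 16384 * 32 * 32   # 16,777,216 elements
total_bytes = elements * 2       # 33,554,432 B, matching the error message
per_bank = total_bytes // 64     # 524,288 B per L1 bank, also matching
```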
You'll need to use DRAM. You might be able to just move the tensor over to DRAM before permute.

I think that's what the commented-out lines do.
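A minimal sketch of that (assuming a standard interleaved DRAM config; the variable names follow the traceback, not the actual test file):

```python
# Sketch only: re-place the activation in interleaved DRAM before the permute
# that was previously running out of L1.
hidden_states = ttnn.to_memory_config(hidden_states, ttnn.DRAM_MEMORY_CONFIG)
output = ttnn.permute(hidden_states, (0, 2, 1, 3))
```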