tenstorrent / tt-metal

TT-NN operator library, and TT-Metalium low-level kernel programming model.

repeat_interleave falls back to cpu #6516

Closed aguhaTT closed 5 months ago

aguhaTT commented 6 months ago

repeat_interleave is one of our biggest ops in Mamba. We need it to work on device.

Use cases:

1. repeat_interleave((1,1,32,5120),32,dim=3)
2. repeat_interleave((1,1,1,5120),32,dim=2)
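
For reference, the two use cases in plain PyTorch (this is the torch op, not the ttnn fallback):

    import torch

    # Use case 1: (1, 1, 32, 5120), 32 repeats along dim 3 -> (1, 1, 32, 163840)
    x1 = torch.randn(1, 1, 32, 5120, dtype=torch.bfloat16)
    assert torch.repeat_interleave(x1, 32, dim=3).shape == (1, 1, 32, 163840)

    # Use case 2: (1, 1, 1, 5120), 32 repeats along dim 2 -> (1, 1, 32, 5120)
    x2 = torch.randn(1, 1, 1, 5120, dtype=torch.bfloat16)
    assert torch.repeat_interleave(x2, 32, dim=2).shape == (1, 1, 32, 5120)
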
jliangTT commented 6 months ago

Assigning to Naif to take a look.

@aguhaTT, how does this compare to #5428 in terms of performance ROI?

ntarafdar commented 6 months ago

@aguhaTT, we have support for upsample. Could we use that instead?

aguhaTT commented 6 months ago

> @aguhaTT, we have support for upsample. Could we use that instead?

Will try it and report back today.

aguhaTT commented 6 months ago

upsample only implements row-major layout. Our model is in tiled layout.

ntarafdar commented 6 months ago

Ah, I see. Is that the only limitation? So if we were to add tiled support to upsample, would that be sufficient? I'm trying to gauge between adding tiled support to upsample and adding a new repeat_interleave op from scratch.

aguhaTT commented 6 months ago

I am getting the following functionality error with upsample: delta_t3_0 = ttnn.upsample(delta_t2_1, (1,1,1,32)) fails with assert scale_c == 1, "scale_c should be 1".

ntarafdar commented 6 months ago

And this is after you converted your input to row-major? I can have a look at it.


aguhaTT commented 6 months ago

yes

ntarafdar commented 6 months ago

@mywoodstock, I think you worked on upsample before. We wanted to use upsample as a way to do repeat_interleave, and @aguhaTT ran into the bug above. Was wondering if it is an obvious fix? If not, I can have an in-depth look at it myself.

mywoodstock commented 6 months ago

upsample only supports scaling on H and W, not C. Also, it expects the input tensor in NHWC format. See: https://github.com/tenstorrent-metal/tt-metal/blob/main/ttnn/ttnn/operations/others.py#L357-L389

This was the original issue for upsample: https://github.com/tenstorrent-metal/tt-metal/issues/4487
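
For intuition, nearest-neighbor upsampling by an integer factor along H or W is exactly a repeat_interleave along that dim, which is why upsample is a candidate here; a minimal torch-only sketch (torch's F.interpolate, not ttnn.upsample):

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 4, 1, 8)                  # NCHW, as F.interpolate expects
    up = F.interpolate(x, scale_factor=(32, 1), mode="nearest")
    ref = torch.repeat_interleave(x, 32, dim=2)  # repeat each H row 32x
    assert torch.equal(up, ref)
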

ntarafdar commented 6 months ago

@aguhaTT @mywoodstock correct me if I'm wrong:

1. repeat_interleave((1,1,32,5120),32,dim=3) is a scale on W
2. repeat_interleave((1,1,1,5120),32,dim=2) is a scale on H

And if I'm correct then this should be sufficient.

aguhaTT commented 6 months ago

> @aguhaTT @mywoodstock correct me if I'm wrong:
>
> 1. repeat_interleave((1,1,32,5120),32,dim=3) is a scale on W
> 2. repeat_interleave((1,1,1,5120),32,dim=2) is a scale on H
>
> And if I'm correct then this should be sufficient.

Yes, that is my understanding too. I am not sure why I got the error.

ntarafdar commented 6 months ago

Could you point us to the Python code showing how you're calling it?

aguhaTT commented 6 months ago

I will post a small test script.

aguhaTT commented 6 months ago

Here is the test script to reproduce the error. Uploading it as a zip file because GitHub does not allow uploading a Python file. test_upsample.py.zip

mywoodstock commented 6 months ago

> upsample only supports scaling on H and W, not C. Also, it expects the input tensor in NHWC format. See: https://github.com/tenstorrent-metal/tt-metal/blob/main/ttnn/ttnn/operations/others.py#L357-L389
>
> This was the original issue for upsample: #4487

@aguhaTT Please see my comment above. Upsample needs the tensor to be in NHWC and only supports scaling on H and W. In the test you are scaling C.

aguhaTT commented 6 months ago

I will try this and see how it affects the perf. It's not ideal for us to use row-major or dim=1.

mywoodstock commented 6 months ago

Yeah, upsample does not support TILE. If that is necessary, then it needs to be implemented. Also, using NHWC is just a transpose; why is it not usable?
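
For concreteness, the NHWC conversion in question is a single permute; a minimal torch sketch, assuming the (1, 1, 1, 5120) activation is NCHW with C = 1:

    import torch

    x = torch.randn(1, 1, 1, 5120)   # assumed NCHW: N=1, C=1, H=1, W=5120
    x_nhwc = x.permute(0, 2, 3, 1)   # NHWC: (1, 1, 5120, 1)
    assert x_nhwc.shape == (1, 1, 5120, 1)
    # With C = 1 innermost, a ROW_MAJOR device tensor would need C padded
    # (to 16, per the discussion below), inflating the data volume.
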

jliangTT commented 6 months ago

Assigning to @mywoodstock to drive the upsample path. If it needs to involve the TM team (@tarafdarTT / @yan-zaretskiy), please assign it back to me right away.

mywoodstock commented 6 months ago

Also, repeat_interleave exists in ttlib as a composite op. It might not have been uplifted to TTNN. https://github.com/tenstorrent-metal/tt-metal/blob/main/tt_eager/tt_dnn/op_library/composite/composite_ops.cpp#L754
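
For intuition, that composite decomposition amounts to broadcasting k copies along a new inner axis and flattening; a minimal torch sketch of the idea (illustrative only, not the ttlib implementation, which builds it from transposes and concat):

    import torch

    def repeat_interleave_via_expand(x, k, dim):
        # Insert a length-1 axis after `dim`, broadcast it to k copies,
        # then flatten the (n, k) pair back into one axis of size n * k.
        x = x.unsqueeze(dim + 1)
        x = x.expand(*x.shape[:dim + 1], k, *x.shape[dim + 2:])
        return x.reshape(*x.shape[:dim], -1, *x.shape[dim + 2:])

    t = torch.randn(1, 1, 32, 5120)
    assert torch.equal(repeat_interleave_via_expand(t, 32, dim=3),
                       torch.repeat_interleave(t, 32, dim=3))
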

aguhaTT commented 6 months ago

> Yeah, upsample does not support TILE. If that is necessary, then it needs to be implemented. Also, using NHWC is just a transpose; why is it not usable?

Yes, it is usable, but C=1 in our case, so padding will be involved and will have perf implications.

mywoodstock commented 6 months ago

> Yes, it is usable, but C=1 in our case, so padding will be involved and will have perf implications.

I see. With ROW_MAJOR you don't normally need padding, but since C=1, I think the minimum padding you would need in this case is to 16.

jliangTT commented 6 months ago

@aguhaTT , can you quantify the performance impact of using row_major + minimum padding to 16?

aguhaTT commented 6 months ago

@jliangTT I had several issues. So I can only give you this info tomorrow.

jliangTT commented 6 months ago

Yeah, feel free to pull @mywoodstock into a Slack chat thread or a quick call.

mywoodstock commented 6 months ago

@aguhaTT what issues did you get? I see that ttnn.repeat_interleave also exists; can you use that?

aguhaTT commented 6 months ago

Here is my test case for assessing accuracy as well as performance of upsample and repeat_interleave. upsample and the repeat_interleave fallback give correct results, but have unacceptable latencies of 14 ms and 4 ms respectively. repeat_interleave with transposes (designed to run on device) gave incorrect results. Archive.zip
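
For reference, a rough host-side timing sketch for measuring such latencies (illustrative only; it assumes ttnn.to_torch blocks until the op completes, so it measures dispatch plus readback):

    import time

    import torch
    import ttnn

    device = ttnn.open_device(device_id=0)
    t = torch.randn((1, 1, 32, 5120), dtype=torch.bfloat16)
    x = ttnn.from_torch(t, device=device, layout=ttnn.TILE_LAYOUT)

    start = time.perf_counter()
    y = ttnn.repeat_interleave(x, 32, dim=3)
    _ = ttnn.to_torch(y)  # readback forces completion before stopping the clock
    print(f"repeat_interleave latency: {(time.perf_counter() - start) * 1e3:.2f} ms")

    ttnn.close_device(device)
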

mywoodstock commented 6 months ago

Please see the following. Both repeat_interleave options give correct results:

    t = torch.randn((1, 1, 32, 5120), dtype=torch.bfloat16)

    ## golden
    torch_result = torch.repeat_interleave(t, 32, dim=3)

    ## option 1: move dim 3 to dim 0, interleave on dim 0, permute back
    delta_orig = ttnn.from_torch(t, device=device, layout=ttnn.TILE_LAYOUT, memory_config=ttnn.L1_MEMORY_CONFIG)
    delta_orig = ttnn.permute(delta_orig, (3, 0, 1, 2))
    repeat_interleaved_output = ttnn.repeat_interleave(delta_orig, 32, dim=0)
    repeat_interleaved_output = ttnn.permute(repeat_interleaved_output, (1, 2, 3, 0))
    repeat_interleaved_output = ttnn.to_torch(repeat_interleaved_output)
    assert torch.allclose(torch_result, repeat_interleaved_output)

    ## option 2: move dim 3 to dim 1, interleave on dim 1, permute back
    delta_orig = ttnn.from_torch(t, device=device, layout=ttnn.TILE_LAYOUT, memory_config=ttnn.L1_MEMORY_CONFIG)
    delta_orig = ttnn.permute(delta_orig, (0, 3, 1, 2))
    repeat_interleaved_output = ttnn.repeat_interleave(delta_orig, 32, dim=1)
    repeat_interleaved_output = ttnn.permute(repeat_interleaved_output, (0, 2, 3, 1))
    repeat_interleaved_output = ttnn.to_torch(repeat_interleaved_output)
    assert torch.allclose(torch_result, repeat_interleaved_output)

    ## fallback: interleave directly on dim 3
    delta_orig = ttnn.from_torch(t, device=device, layout=ttnn.TILE_LAYOUT, memory_config=ttnn.L1_MEMORY_CONFIG)
    delta_repeat = ttnn.repeat_interleave(delta_orig, 32, dim=3)
    delta_repeat_out = ttnn.to_torch(delta_repeat)
    print(delta_repeat_out.shape)
    assert torch.allclose(torch_result, delta_repeat_out)

aguhaTT commented 6 months ago

Unfortunately, I am getting an assert failure at assert torch.allclose(torch_result, repeat_interleaved_output). I downloaded and built Metal about an hour ago, so I think I have the latest version. I am not sure what is going on.

Traceback (most recent call last):
  File "models/experimental/mamba/tests/test_upsample.py", line 87, in <module>
    main()
  File "models/experimental/mamba/tests/test_upsample.py", line 83, in main
    f()
  File "models/experimental/mamba/tests/test_upsample.py", line 20, in f
    assert torch.allclose(torch_result, repeat_interleaved_output)
AssertionError
                 Device | INFO     | Closing user mode device drivers
                 Always | FATAL    | Process tear down with device 0 still active
terminate called after throwing an instance of 'std::runtime_error'
  what():  TT_THROW @ tt_metal/impl/device/device.cpp:35: tt::exception
info:
Process tear down with device {} still active
0
backtrace:
 --- /lib/x86_64-linux-gnu/libc.so.6(+0x468a7) [0x7f87000a18a7]
 --- /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7f87000a1a60]
 --- /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7f870007f08a]
 --- python(_start+0x2e) [0x5da67e]

Aborted (core dumped)

test_upsample.py.zip
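
Note the FATAL at teardown is a secondary symptom: the AssertionError propagates before the device gets closed. A minimal guard, assuming the usual ttnn open/close pair (main here stands in for the hypothetical test body of the script):

    import ttnn

    device = ttnn.open_device(device_id=0)
    try:
        main(device)  # hypothetical test body; may raise AssertionError on allclose
    finally:
        ttnn.close_device(device)  # prevents "Process tear down with device 0 still active"
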

jliangTT commented 6 months ago

> Please see the following. Both repeat_interleave options give correct results: [...]

@mywoodstock, are you able to reproduce the 14 ms and 4 ms latencies that @apalaguha saw?

@aguhaTT, do you mind checking in #tt-metal-developer about your assertion failure?

mywoodstock commented 6 months ago

> Unfortunately, I am getting an assert failure at assert torch.allclose(torch_result, repeat_interleaved_output). [...]

@aguhaTT I tried the exact same code (just copy-pasted) and it works fine. There might be some issue with your dev env? The full op trace from my run:

2024-03-21 22:53:42.661 | INFO     | SiliconDriver   - Detected 1 PCI device : {0}
                 Always | DEBUG    | Initializing firmware
                 Always | DEBUG    | Waiting for firmware init complete
                 Always | DEBUG    | Firmware init complete
                     Op | DEBUG    | Launching Operation: "ttnn.from_torch" (external)
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    arg_1 = None
                     Op | DEBUG    |    arg_2 = None
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "ttnn.to_device" (external)
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    arg_1 = <tt_lib.device.Device object at 0x7f60063e16f0>
                     Op | DEBUG    |    memory_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::Pad" (device)
                     Op | DEBUG    | Composite Parents: {permute}
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    output_tensor_shape = Shape([1, 32, 32, 5120])
                     Op | DEBUG    |    input_tensor_start = Shape([0, 0, 0, 0])
                     Op | DEBUG    |    pad_value = 0
                     Op | DEBUG    |    output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 1, 32, 5120]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::Transpose" (device)
                     Op | DEBUG    | Composite Parents: {permute}
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    dim = TransposeOpDim::WH
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 32, 32, 5120]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::Transpose" (device)
                     Op | DEBUG    | Composite Parents: {permute}
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    dim = TransposeOpDim::HC
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 32, 5120, 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::Transpose" (device)
                     Op | DEBUG    | Composite Parents: {permute}
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    dim = TransposeOpDim::CN
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 5120, 32, 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::UntilizeWithUnpadding" (device)
                     Op | DEBUG    | Composite Parents: {permute}
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    output_tensor_start = Shape([0, 0, 0, 0])
                     Op | DEBUG    |    output_tensor_end = Shape([5119, 0, 0, 31])
                     Op | DEBUG    |    output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 32, 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
                     Op | DEBUG    |
                     Op | DEBUG    | Using fast pack untilize.
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::TilizeWithValPadding" (device)
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    output_tensor_shape = Shape([5120, 1, 32, 32])
                     Op | DEBUG    |    input_tensor_start = Shape([0, 0, 0, 0])
                     Op | DEBUG    |    pad_value = 0
                     Op | DEBUG    |    output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
                     Op | DEBUG    |    output_dtype = DataType::BFLOAT16
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1, 32]),dtype=DataType::BFLOAT16,layout=Layout::ROW_MAJOR)
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::Concat" (device)
                     Op | DEBUG    | Composite Parents: {repeat_interleave}
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    dim = 1
                     Op | DEBUG    |    output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0-31: 32 identical input tensors: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::UntilizeWithUnpadding" (device)
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    output_tensor_start = Shape([0, 0, 0, 0])
                     Op | DEBUG    |    output_tensor_end = Shape([163839, 0, 0, 31])
                     Op | DEBUG    |    output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([163840, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
                     Op | DEBUG    |
                     Op | DEBUG    | Using fast pack untilize.
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::TilizeWithValPadding" (device)
                     Op | DEBUG    | Composite Parents: {permute}
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    output_tensor_shape = Shape([163840, 1, 32, 32])
                     Op | DEBUG    |    input_tensor_start = Shape([0, 0, 0, 0])
                     Op | DEBUG    |    pad_value = 0
                     Op | DEBUG    |    output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
                     Op | DEBUG    |    output_dtype = DataType::BFLOAT16
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([163840, 1, 1, 32]),dtype=DataType::BFLOAT16,layout=Layout::ROW_MAJOR)
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::Transpose" (device)
                     Op | DEBUG    | Composite Parents: {permute}
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    dim = TransposeOpDim::CN
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([163840, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::Transpose" (device)
                     Op | DEBUG    | Composite Parents: {permute}
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    dim = TransposeOpDim::HC
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 163840, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::Transpose" (device)
                     Op | DEBUG    | Composite Parents: {permute}
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    dim = TransposeOpDim::WH
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 32, 163809[163840], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::Unpad" (device)
                     Op | DEBUG    | Composite Parents: {permute}
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    output_tensor_start = Shape([0, 0, 0, 0])
                     Op | DEBUG    |    output_tensor_end = Shape([0, 0, 31, 163839])
                     Op | DEBUG    |    output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 32, 1[32], 163840]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "ttnn.from_device" (external)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "ttnn.to_layout" (external)
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    arg_1 = Layout.ROW_MAJOR
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "ttnn.to_torch" (external)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "(torch) allclose" (external)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "ttnn.from_torch" (external)
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    arg_1 = None
                     Op | DEBUG    |    arg_2 = None
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "ttnn.to_device" (external)
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    arg_1 = <tt_lib.device.Device object at 0x7f60063e16f0>
                     Op | DEBUG    |    memory_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::Pad" (device)
                     Op | DEBUG    | Composite Parents: {permute}
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    output_tensor_shape = Shape([1, 32, 32, 5120])
                     Op | DEBUG    |    input_tensor_start = Shape([0, 0, 0, 0])
                     Op | DEBUG    |    pad_value = 0
                     Op | DEBUG    |    output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 1, 32, 5120]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::Transpose" (device)
                     Op | DEBUG    | Composite Parents: {permute}
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    dim = TransposeOpDim::WH
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 32, 32, 5120]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::Transpose" (device)
                     Op | DEBUG    | Composite Parents: {permute}
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    dim = TransposeOpDim::HC
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 32, 5120, 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::UntilizeWithUnpadding" (device)
                     Op | DEBUG    | Composite Parents: {permute}
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    output_tensor_start = Shape([0, 0, 0, 0])
                     Op | DEBUG    |    output_tensor_end = Shape([0, 5119, 0, 31])
                     Op | DEBUG    |    output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 5120, 32, 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
                     Op | DEBUG    |
                     Op | DEBUG    | Using fast pack untilize.
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::TilizeWithValPadding" (device)
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    output_tensor_shape = Shape([1, 5120, 32, 32])
                     Op | DEBUG    |    input_tensor_start = Shape([0, 0, 0, 0])
                     Op | DEBUG    |    pad_value = 0
                     Op | DEBUG    |    output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
                     Op | DEBUG    |    output_dtype = DataType::BFLOAT16
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 5120, 1, 32]),dtype=DataType::BFLOAT16,layout=Layout::ROW_MAJOR)
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::Transpose" (device)
                     Op | DEBUG    | Composite Parents: {repeat_interleave, permute}
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    dim = TransposeOpDim::HC
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 5120, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::Concat" (device)
                     Op | DEBUG    | Composite Parents: {repeat_interleave}
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    dim = 1
                     Op | DEBUG    |    output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0-31: 32 identical input tensors: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 1, 163840, 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::Transpose" (device)
                     Op | DEBUG    | Composite Parents: {repeat_interleave, permute}
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    dim = TransposeOpDim::HC
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 32, 163840, 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::Transpose" (device)
                     Op | DEBUG    | Composite Parents: {repeat_interleave, permute}
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    dim = TransposeOpDim::HC
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 32, 163840, 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::UntilizeWithUnpadding" (device)
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    output_tensor_start = Shape([0, 0, 0, 0])
                     Op | DEBUG    |    output_tensor_end = Shape([0, 163839, 0, 31])
                     Op | DEBUG    |    output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 163840, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
                     Op | DEBUG    |
                     Op | DEBUG    | Using fast pack untilize.
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::TilizeWithValPadding" (device)
                     Op | DEBUG    | Composite Parents: {permute}
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    output_tensor_shape = Shape([1, 163840, 32, 32])
                     Op | DEBUG    |    input_tensor_start = Shape([0, 0, 0, 0])
                     Op | DEBUG    |    pad_value = 0
                     Op | DEBUG    |    output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
                     Op | DEBUG    |    output_dtype = DataType::BFLOAT16
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 163840, 1, 32]),dtype=DataType::BFLOAT16,layout=Layout::ROW_MAJOR)
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::Transpose" (device)
                     Op | DEBUG    | Composite Parents: {permute}
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    dim = TransposeOpDim::HC
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 163840, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::Transpose" (device)
                     Op | DEBUG    | Composite Parents: {permute}
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    dim = TransposeOpDim::WH
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 32, 163809[163840], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "tt::tt_metal::Unpad" (device)
                     Op | DEBUG    | Composite Parents: {permute}
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    output_tensor_start = Shape([0, 0, 0, 0])
                     Op | DEBUG    |    output_tensor_end = Shape([0, 0, 31, 163839])
                     Op | DEBUG    |    output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |    0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 32, 1[32], 163840]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "ttnn.from_device" (external)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "ttnn.to_layout" (external)
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    arg_1 = Layout.ROW_MAJOR
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "ttnn.to_torch" (external)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "(torch) allclose" (external)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "ttnn.from_torch" (external)
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    arg_1 = None
                     Op | DEBUG    |    arg_2 = None
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "ttnn.to_device" (external)
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    arg_1 = <tt_lib.device.Device object at 0x7f60063e16f0>
                     Op | DEBUG    |    memory_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "torch_repeat_interleave" (external)
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    arg_1 = 32
                     Op | DEBUG    |    dim = 3
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "ttnn.from_device" (external)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "ttnn.to_layout" (external)
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    arg_1 = Layout.ROW_MAJOR
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "ttnn.to_torch" (external)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "(torch) repeat_interleave" (external)
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    arg_1 = 32
                     Op | DEBUG    |    dim = 3
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "(torch) __get__" (external)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "(torch) __get__" (external)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "ttnn.from_torch" (external)
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    arg_1 = DataType.BFLOAT16
                     Op | DEBUG    |    arg_2 = None
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "(torch) __get__" (external)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "(torch) contiguous" (external)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "(torch) __get__" (external)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "(torch) numel" (external)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "(torch) data_ptr" (external)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "ttnn.to_device" (external)
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    arg_1 = <tt_lib.device.Device object at 0x7f60063e16f0>
                     Op | DEBUG    |    memory_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "ttnn.from_device" (external)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "ttnn.to_layout" (external)
                     Op | DEBUG    | Attributes:
                     Op | DEBUG    |    arg_1 = Layout.ROW_MAJOR
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "ttnn.to_torch" (external)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
                     Op | DEBUG    | Launching Operation: "(torch) __get__" (external)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
torch.Size([1, 1, 32, 163840])
                     Op | DEBUG    | Launching Operation: "(torch) allclose" (external)
                     Op | DEBUG    | Input Tensors:
                     Op | DEBUG    |
PASSED

=============================== PASSES ===============================
======================= short test summary info =======================
PASSED tests/ttnn/unit_tests/operations/test_repeat_interleave.py::test_f2
========================== 1 passed in 18.95s ==========================
mywoodstock commented 6 months ago

@jliangTT I am not sure how these timing measurements were made. We work with device-time profiling.

jliangTT commented 6 months ago

what is the device time profiling that you see in your example?

jliangTT commented 6 months ago

High-bandwidth discussion happening in Slack: https://tenstorrent.slack.com/archives/C06H3H4E2P8/p1711047993577729 (update: debugging an environment issue to get some code examples working)

jliangTT commented 6 months ago

Update after a lot of debugging. The team is making forward progress but still needs to overcome a few roadblocks:

lots of context here: https://tenstorrent.slack.com/archives/C06H3H4E2P8/p1711047993577729

  1. there is an accuracy issue that only the model team can reproduce; @mywoodstock cannot.
  2. performance of the current repeat_interleave is terrible (4 to 5 seconds). This is seen when using the TTNN op, but we don't expect device time this bad using ttlib; this needs a performance investigation.
  3. depending on (2), we probably need to either a) extend upsample to accept tiled inputs (row-major has a performance impact), or b) write a performant repeat_interleave implementation. A hedged sketch of the upsample route follows this list.
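
For reference, a minimal sketch of the upsample route for use case 2 (the dim=2 repeat). Everything in it is an assumption to be verified: the (N, H, W, C) ordering of the scale tuple is inferred from the `scale_c == 1` assert above, upsample is taken to require ROW_MAJOR NHWC input as noted earlier, and `device` is a placeholder for an open device handle.

import torch
import ttnn

# Use case 2: repeat_interleave((1,1,1,5120), 32, dim=2) viewed as NHWC is a
# W-scale by 32, which upsample does support (scale_n and scale_c must stay 1).
x_torch = torch.randn(1, 1, 1, 5120, dtype=torch.bfloat16)
x = ttnn.from_torch(x_torch, layout=ttnn.ROW_MAJOR_LAYOUT, device=device)
y = ttnn.upsample(x, (1, 1, 32, 1))  # expected output shape (1, 1, 32, 5120)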

Next step:

@tarafdarTT, I will hand this back to you next week to continue the investigation.

mywoodstock commented 6 months ago

@aguhaTT Are you using Grayskull or Wormhole? I have been using Grayskull -- could this explain the difference in the observed PCC error?

aguhaTT commented 6 months ago

using WH.

TT-BrianLiu commented 6 months ago

Why do we even need repeat_interleave?

TT-BrianLiu commented 6 months ago

Did we review this op in yesterday's meeting? What was the conclusion?

ntarafdar commented 6 months ago

Did we review this op in yesterday's meeting? What was the conclusion?

The conclusion is that @apalaguha agreed we can use upsample, but it will still have limited performance because upsample lacks tiled support.

So the first step would be for @apalaguha to use upsample (my takeaway is that this is not done yet?), and in parallel I can fix upsample to support tiles.

jliangTT commented 6 months ago

@apalaguha to use upsample (my takeaway is that this is not done yet?), and in parallel I can fix upsample to support tiles

@apalaguha did this last week but ran into accuracy issue #6791; despite that, I think she has some results you can review. Having said that, I feel like "upsample to support tiles" is the feature work we agreed to deliver.

ntarafdar commented 6 months ago

@apalaguha , I was just talking to @TT-BrianLiu , and wanted to discuss if we really need this op. Could we do a multicast to the cores that require the output of the repeat_interleave? This would also save the buffer space of doing an entire op. We did something similar in Falcon with a grouped-attention matmul. The model had a repeat_interleave, but we just multicast the output.

TT-BrianLiu commented 6 months ago

@apalaguha , I was just talking to @TT-BrianLiu , and wanted to discuss if we really need this op. Could we do a multicast to the cores that require the output of the repeat_interleave? This would also save the buffer space of doing an entire op. We did something similar in Falcon with a grouped-attention matmul. The model had a repeat_interleave, but we just multicast the output.

Who is consuming the repeat_interleaved data? If it's a batched matmul, can it be turned into a regular matmul with enough parallelization across devices?

aguhaTT commented 6 months ago

repeat_interleave result is consumed by elementwise mul. Adding @kpaigwar and @esmalTT .

kpaigwar commented 6 months ago

This is the series of ops happening:

A = (1, 1, B, E)
B = (1, 1, B, E*N)
A' = repeat_interleave(A, N, dim=3)
C = eltwise_mul(A', B)
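
In plain torch terms the pattern is as follows (sizes are hypothetical; the model uses E*N = 5120, and the tensor called B above is renamed Bt here to avoid clashing with the batch dim):

import torch

Bsz, E, N = 32, 160, 32  # hypothetical sizes; Bt's last dim is E*N
A = torch.randn(1, 1, Bsz, E)
Bt = torch.randn(1, 1, Bsz, E * N)

A_rep = torch.repeat_interleave(A, N, dim=3)  # each element of A repeated N consecutive times
C = A_rep * Bt  # elementwise multiply at the expanded size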
kpaigwar commented 6 months ago

@tarafdarTT can you share the multicast example you are referring to? That could work for us if it's multicasting data in an interleaved fashion.

TT-BrianLiu commented 6 months ago

How is this currently implemented?

kpaigwar commented 6 months ago

https://github.com/tenstorrent-metal/tt-metal/blob/1d90cbed375742dbd8ac3d3b9fee71a774bf2499/models/experimental/mamba/tt_opt/mamba_one_step_ssm.py#L135

aguhaTT commented 6 months ago

repeat_interleave result is consumed by elementwise mul. Adding @kpaigwar and @esmalTT .

would it be possible to replace repeat_interleave + mul with a broadcast mul?

kpaigwar commented 6 months ago

repeat_interleave result is consumed by elementwise mul. Adding @kpaigwar and @esmalTT .

would it be possible to replace repeat_interleave + mul with a broadcast mul?

We tried that, but it only broadcasts the element at idx=0.
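
For what it's worth, a plain broadcast only applies when the broadcast operand has size 1 on the broadcast dim, which is consistent with only the element at idx=0 being used. The broadcast form that actually matches repeat_interleave + mul needs the E*N dim viewed as E groups of N, as in this torch sketch (hypothetical sizes):

import torch

Bsz, E, N = 32, 160, 32  # hypothetical sizes
A = torch.randn(1, 1, Bsz, E)
Bt = torch.randn(1, 1, Bsz, E * N)

ref = torch.repeat_interleave(A, N, dim=3) * Bt

# View Bt's last dim as (E, N); A gains a trailing size-1 dim so it
# broadcasts over the N copies, then flatten back to E*N.
alt = (A.unsqueeze(-1) * Bt.view(1, 1, Bsz, E, N)).reshape(1, 1, Bsz, E * N)

assert torch.allclose(ref, alt)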