Closed aguhaTT closed 5 months ago
Assigning to Naif to take a look.
@aguhaTT , how does this compare to #5428 in terms of performance ROI?
@aguhaTT , we have support for upsample. could we use that instead?
will try it and report today
upsample only implements row-major layout. Our model is in tiled layout.
Ah, I see. Is that the only limitation? So if we were to add tiled support to upsample, would that be sufficient? I'm trying to gauge whether to add tiled support to this or to add a new repeat_interleave op from scratch.
I am getting the following functionality error with upsample:
delta_t3_0 = ttnn.upsample(delta_t2_1, (1,1,1,32))
gives
assert scale_c == 1, "scale_c should be 1"
And this is after you converted your input to row-major? I can have a look at it.
yes
@mywoodstock , I think you worked on upsample before. We wanted to use upsample as a way to do repeat_interleave, and @aguhaTT ran into the bug above. I was wondering whether it's an obvious fix. If not, I can take an in-depth look at it myself.
upsample only supports scaling on H and W, not C. Also, it expects the input tensor in NHWC format.
see: https://github.com/tenstorrent-metal/tt-metal/blob/main/ttnn/ttnn/operations/others.py#L357-L389
this was the original issue for upsample: https://github.com/tenstorrent-metal/tt-metal/issues/4487
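For intuition, the same constraint exists in plain PyTorch: nearest-neighbor interpolation scales only the spatial dims, and repeating along C needs repeat_interleave. This is an illustrative torch sketch, not ttnn code:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 4, 4)  # NCHW

# interpolate scales H and W only; C stays 3
y = F.interpolate(x, scale_factor=2, mode="nearest")
assert y.shape == (1, 3, 8, 8)

# repeating along C is a different op entirely: repeat_interleave
z = torch.repeat_interleave(x, 2, dim=1)
assert z.shape == (1, 6, 4, 4)
```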
@aguhaTT @mywoodstock correct me if I'm wrong:
1. repeat_interleave((1,1,32,5120), 32, dim=3) is a scale on W
2. repeat_interleave((1,1,1,5120), 32, dim=2) is a scale on H
And if I'm correct, then this should be sufficient.
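If that mapping is right, it can be sanity-checked in plain torch (an illustrative sketch using the shapes from this thread; nearest upsampling copies values, so it should agree with repeat_interleave exactly):

```python
import torch
import torch.nn.functional as F

# Case 1: repeat_interleave on dim=3 == nearest upsample with scale (1, 32) on W
a = torch.randn(1, 1, 32, 5120)
r1 = torch.repeat_interleave(a, 32, dim=3)
u1 = F.interpolate(a, scale_factor=(1, 32), mode="nearest")
assert torch.equal(r1, u1)  # exact equality: nearest copies, no arithmetic

# Case 2: repeat_interleave on dim=2 == nearest upsample with scale (32, 1) on H
b = torch.randn(1, 1, 1, 5120)
r2 = torch.repeat_interleave(b, 32, dim=2)
u2 = F.interpolate(b, scale_factor=(32, 1), mode="nearest")
assert torch.equal(r2, u2)
```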
Yes, that is my understanding too. I am not sure why I got the error.
Could you point us to the Python code showing how you're calling it?
I will post a small test script.
Here is the test script to reproduce the error. Uploading as a zip file because GitHub does not allow uploading a Python file. test_upsample.py.zip
@aguhaTT Please see my comment above. Upsample needs the tensor to be in NHWC and only supports scaling on H and W. In the test you are scaling C.
I will try this and see how it affects the perf. It's not ideal for us to use row-major or dim=1.
Yeah, upsample does not support TILE. If that is necessary, then it needs to be implemented. Also, using NHWC is just a transpose; why is it not usable?
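To illustrate the transpose point in plain torch (a sketch, not ttnn code): NCHW to NHWC is a single permute, and it round-trips losslessly.

```python
import torch

x = torch.randn(1, 8, 4, 4)      # NCHW
nhwc = x.permute(0, 2, 3, 1)     # NCHW -> NHWC: one layout permute
assert nhwc.shape == (1, 4, 4, 8)

back = nhwc.permute(0, 3, 1, 2)  # NHWC -> NCHW recovers the original
assert torch.equal(back, x)
```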
Assigning to @mywoodstock to drive the upsample path. If it needs to involve the TM team (@tarafdarTT / @yan-zaretskiy), please assign it back to me right away.
Also, repeat_interleave exists in ttlib as a composite op. It might not have been uplifted to TTNN.
https://github.com/tenstorrent-metal/tt-metal/blob/main/tt_eager/tt_dnn/op_library/composite/composite_ops.cpp#L754
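For reference, a repeat_interleave along the last dim can be composed from unsqueeze + expand + reshape. This plain-torch sketch shows the idea only; it is not necessarily how the ttlib composite op is implemented:

```python
import torch

def repeat_interleave_last_dim(t: torch.Tensor, repeats: int) -> torch.Tensor:
    # (..., W) -> (..., W, 1) -> (..., W, repeats) -> (..., W * repeats)
    expanded = t.unsqueeze(-1).expand(*t.shape, repeats)
    return expanded.reshape(*t.shape[:-1], t.shape[-1] * repeats)

t = torch.arange(6).reshape(1, 1, 2, 3)
assert torch.equal(repeat_interleave_last_dim(t, 2),
                   torch.repeat_interleave(t, 2, dim=3))
```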
Yes, it is usable, but C=1 in our case, so padding will be involved and will have perf implications.
I see. With ROW_MAJOR you don't normally need padding, but since C is 1, I think the minimum padding you would need in this case is to 16.
@aguhaTT , can you quantify the performance impact of using row_major + minimum padding to 16?
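A back-of-envelope look at just the data-volume side of that question (illustrative arithmetic only, assuming the (1, 32, 5120, 1) NHWC bfloat16 shape from this thread; real perf depends on much more than bytes moved):

```python
# Data-volume overhead of padding C=1 up to C=16 for a bfloat16 NHWC tensor.
n, h, w, c = 1, 32, 5120, 1
padded_c = 16
bytes_per_elem = 2  # bfloat16

orig_bytes = n * h * w * c * bytes_per_elem
padded_bytes = n * h * w * padded_c * bytes_per_elem
overhead = padded_bytes / orig_bytes
print(f"{orig_bytes} B -> {padded_bytes} B ({overhead:.0f}x)")
# 327680 B -> 5242880 B (16x)
```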
@jliangTT I had several issues, so I can only give you this info tomorrow.
Yeah, feel free to pull @mywoodstock into a Slack thread or a quick call.
@aguhaTT what issues did you get?
I see that ttnn.repeat_interleave also exists; can you use that?
Here is my test case for assessing accuracy as well as performance of upsample and repeat_interleave. The upsample and repeat_interleave fallbacks give correct results, but have unacceptable latencies of 14 ms and 4 ms respectively. repeat_interleave with transposes (designed to run on device) gave incorrect results.
Archive.zip
Please see the following. Both repeat_interleave options give correct results:
t = torch.randn((1, 1, 32, 5120), dtype=torch.bfloat16)

## golden
torch_result = torch.repeat_interleave(t, 32, dim=3)

## option 1: move W to dim 0, repeat_interleave there, permute back
delta_orig = ttnn.from_torch(t, device=device, layout=ttnn.TILE_LAYOUT, memory_config=ttnn.L1_MEMORY_CONFIG)
delta_orig = ttnn.permute(delta_orig, (3, 0, 1, 2))
repeat_interleaved_output = ttnn.repeat_interleave(delta_orig, 32, dim=0)
repeat_interleaved_output = ttnn.permute(repeat_interleaved_output, (1, 2, 3, 0))
repeat_interleaved_output = ttnn.to_torch(repeat_interleaved_output)
assert torch.allclose(torch_result, repeat_interleaved_output)

## option 2: move W to dim 1, repeat_interleave there, permute back
delta_orig = ttnn.from_torch(t, device=device, layout=ttnn.TILE_LAYOUT, memory_config=ttnn.L1_MEMORY_CONFIG)
delta_orig = ttnn.permute(delta_orig, (0, 3, 1, 2))
repeat_interleaved_output = ttnn.repeat_interleave(delta_orig, 32, dim=1)
repeat_interleaved_output = ttnn.permute(repeat_interleaved_output, (0, 2, 3, 1))
repeat_interleaved_output = ttnn.to_torch(repeat_interleaved_output)
assert torch.allclose(torch_result, repeat_interleaved_output)

## fallback: repeat_interleave directly on dim 3
delta_orig = ttnn.from_torch(t, device=device, layout=ttnn.TILE_LAYOUT, memory_config=ttnn.L1_MEMORY_CONFIG)
delta_repeat = ttnn.repeat_interleave(delta_orig, 32, dim=3)
delta_repeat_out = ttnn.to_torch(delta_repeat)
print(delta_repeat_out.shape)
assert torch.allclose(torch_result, delta_repeat_out)
Unfortunately, I am getting an assert failure at assert torch.allclose(torch_result, repeat_interleaved_output). I downloaded and built Metal about an hour ago, so I think I have the latest version. I am not sure what is going on.
Traceback (most recent call last):
File "models/experimental/mamba/tests/test_upsample.py", line 87, in <module>
main()
File "models/experimental/mamba/tests/test_upsample.py", line 83, in main
f()
File "models/experimental/mamba/tests/test_upsample.py", line 20, in f
assert torch.allclose(torch_result, repeat_interleaved_output)
AssertionError
Device | INFO | Closing user mode device drivers
Always | FATAL | Process tear down with device 0 still active
terminate called after throwing an instance of 'std::runtime_error'
what(): TT_THROW @ tt_metal/impl/device/device.cpp:35: tt::exception
info:
Process tear down with device {} still active
0
backtrace:
--- /lib/x86_64-linux-gnu/libc.so.6(+0x468a7) [0x7f87000a18a7]
--- /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7f87000a1a60]
--- /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7f870007f08a]
--- python(_start+0x2e) [0x5da67e]
Aborted (core dumped)
@mywoodstock , are you able to reproduce the 14 ms and 4 ms latencies that @apalaguha saw?
@aguhaTT , do you mind checking in #tt-metal-developer on your assertion?
@aguhaTT I tried the exact same code (just copy-pasted) and it works fine. There might be some issue with your dev env?
2024-03-21 22:53:42.661 | INFO | SiliconDriver - Detected 1 PCI device : {0}
Always | DEBUG | Initializing firmware
Always | DEBUG | Waiting for firmware init complete
Always | DEBUG | Firmware init complete
Op | DEBUG | Launching Operation: "ttnn.from_torch" (external)
Op | DEBUG | Attributes:
Op | DEBUG | arg_1 = None
Op | DEBUG | arg_2 = None
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "ttnn.to_device" (external)
Op | DEBUG | Attributes:
Op | DEBUG | arg_1 = <tt_lib.device.Device object at 0x7f60063e16f0>
Op | DEBUG | memory_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt)
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "tt::tt_metal::Pad" (device)
Op | DEBUG | Composite Parents: {permute}
Op | DEBUG | Attributes:
Op | DEBUG | output_tensor_shape = Shape([1, 32, 32, 5120])
Op | DEBUG | input_tensor_start = Shape([0, 0, 0, 0])
Op | DEBUG | pad_value = 0
Op | DEBUG | output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt)
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 1, 32, 5120]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG |
Op | DEBUG | Launching Operation: "tt::tt_metal::Transpose" (device)
Op | DEBUG | Composite Parents: {permute}
Op | DEBUG | Attributes:
Op | DEBUG | dim = TransposeOpDim::WH
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 32, 32, 5120]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG |
Op | DEBUG | Launching Operation: "tt::tt_metal::Transpose" (device)
Op | DEBUG | Composite Parents: {permute}
Op | DEBUG | Attributes:
Op | DEBUG | dim = TransposeOpDim::HC
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 32, 5120, 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG |
Op | DEBUG | Launching Operation: "tt::tt_metal::Transpose" (device)
Op | DEBUG | Composite Parents: {permute}
Op | DEBUG | Attributes:
Op | DEBUG | dim = TransposeOpDim::CN
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 5120, 32, 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG |
Op | DEBUG | Launching Operation: "tt::tt_metal::UntilizeWithUnpadding" (device)
Op | DEBUG | Composite Parents: {permute}
Op | DEBUG | Attributes:
Op | DEBUG | output_tensor_start = Shape([0, 0, 0, 0])
Op | DEBUG | output_tensor_end = Shape([5119, 0, 0, 31])
Op | DEBUG | output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 32, 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG |
Op | DEBUG | Using fast pack untilize.
Op | DEBUG | Launching Operation: "tt::tt_metal::TilizeWithValPadding" (device)
Op | DEBUG | Attributes:
Op | DEBUG | output_tensor_shape = Shape([5120, 1, 32, 32])
Op | DEBUG | input_tensor_start = Shape([0, 0, 0, 0])
Op | DEBUG | pad_value = 0
Op | DEBUG | output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
Op | DEBUG | output_dtype = DataType::BFLOAT16
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1, 32]),dtype=DataType::BFLOAT16,layout=Layout::ROW_MAJOR)
Op | DEBUG |
Op | DEBUG | Launching Operation: "tt::tt_metal::Concat" (device)
Op | DEBUG | Composite Parents: {repeat_interleave}
Op | DEBUG | Attributes:
Op | DEBUG | dim = 1
Op | DEBUG | output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 1: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 2: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 3: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 4: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 5: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 6: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 7: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 8: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 9: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 10: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 11: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 12: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 13: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 14: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 15: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 16: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 17: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 18: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 19: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 20: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 21: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 22: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 23: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 24: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 25: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 26: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 27: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 28: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 29: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 30: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 31: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([5120, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG |
Op | DEBUG | Launching Operation: "tt::tt_metal::UntilizeWithUnpadding" (device)
Op | DEBUG | Attributes:
Op | DEBUG | output_tensor_start = Shape([0, 0, 0, 0])
Op | DEBUG | output_tensor_end = Shape([163839, 0, 0, 31])
Op | DEBUG | output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([163840, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG |
Op | DEBUG | Using fast pack untilize.
Op | DEBUG | Launching Operation: "tt::tt_metal::TilizeWithValPadding" (device)
Op | DEBUG | Composite Parents: {permute}
Op | DEBUG | Attributes:
Op | DEBUG | output_tensor_shape = Shape([163840, 1, 32, 32])
Op | DEBUG | input_tensor_start = Shape([0, 0, 0, 0])
Op | DEBUG | pad_value = 0
Op | DEBUG | output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
Op | DEBUG | output_dtype = DataType::BFLOAT16
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([163840, 1, 1, 32]),dtype=DataType::BFLOAT16,layout=Layout::ROW_MAJOR)
Op | DEBUG |
Op | DEBUG | Launching Operation: "tt::tt_metal::Transpose" (device)
Op | DEBUG | Composite Parents: {permute}
Op | DEBUG | Attributes:
Op | DEBUG | dim = TransposeOpDim::CN
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([163840, 1, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG |
Op | DEBUG | Launching Operation: "tt::tt_metal::Transpose" (device)
Op | DEBUG | Composite Parents: {permute}
Op | DEBUG | Attributes:
Op | DEBUG | dim = TransposeOpDim::HC
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 163840, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG |
Op | DEBUG | Launching Operation: "tt::tt_metal::Transpose" (device)
Op | DEBUG | Composite Parents: {permute}
Op | DEBUG | Attributes:
Op | DEBUG | dim = TransposeOpDim::WH
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 32, 163809[163840], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG |
Op | DEBUG | Launching Operation: "tt::tt_metal::Unpad" (device)
Op | DEBUG | Composite Parents: {permute}
Op | DEBUG | Attributes:
Op | DEBUG | output_tensor_start = Shape([0, 0, 0, 0])
Op | DEBUG | output_tensor_end = Shape([0, 0, 31, 163839])
Op | DEBUG | output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 32, 1[32], 163840]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG |
Op | DEBUG | Launching Operation: "ttnn.from_device" (external)
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "ttnn.to_layout" (external)
Op | DEBUG | Attributes:
Op | DEBUG | arg_1 = Layout.ROW_MAJOR
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "ttnn.to_torch" (external)
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "(torch) allclose" (external)
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "ttnn.from_torch" (external)
Op | DEBUG | Attributes:
Op | DEBUG | arg_1 = None
Op | DEBUG | arg_2 = None
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "ttnn.to_device" (external)
Op | DEBUG | Attributes:
Op | DEBUG | arg_1 = <tt_lib.device.Device object at 0x7f60063e16f0>
Op | DEBUG | memory_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt)
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "tt::tt_metal::Pad" (device)
Op | DEBUG | Composite Parents: {permute}
Op | DEBUG | Attributes:
Op | DEBUG | output_tensor_shape = Shape([1, 32, 32, 5120])
Op | DEBUG | input_tensor_start = Shape([0, 0, 0, 0])
Op | DEBUG | pad_value = 0
Op | DEBUG | output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt)
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 1, 32, 5120]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG |
Op | DEBUG | Launching Operation: "tt::tt_metal::Transpose" (device)
Op | DEBUG | Composite Parents: {permute}
Op | DEBUG | Attributes:
Op | DEBUG | dim = TransposeOpDim::WH
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 32, 32, 5120]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG |
Op | DEBUG | Launching Operation: "tt::tt_metal::Transpose" (device)
Op | DEBUG | Composite Parents: {permute}
Op | DEBUG | Attributes:
Op | DEBUG | dim = TransposeOpDim::HC
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 32, 5120, 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG |
Op | DEBUG | Launching Operation: "tt::tt_metal::UntilizeWithUnpadding" (device)
Op | DEBUG | Composite Parents: {permute}
Op | DEBUG | Attributes:
Op | DEBUG | output_tensor_start = Shape([0, 0, 0, 0])
Op | DEBUG | output_tensor_end = Shape([0, 5119, 0, 31])
Op | DEBUG | output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 5120, 32, 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG |
Op | DEBUG | Using fast pack untilize.
Op | DEBUG | Launching Operation: "tt::tt_metal::TilizeWithValPadding" (device)
Op | DEBUG | Attributes:
Op | DEBUG | output_tensor_shape = Shape([1, 5120, 32, 32])
Op | DEBUG | input_tensor_start = Shape([0, 0, 0, 0])
Op | DEBUG | pad_value = 0
Op | DEBUG | output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
Op | DEBUG | output_dtype = DataType::BFLOAT16
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 5120, 1, 32]),dtype=DataType::BFLOAT16,layout=Layout::ROW_MAJOR)
Op | DEBUG |
Op | DEBUG | Launching Operation: "tt::tt_metal::Transpose" (device)
Op | DEBUG | Composite Parents: {repeat_interleave, permute}
Op | DEBUG | Attributes:
Op | DEBUG | dim = TransposeOpDim::HC
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 5120, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG |
Op | DEBUG | Launching Operation: "tt::tt_metal::Concat" (device)
Op | DEBUG | Composite Parents: {repeat_interleave}
Op | DEBUG | Attributes:
Op | DEBUG | dim = 1
Op | DEBUG | output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 1, 163840, 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG | 	           1–31: (31 further input tensors, identical to input 0: ttnn.Shape([1, 1, 163840, 32]), DataType::BFLOAT16, Layout::TILE, DRAM interleaved)
Op | DEBUG |
Op | DEBUG | Launching Operation: "tt::tt_metal::Transpose" (device)
Op | DEBUG | Composite Parents: {repeat_interleave, permute}
Op | DEBUG | Attributes:
Op | DEBUG | dim = TransposeOpDim::HC
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 32, 163840, 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG |
Op | DEBUG | Launching Operation: "tt::tt_metal::Transpose" (device)
Op | DEBUG | Composite Parents: {repeat_interleave, permute}
Op | DEBUG | Attributes:
Op | DEBUG | dim = TransposeOpDim::HC
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 32, 163840, 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG |
Op | DEBUG | Launching Operation: "tt::tt_metal::UntilizeWithUnpadding" (device)
Op | DEBUG | Attributes:
Op | DEBUG | output_tensor_start = Shape([0, 0, 0, 0])
Op | DEBUG | output_tensor_end = Shape([0, 163839, 0, 31])
Op | DEBUG | output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 163840, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG |
Op | DEBUG | Using fast pack untilize.
Op | DEBUG | Launching Operation: "tt::tt_metal::TilizeWithValPadding" (device)
Op | DEBUG | Composite Parents: {permute}
Op | DEBUG | Attributes:
Op | DEBUG | output_tensor_shape = Shape([1, 163840, 32, 32])
Op | DEBUG | input_tensor_start = Shape([0, 0, 0, 0])
Op | DEBUG | pad_value = 0
Op | DEBUG | output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
Op | DEBUG | output_dtype = DataType::BFLOAT16
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 163840, 1, 32]),dtype=DataType::BFLOAT16,layout=Layout::ROW_MAJOR)
Op | DEBUG |
Op | DEBUG | Launching Operation: "tt::tt_metal::Transpose" (device)
Op | DEBUG | Composite Parents: {permute}
Op | DEBUG | Attributes:
Op | DEBUG | dim = TransposeOpDim::HC
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 163840, 1[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG |
Op | DEBUG | Launching Operation: "tt::tt_metal::Transpose" (device)
Op | DEBUG | Composite Parents: {permute}
Op | DEBUG | Attributes:
Op | DEBUG | dim = TransposeOpDim::WH
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 32, 163809[163840], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG |
Op | DEBUG | Launching Operation: "tt::tt_metal::Unpad" (device)
Op | DEBUG | Composite Parents: {permute}
Op | DEBUG | Attributes:
Op | DEBUG | output_tensor_start = Shape([0, 0, 0, 0])
Op | DEBUG | output_tensor_end = Shape([0, 0, 31, 163839])
Op | DEBUG | output_mem_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
Op | DEBUG | Input Tensors:
Op | DEBUG | 0: tt::tt_metal::Tensor(storage=tt::tt_metal::DeviceStorage(memory_config=tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 32, 1[32], 163840]),dtype=DataType::BFLOAT16,layout=Layout::TILE)
Op | DEBUG |
Op | DEBUG | Launching Operation: "ttnn.from_device" (external)
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "ttnn.to_layout" (external)
Op | DEBUG | Attributes:
Op | DEBUG | arg_1 = Layout.ROW_MAJOR
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "ttnn.to_torch" (external)
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "(torch) allclose" (external)
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "ttnn.from_torch" (external)
Op | DEBUG | Attributes:
Op | DEBUG | arg_1 = None
Op | DEBUG | arg_2 = None
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "ttnn.to_device" (external)
Op | DEBUG | Attributes:
Op | DEBUG | arg_1 = <tt_lib.device.Device object at 0x7f60063e16f0>
Op | DEBUG | memory_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt)
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "torch_repeat_interleave" (external)
Op | DEBUG | Attributes:
Op | DEBUG | arg_1 = 32
Op | DEBUG | dim = 3
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "ttnn.from_device" (external)
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "ttnn.to_layout" (external)
Op | DEBUG | Attributes:
Op | DEBUG | arg_1 = Layout.ROW_MAJOR
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "ttnn.to_torch" (external)
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "(torch) repeat_interleave" (external)
Op | DEBUG | Attributes:
Op | DEBUG | arg_1 = 32
Op | DEBUG | dim = 3
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "(torch) __get__" (external)
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "(torch) __get__" (external)
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "ttnn.from_torch" (external)
Op | DEBUG | Attributes:
Op | DEBUG | arg_1 = DataType.BFLOAT16
Op | DEBUG | arg_2 = None
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "(torch) __get__" (external)
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "(torch) contiguous" (external)
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "(torch) __get__" (external)
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "(torch) numel" (external)
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "(torch) data_ptr" (external)
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "ttnn.to_device" (external)
Op | DEBUG | Attributes:
Op | DEBUG | arg_1 = <tt_lib.device.Device object at 0x7f60063e16f0>
Op | DEBUG | memory_config = tt::tt_metal::MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "ttnn.from_device" (external)
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "ttnn.to_layout" (external)
Op | DEBUG | Attributes:
Op | DEBUG | arg_1 = Layout.ROW_MAJOR
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "ttnn.to_torch" (external)
Op | DEBUG | Input Tensors:
Op | DEBUG |
Op | DEBUG | Launching Operation: "(torch) __get__" (external)
Op | DEBUG | Input Tensors:
Op | DEBUG |
torch.Size([1, 1, 32, 163840])
Op | DEBUG | Launching Operation: "(torch) allclose" (external)
Op | DEBUG | Input Tensors:
Op | DEBUG |
PASSED
======================================== PASSES ========================================
================================ short test summary info ================================
PASSED tests/ttnn/unit_tests/operations/test_repeat_interleave.py::test_f2
================================== 1 passed in 18.95s ===================================
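Reading the trace above: on tiled tensors the composite repeat_interleave lowers to permute → Concat of N copies → permute, which is why the Concat op receives 32 identical inputs. A torch-level sketch of the same cat-then-flatten idea (my reconstruction for illustration, last dim only — not the actual ttnn lowering):

```python
import torch

def repeat_interleave_via_cat(x: torch.Tensor, n: int) -> torch.Tensor:
    """Repeat each element of the last dim n times using cat, mirroring the
    concat-of-copies lowering seen in the trace. Last dim only."""
    # (..., E) -> (..., E, n): n copies stacked along a fresh trailing axis
    stacked = torch.cat([x.unsqueeze(-1)] * n, dim=-1)
    # (..., E, n) -> (..., E * n): flattening interleaves the copies
    return stacked.reshape(*x.shape[:-1], x.shape[-1] * n)

x = torch.randn(1, 1, 32, 5120)  # input shape taken from the trace above
out = repeat_interleave_via_cat(x, 32)
assert torch.equal(out, torch.repeat_interleave(x, 32, dim=3))
```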
@jliangTT I am not sure how these timing measurements are made. We work with device time profiling.
What is the device time profiling that you see in your example?
High-bandwidth discussion happening in Slack: https://tenstorrent.slack.com/archives/C06H3H4E2P8/p1711047993577729 (update: debugging environment issue to get some code examples working)
Update after a lot of debugging. The team is making forward progress but still needs to overcome a few roadblocks:
lots of context here: https://tenstorrent.slack.com/archives/C06H3H4E2P8/p1711047993577729
Next step:
@tarafdarTT, will hand this back to you next week to continue the investigation.
@aguhaTT Are you using Grayskull or Wormhole? I have been using Grayskull -- this might explain the difference in the observed PCC error?
using WH.
Why do we even need repeat_interleave?
Did we review this op in yesterday's meeting? What was the conclusion?
The conclusion is that @apalaguha agreed we can use upsample, but it will still have limited performance because upsample does not have tiled support.
So I think first would be for @apalaguha to use upsample (my takeaway is that is not done yet?) , and in parallel I can fix upsample to support tiles
@apalaguha did this last week but ran into accuracy issue #6791. Despite that, I think she has some results you can review. Having said that, I feel "upsample to support tiles" is the feature work we agree to deliver.
@apalaguha , I was just talking to @TT-BrianLiu , and wanted to discuss whether we really need this op. Could we do a multi-cast to the cores that require the output of the repeat-interleave? This would also save the buffer space of running an entire op. We did something similar in Falcon with a group attention matmul: the model had a repeat interleave, but we just multi-cast the output.
Who is consuming the repeat_interleaved data? If it's a batched matmul, can it be turned into a regular matmul with enough parallelization across devices?
The repeat_interleave result is consumed by an elementwise mul. Adding @kpaigwar and @esmalTT .
This is the series of ops happening:
A = (1, 1, B, E)
B = (1, 1, B, E*N)
A' = repeat_interleave(A, N, dim=3)
C = eltwise_mul(A', B)
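The series above as a runnable torch reference (sizes are hypothetical, chosen so E*N matches the shapes in the trace):

```python
import torch

b, E, N = 32, 5120, 32                # hypothetical sizes: rows, width, repeat factor
A = torch.randn(1, 1, b, E)
B = torch.randn(1, 1, b, E * N)

# A' = repeat_interleave(A, N, dim=3): each column of A repeated N times
A_rep = torch.repeat_interleave(A, N, dim=3)   # (1, 1, b, E*N)

# C = eltwise_mul(A', B)
C = A_rep * B                                  # (1, 1, b, E*N)
```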
@tarafdarTT can you share the multi-cast example you are referring to? That can work for us if it's multi-casting data in interleaved fashion.
How is this currently implemented?
Could it be possible to replace repeat_interleave + mul with a broadcast mul?
We tried that, but it only broadcasts the element at idx=0.
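For what it's worth, a broadcast mul can reproduce repeat_interleave + mul exactly if B is first viewed with the repeat factor as a trailing axis; naively broadcasting A against B does not, which matches the idx=0 behavior described. A small torch check (hypothetical sizes, illustration only):

```python
import torch

b, E, N = 2, 4, 3
A = torch.randn(1, 1, b, E)
B = torch.randn(1, 1, b, E * N)

# Reference: repeat_interleave followed by elementwise mul
ref = torch.repeat_interleave(A, N, dim=3) * B

# Broadcast form: view B as (..., E, N) so each group of N columns lines up
# with one element of A, broadcast A along the new trailing axis, flatten back.
alt = (A.unsqueeze(-1) * B.reshape(1, 1, b, E, N)).reshape(1, 1, b, E * N)

assert torch.allclose(ref, alt)
```

Whether this maps onto a single on-device bcast op is a separate question, but mathematically the repeat is removable.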
repeat_interleave is one of our biggest ops in Mamba. We need it to work on device. Use cases: