tenstorrent / pytorch2.0_ttnn

⭐️ TTNN Compiler for PyTorch 2.0 ⭐️ It enables running PyTorch2.0 models on Tenstorrent hardware
https://tenstorrent.github.io/tt-metal/latest/ttnn/

non-interleaved buffers page size 38 not equal buffer size 720 #361

Open swimdi opened 3 days ago

swimdi commented 3 days ago

When I debug the XGLM model test, I get this error message:

self = FastOperation(python_fully_qualified_name='ttnn.to_device', function=<built-in method to_device of PyCapsule object at...function default_postprocess_golden_function_outputs at 0x7fc169242ca0>, is_cpp_operation=False, is_experimental=False)
function_args = (ttnn.Tensor([[[[ 0.00000,  0.00000,  ...,  0.00000,  0.00000],
               [ 0.00000,  0.00000,  ...,  0.00000,  0... 1, 19, 19]), dtype=DataType::BFLOAT16, layout=Layout::ROW_MAJOR), <ttnn._ttnn.device.Device object at 0x7fc1377b8670>)
function_kwargs = {'memory_config': MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)}

    def __call__(self, *function_args, **function_kwargs):
>       return self.function(*function_args, **function_kwargs)
E       RuntimeError: TT_FATAL @ /tmp/build-via-sdist-n2_feo7u/metal_libs-0.53.0rc28+wormhole.b0/tt_metal/impl/buffers/buffer.cpp:49: valid_page_size
E       info:
E       For valid non-interleaved buffers page size 38 must equal buffer size 720. For interleaved-buffers page size should be divisible by buffer size

If I block aten.masked_fill.Scalar with the input variations ["Tensor<[1, 1, 19, 19]> self = ?", "Tensor<[1, 1, 19, 19]> mask = ?", "number value = -3.3895313892515355e+38"], then the error message above is not shown, so I guess something is wrong with the masked_fill lowering.
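For reference, here is a minimal torch-level sketch of the blocked op pattern (the shapes and fill value are taken from the blocklist entry; the zero tensors are just placeholders, not the actual XGLM inputs):

import torch

# Placeholder inputs with the shapes/value from the blocklist entry
self_tensor = torch.zeros((1, 1, 19, 19), dtype=torch.bfloat16)
mask = torch.zeros((1, 1, 19, 19), dtype=torch.bool)
out = torch.ops.aten.masked_fill.Scalar(self_tensor, mask, -3.3895313892515355e+38)
print(out.shape)  # torch.Size([1, 1, 19, 19])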

The steps to reproduce are:

  1. Remove this block in to_tt_guard.py
    aten_masked_fill_scalar_blocklist += [
    ["Tensor<[1, 1, 19, 19]> self = ?", "Tensor<[1, 1, 19, 19]> mask = ?", "number value = -3.3895313892515355e+38"]
    ]
  2. pytest tests/models/xglm/test_xglm.py

After this issue is resolved, please also remove the related blocklist in to_tt_guard.py.

swimdi commented 3 days ago

swin_b also has a similar issue with aten.masked_fill.Scalar; the error message is:

self = FastOperation(python_fully_qualified_name='ttnn.to_device', function=<built-in method to_device of PyCapsule object at...function default_postprocess_golden_function_outputs at 0x7fe6d1be9d30>, is_cpp_operation=False, is_experimental=False)
function_args = (ttnn.Tensor([[[ 0.00000,  0.00000,  ...,  0.00000,  0.00000],
              [ 0.00000,  0.00000,  ...,  0.00000,  0.0...64, 49, 49]), dtype=DataType::BFLOAT16, layout=Layout::ROW_MAJOR), <ttnn._ttnn.device.Device object at 0x7fe69fa929f0>)
function_kwargs = {'memory_config': MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)}

    def __call__(self, *function_args, **function_kwargs):
>       return self.function(*function_args, **function_kwargs)
E       RuntimeError: TT_FATAL @ /tmp/build-via-sdist-n2_feo7u/metal_libs-0.53.0rc28+wormhole.b0/tt_metal/impl/buffers/buffer.cpp:52: page_size % sizeof(uint32_t) == 0

The steps to reproduce are:

  1. Remove this line in to_tt_guard.py
    ["Tensor<[64, 49, 49]> self = ?", "Tensor<[64, 49, 49]> mask = ?", "number value = -100.0"],
  2. Only keep [("swin_b", "Swin_B_Weights"), "eval"] in test_torchvision_image_classification.py
  3. pytest tests/models/torchvision/test_torchvision_image_classification.py

jerrysky3 commented 2 days ago

From the dump of the model IR, I think it is because we are trying to load a (1, 1, 19, 19) tensor to device in ROW_MAJOR layout, which is not supported in ttnn: each page needs to align to sizeof(uint32_t), and in ROW_MAJOR layout each row is one page, so we end up trying to create a buffer with a page size of 38 bytes.

ttnn_from_torch_3 = ttnn_decorators_ttnn_from_torch(arg390_1, layout = ttnn_ROW_MAJOR_LAYOUT, dtype = ttnn_bfloat16);  arg390_1 = None
ttnn_reshape_3 = ttnn_decorators_ttnn_reshape(ttnn_from_torch_3, (1, 1, 19));  ttnn_from_torch_3 = None
ttnn_from_device_1 = ttnn_decorators_ttnn_from_device(ttnn_reshape_3);  ttnn_reshape_3 = None
ttnn_to_layout_1 = ttnn_decorators_ttnn_to_layout(ttnn_from_device_1, ttnn_ROW_MAJOR_LAYOUT);  ttnn_from_device_1 = None
ttnn_reshape_4 = ttnn_decorators_ttnn_reshape(ttnn_to_layout_1, (1, 1, 1, 19));  ttnn_to_layout_1 = None
ttnn_to_torch_1 = ttnn_decorators_ttnn_to_torch(ttnn_reshape_4);  ttnn_reshape_4 = None
expand_default_1 = torch.ops.aten.expand.default(ttnn_to_torch_1, [1, 1, 19, 19]);  ttnn_to_torch_1 = None
rsub_scalar = torch.ops.aten.rsub.Scalar(expand_default_1, 1.0);  expand_default_1 = None
_to_copy_default_2 = torch.ops.aten._to_copy.default(rsub_scalar, dtype = torch.bool)
ttnn_from_torch_4 = ttnn_decorators_ttnn_from_torch(_to_copy_default_2, layout = ttnn_ROW_MAJOR_LAYOUT, dtype = ttnn_bfloat16, device = ttnn_Specified_Device);  _to_copy_default_2 = None

Note that before the last ttnn_from_torch_4 = ttnn_decorators_ttnn_from_torch call, I think all of the tensors above are on host rather than on device.
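To make the page-size arithmetic concrete, here is a rough back-of-the-envelope check (my own illustration; the 4-byte requirement comes from the TT_FATAL messages above):

# In ROW_MAJOR layout each row is one page.
innermost_dim = 19          # last dim of the (1, 1, 19, 19) tensor
element_size = 2            # bfloat16 is 2 bytes
page_size = innermost_dim * element_size
print(page_size)            # 38, matching the page size in the error
print(page_size % 4 == 0)   # False: not aligned to sizeof(uint32_t)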

Here is a simple program to reproduce this issue:

import torch
import ttnn

def main(device):
    # A (1, 1, 19, 19) bfloat16 tensor in ROW_MAJOR layout has a 38-byte row/page
    # size, which is not aligned to sizeof(uint32_t), so creating it on device fails.
    torch_tensor = torch.rand((1, 1, 19, 19), dtype=torch.bfloat16)
    input_tensor = ttnn.from_torch(torch_tensor, layout=ttnn.ROW_MAJOR_LAYOUT, device=device)
    print(input_tensor)

if __name__ == "__main__":
    device = ttnn.open_device(device_id=0)
    try:
        main(device)
    finally:
        ttnn.close_device(device)

It gave me the error:

RuntimeError: TT_FATAL @ ../ttnn/cpp/ttnn/tensor/layout/page_config.cpp:121: (widthAlignment % page_alignment) == 0
info:
Wrong custom Tensor Layout alignment Alignment([19, 19, 19, 19]). For Row Major layout with element size 2bytes the innermost dimension must align to 2. This is because Buffer data is packed as uint32_t (4 bytes).

Hi @ayerofieiev-tt, do you know if there is a proper way to create a device tensor with an odd inner-most dim (e.g. (19, 19)) in row major layout? I think we encounter this issue in multiple models. Currently I know we can create such tensors in tile layout (see the sketch below), but in some places that might not be feasible or easy to do.
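For reference, a minimal sketch of the tile-layout workaround mentioned above, assuming converting the tensor to TILE_LAYOUT is acceptable at that point in the graph:

import torch
import ttnn

def main(device):
    torch_tensor = torch.rand((1, 1, 19, 19), dtype=torch.bfloat16)
    # TILE_LAYOUT pads the tensor to 32x32 tiles, so the odd innermost dim (19)
    # no longer determines the page size as it does in ROW_MAJOR layout.
    input_tensor = ttnn.from_torch(torch_tensor, layout=ttnn.TILE_LAYOUT, device=device)
    print(input_tensor)

if __name__ == "__main__":
    device = ttnn.open_device(device_id=0)
    try:
        main(device)
    finally:
        ttnn.close_device(device)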