tenstorrent / pytorch2.0_ttnn

⭐️ TTNN Compiler for PyTorch 2.0 ⭐️ It enables running PyTorch 2.0 models on Tenstorrent hardware
https://tenstorrent.github.io/tt-metal/latest/ttnn/

ttnn.add failed with Inputs[1].shape = [1, 12[32], 768] #358

Closed · swimdi closed this 3 days ago

swimdi commented 2 weeks ago

While debugging the albert-base-v2-classification model test, I got this error message:

self = FastOperation(python_fully_qualified_name='ttnn.add', function=<ttnn._ttnn.operations.binary.add_t object at 0x7f0a5c8...<function default_postprocess_golden_function_outputs at 0x7f0a5c35bd30>, is_cpp_operation=True, is_experimental=False)
function_args = (ttnn.Tensor([[[ 0.74219,  0.07324,  ...,  1.32812,  0.25391],
              [ 0.57031,  0.38086,  ...,  0.37695, -0.1..., ttnn.Tensor(<buffer is not allocated>, shape=Shape([1, 12[32], 768]), dtype=DataType::BFLOAT16, layout=Layout::TILE))
function_kwargs = {}

    def __call__(self, *function_args, **function_kwargs):
>       return self.function(*function_args, **function_kwargs)
E       RuntimeError: TT_THROW @ /tmp/build-via-sdist-c9nw8bov/metal_libs-0.53.0rc16+wormhole.b0/ttnn/cpp/ttnn/tensor/tensor.hpp:263: tt::exception

It failed at ttnn.add with Inputs[1].shape = [1, 12[32], 768] (see the traceback above).

I'm not sure why the second input's shape has the 12[32] value.
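My best guess is that 12[32] is TTNN's padded-shape notation: TILE layout pads the last two dims up to multiples of 32, so the logical 12 would be stored as 32 rows. A quick standalone sketch to check this (assuming the usual ttnn.open_device / ttnn.from_torch API; not taken from the test):

    import torch
    import ttnn

    # Create a [1, 12, 768] bfloat16 tensor in TILE layout and print its shape.
    # If the guess is right, the 12 should print with its padded size, e.g. 12[32].
    device = ttnn.open_device(device_id=0)
    t = ttnn.from_torch(
        torch.rand(1, 12, 768, dtype=torch.bfloat16),
        dtype=ttnn.bfloat16,
        layout=ttnn.TILE_LAYOUT,
        device=device,
    )
    print(t.shape)  # expected: something like Shape([1, 12[32], 768])
    ttnn.close_device(device)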

And if I block the lowering of this op so that it stays as aten.add.Tensor instead of ttnn.add, then this test passes.

The steps to reproduce are:

  1. Remove this line in to_tt_guard.py
    aten_add_Tensor_blocklist += [["Tensor<[1, 12, 768]> self = ?", "Tensor<[1, 12, 768]> other = ?"]]
  2. pytest tests/models/albert/test_albert_token_classification.py

After this issue is resolved, please also remove the related blocklist in to_tt_guard.py.

swimdi commented 2 weeks ago

I'm now trying to write a simpler pattern test that can reproduce this error.
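For reference, one possible shape for such a test, compiled through this repo's backend (this assumes the torch_ttnn.backend / TorchTtnnOption entry points from the README; the reshape in front of the add is only a guess at the surrounding pattern and may not be enough to trigger the failure):

    import torch
    import ttnn
    import torch_ttnn

    class AddPattern(torch.nn.Module):
        def forward(self, x, y):
            # Guess: a reshape producing [1, 12, 768] that feeds an add,
            # similar to what the albert IR seems to do around the failing op.
            x = x.reshape(1, 12, 768)
            return x + y

    device = ttnn.open_device(device_id=0)
    option = torch_ttnn.TorchTtnnOption(device=device)
    m = torch.compile(AddPattern(), backend=torch_ttnn.backend, options=option)
    out = m(torch.rand(12, 768, dtype=torch.bfloat16),
            torch.rand(1, 12, 768, dtype=torch.bfloat16))
    ttnn.close_device(device)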

ayerofieiev-tt commented 2 weeks ago

The issue here looks like it's related to the ttnn.Tensor(<buffer is not allocated>, ...) argument.

jerrysky3 commented 2 weeks ago

It might be due to the first operand of ttnn.add being a host tensor instead of a device tensor (while the second operand is a device tensor).

I found that the model IR contains the following chain:

ttnn_from_device_3 = ttnn_decorators_ttnn_from_device(ttnn_reshape_3)
ttnn_to_layout_24 = ttnn_decorators_ttnn_to_layout(ttnn_from_device_3, ttnn_ROW_MAJOR_LAYOUT)
ttnn_reshape_8 = ttnn_decorators_ttnn_reshape(ttnn_to_layout_24, (12, 768))
...
ttnn_add_8 = ttnn_decorators_ttnn_add(ttnn_to_layout_24, clone_3)
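
If the first operand really is a host tensor, a standalone call with one host operand and one device operand should hit the same throw. A hypothetical minimal repro (not taken from the model; shapes copied from the issue title):

    import torch
    import ttnn

    device = ttnn.open_device(device_id=0)
    # First operand left on host (no device= argument), second placed on device,
    # mirroring the suspected split between ttnn_to_layout_24 and clone_3 above.
    a_host = ttnn.from_torch(torch.rand(1, 12, 768, dtype=torch.bfloat16),
                             dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT)
    b_dev = ttnn.from_torch(torch.rand(1, 12, 768, dtype=torch.bfloat16),
                            dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
    out = ttnn.add(a_host, b_dev)  # expected to raise a similar tt::exception
    ttnn.close_device(device)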

ttnn_reshape_3 is moved to a host tensor because of the current constraint on ttnn.reshape. This is done by the logic in add_data_move_pass.py:

https://github.com/tenstorrent/pytorch2.0_ttnn/blob/29a728d864458cc3455d2fed5d61de5d32fae36f/torch_ttnn/passes/lowering/add_data_move_pass.py#L320-L341

Then I think ttnn_reshape_3 is mapped to the host tensor by the code below, so the following ops also reference the host tensor, including the ttnn.add in this issue:

https://github.com/tenstorrent/pytorch2.0_ttnn/blob/29a728d864458cc3455d2fed5d61de5d32fae36f/torch_ttnn/passes/lowering/add_data_move_pass.py#L468-L470

There seems to be a function, try_add_layout_change_after_node, that should move the tensor back to the device, so I haven't fully understood why this issue happens.
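If it did kick in here, I'd expect the chain to be rewritten into something like the following before the add (a sketch only, not actual compiler output; the _fix names are placeholders):

    ttnn_to_layout_fix = ttnn_decorators_ttnn_to_layout(ttnn_to_layout_24, ttnn_TILE_LAYOUT)
    ttnn_to_device_fix = ttnn_decorators_ttnn_to_device(ttnn_to_layout_fix, device = ttnn_Specified_Device)
    ttnn_add_8 = ttnn_decorators_ttnn_add(ttnn_to_device_fix, clone_3)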

jerrysky3 commented 2 weeks ago

We ran into a similar issue with the bloom model:

        ttnn_from_torch_4 = ttnn_decorators_ttnn_from_torch(arg295_1, layout = ttnn_ROW_MAJOR_LAYOUT, dtype = ttnn_bfloat16, device = ttnn_Specified_Device)
        ttnn_from_device = ttnn_decorators_ttnn_from_device(ttnn_from_torch_4);  ttnn_from_torch_4 = None
        ttnn_to_layout_1 = ttnn_decorators_ttnn_to_layout(ttnn_from_device, ttnn_ROW_MAJOR_LAYOUT);  ttnn_from_device = None
        ttnn_reshape = ttnn_decorators_ttnn_reshape(ttnn_to_layout_1, (1, 32, 1, 1))
        ttnn_to_layout_2 = ttnn_decorators_ttnn_to_layout(ttnn_reshape, ttnn_TILE_LAYOUT);  ttnn_reshape = None
        ttnn_to_device = ttnn_decorators_ttnn_to_device(ttnn_to_layout_2, device = ttnn_Specified_Device);  ttnn_to_layout_2 = None
        ttnn_moreh_cumsum = ttnn_decorators_ttnn_moreh_cumsum(ttnn_to_device, 1);  ttnn_to_device = None
        ttnn_from_device_1 = ttnn_decorators_ttnn_from_device(ttnn_moreh_cumsum);  ttnn_moreh_cumsum = None
        ttnn_to_layout_3 = ttnn_decorators_ttnn_to_layout(ttnn_from_device_1, ttnn_ROW_MAJOR_LAYOUT);  ttnn_from_device_1 = None
        ttnn_reshape_1 = ttnn_decorators_ttnn_reshape(ttnn_to_layout_3, (1, 32));  ttnn_to_layout_3 = None
        ttnn_to_layout_4 = ttnn_decorators_ttnn_to_layout(ttnn_reshape_1, ttnn_TILE_LAYOUT);  ttnn_reshape_1 = None
        ttnn_to_device_1 = ttnn_decorators_ttnn_to_device(ttnn_to_layout_4, device = ttnn_Specified_Device);  ttnn_to_layout_4 = None
        ttnn_subtract = ttnn_decorators_ttnn_subtract(ttnn_to_device_1, 1);  ttnn_to_device_1 = None
>       ttnn_multiply = ttnn_decorators_ttnn_multiply(ttnn_subtract, ttnn_to_layout_1);  ttnn_subtract = ttnn_to_layout_1 = None

<eval_with_key>.20:28: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = FastOperation(python_fully_qualified_name='ttnn.multiply', function=<ttnn._ttnn.operations.binary.multiply_t object at...<function default_postprocess_golden_function_outputs at 0x7f92f9310940>, is_cpp_operation=True, is_experimental=False)
function_args = (<[RuntimeError("TT_FATAL @ /tmp/build-via-sdist-n2_feo7u/metal_libs-0.53.0rc28+wormhole.b0/tt_metal/impl/device/devic...00000,  0.00000,  ...,  1.00000,  1.00000]], shape=Shape([1, 32]), dtype=DataType::BFLOAT16, layout=Layout::ROW_MAJOR))
function_kwargs = {}

    def __call__(self, *function_args, **function_kwargs):
>       return self.function(*function_args, **function_kwargs)
E       RuntimeError: TT_THROW @ /tmp/build-via-sdist-n2_feo7u/metal_libs-0.53.0rc28+wormhole.b0/ttnn/cpp/ttnn/tensor/tensor.hpp:250: tt::exception
E       info:
E       Cannot get the device from a tensor with host storage
E       backtrace:
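
It looks like the same pattern as the albert case: ttnn_to_layout_1 comes straight from ttnn_from_device, so it still has host storage when it reaches ttnn.multiply, while ttnn_subtract lives on the device. A hypothetical standalone check (this assumes ttnn.is_tensor_storage_on_device is available; not taken from the model):

    import torch
    import ttnn

    device = ttnn.open_device(device_id=0)
    # Mimic ttnn_to_layout_1 (host, row-major) and the device-side chain feeding ttnn_subtract.
    host_rm = ttnn.from_torch(torch.ones(1, 32, dtype=torch.bfloat16),
                              dtype=ttnn.bfloat16, layout=ttnn.ROW_MAJOR_LAYOUT)
    dev_tile = ttnn.from_torch(torch.ones(1, 32, dtype=torch.bfloat16),
                               dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
    print(ttnn.is_tensor_storage_on_device(host_rm))   # expected: False
    print(ttnn.is_tensor_storage_on_device(dev_tile))  # expected: True
    # Expected to raise "Cannot get the device from a tensor with host storage"
    ttnn.multiply(ttnn.subtract(dev_tile, 1), host_rm)
    ttnn.close_device(device)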
swimdi commented 3 days ago

Resolved in #392.