tenstorrent / tt-metal

:metal: TT-NN operator library and TT-Metalium low-level kernel programming model.

[Bug Report] invalid acosh backward result #6583

Open hschoi4448 opened 6 months ago

hschoi4448 commented 6 months ago

Describe the bug

The acosh_bw function returns an invalid gradient value.

To Reproduce

Steps to reproduce the behavior:

  1. Copy and paste the code below into `./tests/tt_eager/python_api_testing/unit_testing/backward_ops/test_backward_acosh.py`:

```Python
# SPDX-FileCopyrightText: © 2023 Tenstorrent Inc.

# SPDX-License-Identifier: Apache-2.0

import torch
import pytest
import tt_lib
from tests.tt_eager.python_api_testing.unit_testing.backward_ops.utility_funcs import data_gen_pt_tt, compare_results


# Local redefinition shadows the imported helper so the snippet is self-contained.
def data_gen_pt_tt(input_shapes, device, required_grad=False, val=1):
    pt_tensor = (torch.ones(input_shapes, requires_grad=required_grad) * val).bfloat16()
    tt_tensor = (
        tt_lib.tensor.Tensor(pt_tensor, tt_lib.tensor.DataType.BFLOAT16).to(tt_lib.tensor.Layout.TILE).to(device)
    )
    return pt_tensor, tt_tensor


@pytest.mark.parametrize(
    "input_shapes",
    ((torch.Size([1, 1, 32, 32])),),
)
def test_bw_acosh(input_shapes, device):
    in_data, input_tensor = data_gen_pt_tt(input_shapes, device, True, val=0.5)
    grad_data, grad_tensor = data_gen_pt_tt(input_shapes, device, False, val=1)

    print("input_tensor", input_tensor)
    print("grad_tensor", grad_tensor)

    pyt_y = torch.acosh(in_data)

    tt_output_tensor_on_device = tt_lib.tensor.acosh_bw(grad_tensor, input_tensor)

    in_data.retain_grad()

    pyt_y.backward(gradient=grad_data)

    golden_tensor = [in_data.grad]

    comp_pass = compare_results(tt_output_tensor_on_device, golden_tensor)

    print("tt_output_tensor_on_device", tt_output_tensor_on_device)
    print("golden_tensor", golden_tensor)
    assert comp_pass
```
  2. Run `pytest ./tests/tt_eager/python_api_testing/unit_testing/backward_ops/test_backward_acosh.py`

Observed output:

```Python
input_tensor ttnn.Tensor([[[[ 0.50000,  0.50000,  ...,  0.50000,  0.50000],
               [ 0.50000,  0.50000,  ...,  0.50000,  0.50000],
               ...,
               [ 0.50000,  0.50000,  ...,  0.50000,  0.50000],
               [ 0.50000,  0.50000,  ...,  0.50000,  0.50000]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)
grad_tensor ttnn.Tensor([[[[ 1.00000,  1.00000,  ...,  1.00000,  1.00000],
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000],
               ...,
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000],
               [ 1.00000,  1.00000,  ...,  1.00000,  1.00000]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)
tt_output_tensor_on_device [ttnn.Tensor([[[[inf     , inf     ,  ..., inf     , inf     ],
               [inf     , inf     ,  ..., inf     , inf     ],
               ...,
               [inf     , inf     ,  ..., inf     , inf     ],
               [inf     , inf     ,  ..., inf     , inf     ]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)]
golden_tensor [tensor([[[[nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan]]]], dtype=torch.bfloat16)]
```

Expected behavior

I want `acosh_bw` to return the correct gradient. Since d/dx acosh(x) = 1/sqrt(x² − 1), the input 0.5 lies outside acosh's domain (x ≥ 1), so the golden PyTorch gradient is `nan`; the device instead returns `inf`.
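For reference, the golden values can be reproduced with plain PyTorch, independent of `tt_lib` (a minimal sketch of the same computation as the test above):

```Python
import torch

# d/dx acosh(x) = 1 / sqrt(x^2 - 1); for x = 0.5 the term under the
# square root is negative, so the gradient is nan.
x = torch.full((1, 1, 32, 32), 0.5, dtype=torch.bfloat16, requires_grad=True)
y = torch.acosh(x)        # nan everywhere: 0.5 is outside acosh's domain
y.backward(gradient=torch.ones_like(y))
print(x.grad)             # tensor of nan, matching golden_tensor above
```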


umadevimcw commented 6 months ago

@hschoi4448 Without any change to the `acosh_bw` logic, I am getting the following output.

FYI, we are testing and developing on a GS (Grayskull) VM.

[image: test output screenshot]

hschoi4448 commented 6 months ago

I'm using wormhole_b0. My commit is `fcd006b981fa1b0d8a0f7395060513d4e402a7a9`, but the issue still persists.

I will also test it on Grayskull.

[image: test output screenshot]

hschoi4448 commented 6 months ago

> @hschoi4448 Without any change to the `acosh_bw` logic, I am getting the following output. FYI, we are testing and developing on a GS (Grayskull) VM.

I tested the same code again on GS, and the test passed. It seems the issue is specific to Wormhole.
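If the test needs to stay green while the Wormhole behavior is investigated, one option is an architecture-conditional skip (a hypothetical sketch; it assumes the `is_wormhole_b0` helper from `models/utility_functions.py` in this repo is importable from the test file):

```Python
import pytest

# Assumed helper; tt-metal's test tree commonly guards arch-specific tests this way.
from models.utility_functions import is_wormhole_b0

pytestmark = pytest.mark.skipif(
    is_wormhole_b0(),
    reason="#6583: WHB0 cannot represent nan here, so acosh_bw diverges from the golden",
)
```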

umadevimcw commented 5 months ago

@tt-aho @eyonland @jliangTT On WHB0 we are unable to store `nan` in a tensor, and the result above comes down to storing that `nan`. On GS the same code works fine. We need input on this.

How should we address this issue? Or can I close it, considering the hardware limitation?
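To isolate the storage question from the op itself, a host/device round-trip of a `nan`-filled tensor could be checked directly (a minimal sketch, assuming the same `tt_lib` conversion helpers used in the test above plus the usual `cpu()`/`to_torch()` readback):

```Python
import torch
import tt_lib


def nan_roundtrip(device):
    # Push an all-nan bfloat16 tensor to the device and read it back.
    pt = torch.full((1, 1, 32, 32), float("nan")).bfloat16()
    tt = (
        tt_lib.tensor.Tensor(pt, tt_lib.tensor.DataType.BFLOAT16)
        .to(tt_lib.tensor.Layout.TILE)
        .to(device)
    )
    back = tt.cpu().to(tt_lib.tensor.Layout.ROW_MAJOR).to_torch()
    # Expected True if nan survives storage (as on GS); the report above
    # suggests it does not on WHB0.
    print(torch.isnan(back).all())
```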

umadevimcw commented 3 months ago

@hschoi4448 @razorback3 Please see https://github.com/tenstorrent/tt-metal/issues/8944 and https://github.com/tenstorrent/tt-metal/issues/8945#issuecomment-2146247945; these comments address this issue.