tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
https://docs.tenstorrent.com/ttnn/latest/index.html
Apache License 2.0

[Bug Report] Invalid log result #6721

Open hschoi4448 opened 8 months ago

hschoi4448 commented 8 months ago

Describe the bug:

The log function returns an invalid value for negative inputs: instead of NaN it returns a finite value (-4.375).

To Reproduce:

1. Copy and paste the code below:

```python
# SPDX-FileCopyrightText: © 2023 Tenstorrent Inc.
# SPDX-License-Identifier: Apache-2.0

import torch
import pytest
import tt_lib
from tests.tt_eager.python_api_testing.unit_testing.backward_ops.utility_funcs import data_gen_pt_tt, compare_results

import ttnn
from tests.tt_eager.python_api_testing.sweep_tests import pytorch_ops


def data_gen_pt_tt(input_shapes, device, required_grad=False, val=1):
    pt_tensor = (torch.ones(input_shapes, requires_grad=required_grad) * val).bfloat16()
    tt_tensor = (
        tt_lib.tensor.Tensor(pt_tensor, tt_lib.tensor.DataType.BFLOAT16).to(tt_lib.tensor.Layout.TILE).to(device)
    )
    return pt_tensor, tt_tensor


@pytest.mark.parametrize(
    "input_shapes",
    ((torch.Size([1, 1, 32, 32])),),
)
def test1(input_shapes, device):
    val = -1
    in_data, input_tensor = data_gen_pt_tt(input_shapes, device, True, val=val)

    print("input_tensor", input_tensor)

    golden_tensor = pytorch_ops.log(in_data)
    tt_output_tensor_on_device = tt_lib.tensor.log(input_tensor)

    print("tt_output_tensor_on_device", tt_output_tensor_on_device)
    print("golden_tensor", golden_tensor)
```

2. Run with pytest
```Python
input_tensor ttnn.Tensor([[[[-1.00000, -1.00000,  ..., -1.00000, -1.00000],
               [-1.00000, -1.00000,  ..., -1.00000, -1.00000],
               ...,
               [-1.00000, -1.00000,  ..., -1.00000, -1.00000],
               [-1.00000, -1.00000,  ..., -1.00000, -1.00000]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)
tt_output_tensor_on_device ttnn.Tensor([[[[-4.37500, -4.37500,  ..., -4.37500, -4.37500],
               [-4.37500, -4.37500,  ..., -4.37500, -4.37500],
               ...,
               [-4.37500, -4.37500,  ..., -4.37500, -4.37500],
               [-4.37500, -4.37500,  ..., -4.37500, -4.37500]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)
golden_tensor tensor([[[[nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan]]]], dtype=torch.bfloat16,
       grad_fn=<LogBackward0>)
```

Expected behavior: tt_lib.tensor.log should return NaN for negative inputs, matching the PyTorch golden tensor above.
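A minimal check of that expectation could look like the sketch below. The helper name is hypothetical, and it assumes the device tensor can be brought back to the host with cpu()/to_torch(); the exact conversion path may differ by tt_lib/ttnn version:

```python
import torch

def assert_log_of_negative_is_nan(tt_output_tensor_on_device, golden_tensor):
    # Bring the device result back to the host as a torch tensor. The
    # cpu()/to_torch() path is an assumption; adjust to the version in use.
    tt_result = tt_output_tensor_on_device.cpu().to_torch()
    # The PyTorch golden result of log(-1) is NaN everywhere.
    assert torch.isnan(golden_tensor).all()
    # The device result should also be NaN, not a finite value such as -4.375.
    assert torch.isnan(tt_result).all()
```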

Additional context:

It seems like other log operations such as log2 and log10 may also have similar issues.
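A parametrized variant of the repro above could be used to check those variants. This is only a sketch: it assumes tt_lib.tensor.log2 and tt_lib.tensor.log10 are the corresponding element-wise ops and reuses the data_gen_pt_tt helper from the snippet above:

```python
import pytest
import torch
import tt_lib


@pytest.mark.parametrize(
    "tt_op, torch_op",
    [
        (tt_lib.tensor.log2, torch.log2),
        (tt_lib.tensor.log10, torch.log10),
    ],
)
@pytest.mark.parametrize("input_shapes", ((torch.Size([1, 1, 32, 32])),))
def test_log_variants_negative_input(input_shapes, device, tt_op, torch_op):
    # Same setup as test1 above, but exercising log2/log10 on a tensor of -1s.
    in_data, input_tensor = data_gen_pt_tt(input_shapes, device, True, val=-1)
    golden = torch_op(in_data)   # NaN everywhere for negative inputs
    tt_out = tt_op(input_tensor)
    print("golden", golden)
    print("tt_out", tt_out)
```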

Aswinmcw commented 7 months ago

Hi @hschoi4448, @eyonland, @tt-aho, @jliangTT. To fix this issue we need to add some conditions in the ckernel file. For Wormhole it is located at tt_metal/hw/ckernels/wormhole_b0/metal/llk_api/llk_sfpu/ckernel_sfpu_log.h, but it is not available for Grayskull, so those changes need to be made in the submodules. We tried adding the changes in the submodules, but on the PR creation page those changes are not shown. Can I get some suggestions regarding this?
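For reference, the special-case behavior those conditions need to reproduce, written as a plain host-side Python sketch (this is not the SFPU kernel code itself):

```python
import math

def log_reference(x: float) -> float:
    # IEEE-754-style semantics the device log op should follow:
    #   x < 0  -> NaN   (the case currently missing, giving -4.375 instead)
    #   x == 0 -> -inf
    #   x > 0  -> log(x)
    if x < 0.0:
        return float("nan")
    if x == 0.0:
        return float("-inf")
    return math.log(x)
```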

tt-aho commented 7 months ago

@Aswinmcw I'm not sure what the issue you're seeing is. Opening the link you sent, I can see your changes (screenshot attached below).

tt-aho commented 7 months ago

You would need to push/merge your changes to the appropriate submodule repos and then update the submodule hashes accordingly.

Aswinmcw commented 7 months ago

@tt-aho, I have changed 3 files in total: 1 in the tt-metal repo and 2 in submodules. On the changes page, those 2 submodule files are not showing; that is what I am unsure about.

tt-aho commented 7 months ago

Submodule changes only show the hash changes, not the files changed. You would need to click/compare the changes on the submodule repo, but I think you need to push your changes to the actual remote submodule repos first as I can't find the commits you're updating it to in them.

hschoi4448 commented 7 months ago

```
input_tensor ttnn.Tensor([[[[-1.00000, -1.00000,  ..., -1.00000, -1.00000],
               [-1.00000, -1.00000,  ..., -1.00000, -1.00000],
               ...,
               [-1.00000, -1.00000,  ..., -1.00000, -1.00000],
               [-1.00000, -1.00000,  ..., -1.00000, -1.00000]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)
tt_output_tensor_on_device ttnn.Tensor([[[[-4.37500, -4.37500,  ..., -4.37500, -4.37500],
               [-4.37500, -4.37500,  ..., -4.37500, -4.37500],
               ...,
               [-4.37500, -4.37500,  ..., -4.37500, -4.37500],
               [-4.37500, -4.37500,  ..., -4.37500, -4.37500]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)
golden_tensor tensor([[[[nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan]]]], dtype=torch.bfloat16,
       grad_fn=<LogBackward0>)
```

The output above still shows the problem. Please check again. @Aswinmcw @tt-aho @jliangTT @razorback3

hash: a059360c5f356621d37963009ccd5a8089a71432 arch: wormhole_b0

ruthreshx commented 2 months ago

Hi @rtawfik01, the issue happens because negative values haven't been handled. Since the file is in third_party, we were not able to handle this from our end; we verified the fix locally after our change.

Is there a way to make the change and raise a PR for it?

ruthreshx commented 2 months ago

Merged, hence closed!

umadevimcw commented 2 months ago

@ruthreshx Can you add a description of your observation here?

ruthreshx commented 2 months ago

Hi @umadevimcw, @eyonland, @hschoi4448, the formula is power_log = mul(log(x), scalar), with x = -1.5469 and scalar = 0.5371. With the above fix:

Step 1: log(-1.5469) => NaN
Step 2: NaN * scalar returns 255211775190703847597530955573826158592.00000, but the expected result is NaN.

I noticed one more thing: if NaN is multiplied by any value from 0.0 to 0.66, it returns the large value above.
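For comparison, the expected NaN propagation on the host side (a plain PyTorch sketch, not the device kernel):

```python
import torch

x = torch.tensor(-1.5469, dtype=torch.bfloat16)
scalar = 0.5371
# log of a negative value is NaN, and NaN times any finite scalar stays NaN.
out = torch.log(x) * scalar
print(out)  # tensor(nan, dtype=torch.bfloat16)
```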

The fix itself is correct, because log of a negative value should return NaN.

We need to create a separate ticket to handle such issues in the mul_sfpu kernel. In the meantime, we will skip the pow_fractional test until the issue is handled.

ruthreshx commented 2 months ago

Hi @hschoi4448 , @eyonland , Please find the separate ticket for handling NaN issues in mul_sfpu. Ticket: https://github.com/tenstorrent/tt-metal/issues/12776

KalaivaniMCW commented 1 month ago

Current status: https://github.com/tenstorrent/tt-metal/issues/12776. Needs input from the LLK team.