tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

[Feature Request] Improvement Needed for Unit Tests #6633

Open hschoi4448 opened 7 months ago

hschoi4448 commented 7 months ago

Is your feature request related to a problem? Please describe.

I recently reviewed the backward ops and found several bugs.

  1. https://github.com/tenstorrent-metal/tt-metal/issues/6533
  2. https://github.com/tenstorrent-metal/tt-metal/issues/6534
  3. https://github.com/tenstorrent-metal/tt-metal/issues/6536
  4. https://github.com/tenstorrent-metal/tt-metal/issues/6583
  5. https://github.com/tenstorrent-metal/tt-metal/issues/6589
  6. https://github.com/tenstorrent-metal/tt-metal/issues/6598

I believe there are two main reasons why there were so many bugs in backward ops:

  1. The compare_result function often returns a pass even when the correct value and the TT result differ significantly, so many bugs can go unnoticed. ex) https://github.com/tenstorrent-metal/tt-metal/issues/6598

  2. The input data used in unit tests does not always reflect the characteristics of the ops. For instance, in the case of relu6, the gradient formula varies depending on whether the input falls within the range of 0 to 6. Therefore, to test all intervals effectively, the input data should include values around 0 and 6 and nearby points, such as [-1, 0, 3, 6, 7]. However, input data is currently generated with torch.randn, which yields values mostly in [-1, 1] and neglects testing in the vicinity of 6 and its surrounding intervals.

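To make the second point concrete, here is a minimal sketch in plain PyTorch (not the tt-metal test harness) of exercising relu6's gradient at its breakpoints; the specific input values follow the [-1, 0, 3, 6, 7] example above:

```python
import torch

# Sketch: choose inputs straddling relu6's breakpoints at 0 and 6, where the
# gradient formula changes, instead of sampling with torch.randn (which rarely
# produces values near 6).
x = torch.tensor([-1.0, 0.0, 3.0, 6.0, 7.0], requires_grad=True)
y = torch.nn.functional.relu6(x)
y.backward(torch.ones_like(x))

# The gradient is 1 only strictly inside (0, 6) and 0 elsewhere, so every
# branch of the backward formula is exercised by this one input tensor.
print(x.grad)
```

A randn-based generator would almost never produce the 6 and 7 cases, which is exactly the gap described above.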

I didn't run all unit tests during the review, and only checked the suspicious parts, so I believe there are actually more bugs. Improving unit tests seems to be a high priority to address recurring issues and find hidden bugs.

Describe the solution you'd like

  1. Regarding the first issue, I'm not sure what the best solution would be.
  2. For the second issue, a method is needed to input specific values related to the characteristics of the op as test data.



razorback3 commented 7 months ago

I think compare_result should be changed to require both the PCC check and the allclose check to pass. (Right now, tests pass when either one of the checks passes.) Moreover, compare_result should allow users to set atol and rtol.
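A minimal sketch of what such a combined check could look like, in plain PyTorch; the function name, thresholds, and defaults are illustrative assumptions, not the actual tt-metal helper:

```python
import torch

def compare_result(tt_out: torch.Tensor, golden: torch.Tensor,
                   pcc_threshold: float = 0.99,
                   atol: float = 1e-2, rtol: float = 1e-2) -> bool:
    """Require BOTH the PCC check AND an allclose check to pass,
    with user-settable atol/rtol (sketch only)."""
    a = tt_out.flatten().float()
    b = golden.flatten().float()
    # Pearson correlation coefficient between the two flattened tensors.
    # (Degenerate for constant or NaN tensors, which would need special cases.)
    pcc = torch.corrcoef(torch.stack([a, b]))[0, 1].item()
    return pcc >= pcc_threshold and torch.allclose(a, b, rtol=rtol, atol=atol)

print(compare_result(torch.tensor([1.0, 2.0, 3.0]), torch.tensor([1.0, 2.0, 3.0])))
print(compare_result(torch.tensor([1.0, 2.0, 3.0]), torch.tensor([1.0, 2.0, 10.0])))
```

Requiring both checks means a high correlation can no longer mask large absolute errors, and vice versa.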

jliangTT commented 7 months ago

Yes, @razorback3. I will follow up and get back to you with a response. This is really not good.

umadevimcw commented 7 months ago

@razorback3 @jliangTT @jvasilje We have updated the datagen and comparison functions in PR #6679 for debugging. In parallel, we are updating the test files with reference to this PR. Once that PR is merged into main, we will submit the changes for all the ops in a separate PR.

umadevimcw commented 7 months ago

@razorback3 @jliangTT @jvasilje @hschoi4448 We are also observing the following scenario (not sure whether this is expected behavior).

To handle this, we have to add a separate condition in the logic. We are looking into this kind of issue as well.

Scenario 1: (screenshot)

Scenario 2: (screenshot)

Also, a few ops depend on this issue as well: https://github.com/tenstorrent-metal/tt-metal/issues/6676

jliangTT commented 7 months ago

Status:

I think we can downgrade this to p1.

razorback3 commented 7 months ago

@jliangTT Would you give me and @hschoi4448 access to the Google doc?

jaehoon.jung@moreh.io hyungsuk.choi@moreh.io

jliangTT commented 7 months ago

please see this doc - https://docs.google.com/spreadsheets/d/1VV-EwGJn1EgBN3jX3tg4TcX_yDkm5HAO/edit#gid=66577367

(sorry, I had to make a copy because I couldn't get around the access control)

jliangTT commented 7 months ago

Will close this one for now; we can track the issue there. Please re-open if you have any concerns.

razorback3 commented 7 months ago

OK. @hschoi4448 will double-check the result when he comes back from his vacation.

hschoi4448 commented 7 months ago

https://github.com/tenstorrent/tt-metal/issues/6583 still has a problem.

  1. Test code

```python
# SPDX-FileCopyrightText: © 2023 Tenstorrent Inc.

# SPDX-License-Identifier: Apache-2.0

import torch
import pytest
import tt_lib
from tests.tt_eager.python_api_testing.unit_testing.backward_ops.utility_funcs import data_gen_pt_tt, compare_results


def data_gen_pt_tt(input_shapes, device, required_grad=False, val=1):
    pt_tensor = (torch.ones(input_shapes, requires_grad=required_grad) * val).bfloat16()
    tt_tensor = (
        tt_lib.tensor.Tensor(pt_tensor, tt_lib.tensor.DataType.BFLOAT16).to(tt_lib.tensor.Layout.TILE).to(device)
    )
    return pt_tensor, tt_tensor


@pytest.mark.parametrize(
    "input_shapes",
    ((torch.Size([1, 1, 32, 32])),),
)
def test_bw_acosh(input_shapes, device):
    in_data, input_tensor = data_gen_pt_tt(input_shapes, device, True, val=0.5)
    grad_data, grad_tensor = data_gen_pt_tt(input_shapes, device, False, val=1)

    print("input_tensor", input_tensor)
    print("grad_tensor", grad_tensor)

    pyt_y = torch.acosh(in_data)

    tt_output_tensor_on_device = tt_lib.tensor.acosh_bw(grad_tensor, input_tensor)

    in_data.retain_grad()

    pyt_y.backward(gradient=grad_data)

    golden_tensor = [in_data.grad]

    comp_pass = compare_results(tt_output_tensor_on_device, golden_tensor)

    print("tt_output_tensor_on_device", tt_output_tensor_on_device)
    print("golden_tensor", golden_tensor)
    assert comp_pass
```

  2. Output

```
input_tensor ttnn.Tensor([[[[ 0.50000, 0.50000, ..., 0.50000, 0.50000],
    [ 0.50000, 0.50000, ..., 0.50000, 0.50000],
    ...,
    [ 0.50000, 0.50000, ..., 0.50000, 0.50000],
    [ 0.50000, 0.50000, ..., 0.50000, 0.50000]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)
grad_tensor ttnn.Tensor([[[[ 1.00000, 1.00000, ..., 1.00000, 1.00000],
    [ 1.00000, 1.00000, ..., 1.00000, 1.00000],
    ...,
    [ 1.00000, 1.00000, ..., 1.00000, 1.00000],
    [ 1.00000, 1.00000, ..., 1.00000, 1.00000]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)
2024-04-16 01:45:38.796 | ERROR | tests.tt_eager.python_api_testing.sweep_tests.comparison_funcs:get_pcc:32 - One tensor is all nan, the other is not.
2024-04-16 01:45:38.796 | ERROR | tests.tt_eager.python_api_testing.sweep_tests.comparison_funcs:get_pcc:32 - One tensor is all nan, the other is not.
2024-04-16 01:45:38.797 | DEBUG | tests.tt_eager.python_api_testing.unit_testing.backward_ops.utility_funcs:compare_results:62 - False
2024-04-16 01:45:38.797 | DEBUG | tests.tt_eager.python_api_testing.unit_testing.backward_ops.utility_funcs:compare_results:63 - False
2024-04-16 01:45:38.797 | DEBUG | tests.tt_eager.python_api_testing.unit_testing.backward_ops.utility_funcs:compare_results:64 - Max ATOL Delta: nan, Max RTOL Delta: nan, PCC: 0.0, PCC check failed
tt_output_tensor_on_device [ttnn.Tensor([[[[inf, inf, ..., inf, inf],
    [inf, inf, ..., inf, inf],
    ...,
    [inf, inf, ..., inf, inf],
    [inf, inf, ..., inf, inf]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)]
golden_tensor [tensor([[[[nan, nan, nan, ..., nan, nan, nan],
    [nan, nan, nan, ..., nan, nan, nan],
    ...,
    [nan, nan, nan, ..., nan, nan, nan],
    [nan, nan, nan, ..., nan, nan, nan]]]], dtype=torch.bfloat16)]
```
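For reference, the golden NaN in that output can be reproduced in plain PyTorch: acosh's gradient is 1/sqrt(x² - 1), which is only real for x > 1, so at the test input 0.5 the golden gradient is NaN while the device returns Inf. A small sketch:

```python
import torch

# acosh is only defined for x >= 1, and its gradient 1/sqrt(x^2 - 1) is only
# real for x > 1. At the test input 0.5 both the forward value and the
# gradient are NaN in PyTorch, whereas the device output above is Inf.
x = torch.tensor([0.5], requires_grad=True)
y = torch.acosh(x)
y.backward(torch.ones_like(x))
print(y)       # NaN: forward undefined below 1
print(x.grad)  # NaN: gradient undefined below 1
```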

hschoi4448 commented 7 months ago

Could you please check if the issues below are still problematic?

umadevimcw commented 7 months ago

@hschoi4448 All of the issues tagged above have hardware and performance limitations, where handling/storing NaN/Inf is the problem.

If you check each issue, we have added our observations, and for a few ops we have raised a PR with a fix.

The PR is not merged yet due to pending approval from the code owner.

Please find @rtawfik01's comment below:

These checks also affect performance; are the users alright with a performance hit?

Also, for the tan op we can't support a range beyond -1.45 to 1.45. Above this we would have to do reduction operations with modulo operations, which are not available.

hschoi4448 commented 7 months ago

> All of the issues tagged above have hardware and performance limitations, where handling/storing NaN/Inf is the problem.
>
> * In WHB0, storing NaN is not possible
> * Storing NaN/Inf in a ckernel op is related to a performance problem
>
> If you check each issue, we have added our observations, and for a few ops we have raised a PR with a fix.
>
> The PR is not merged yet due to pending approval from the code owner.
>
> Please find @rtawfik01's comment below:
>
> These checks also affect performance; are the users alright with a performance hit?
>
> Also, for the tan op we can't support a range beyond -1.45 to 1.45. Above this we would have to do reduction operations with modulo operations, which are not available.

Understood. If it's a hardware issue with limitations on performance and functionality, it's not something I can decide on, so I'll pass it on to my team. @razorback3
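For readers unfamiliar with the tan limitation mentioned above, here is a rough illustration of the range-reduction step in plain PyTorch (the function name is illustrative; the point is that the fold requires a round/modulo-style op, which the comment says the hardware lacks):

```python
import math
import torch

# tan has period pi, so an input far from zero can be folded back into
# [-pi/2, pi/2] before applying a narrow-range approximation:
#     x_reduced = x - pi * round(x / pi)
# Without a modulo/round op on the device, this fold is unavailable, which
# is why the supported input range stays limited to roughly [-1.45, 1.45].
def reduced_tan(x: torch.Tensor) -> torch.Tensor:
    x_reduced = x - math.pi * torch.round(x / math.pi)
    return torch.tan(x_reduced)

x = torch.tensor([0.5, 4.0, -7.0])
print(torch.allclose(reduced_tan(x), torch.tan(x), atol=1e-4))
```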

VirdhatchaniKN commented 3 weeks ago

Updates: (screenshot)

VirdhatchaniKN commented 2 weeks ago

Updates: (screenshot)