hschoi4448 opened 7 months ago
I think `compare_result` should be changed so that a test passes only when both the PCC check and the allclose check pass. (Right now, I see that tests pass as soon as either one of the checks passes.) In addition, users should be able to set `atol` and `rtol` for `compare_result`.
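A minimal sketch of the proposed behavior, assuming both tensors have already been converted to torch tensors on the host; the helper name and the PCC computation via `torch.corrcoef` are illustrative stand-ins for the repo's existing utilities, not its actual API:

```python
import torch


def compare_results_strict(tt_result, golden, pcc_threshold=0.99, atol=1e-2, rtol=1e-2):
    """Pass only when BOTH the PCC check and the allclose check pass (sketch)."""
    a = tt_result.flatten().to(torch.float32)
    b = golden.flatten().to(torch.float32)

    # Pearson correlation coefficient between the flattened result and golden.
    pcc = torch.corrcoef(torch.stack([a, b]))[0, 1].item()
    pcc_ok = pcc >= pcc_threshold

    # Element-wise tolerance check with user-settable atol/rtol.
    allclose_ok = torch.allclose(a, b, atol=atol, rtol=rtol)

    # Require both checks instead of passing when either one succeeds.
    return pcc_ok and allclose_ok
```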
Yes, @razorback3. I will follow up and get back to you with a response. This is really not good.
@razorback3 @jliangTT @jvasilje We have updated the datagen and comparison functions in PR #6679 for debugging. We are working in parallel on updating the test files with reference to this PR. Once the PR is merged into main, we will submit the changes for all the ops in a separate PR.
@razorback3 @jliangTT @jvasilje @hschoi4448 We are also observing the following scenario (not sure whether this is expected behavior): `nan` is expected as a result. To handle this we have to add the condition separately in the comparison logic. We are checking this kind of issue as well. (A sketch of such an explicit NaN condition follows the scenarios below.)
Scenario 1:
Scenario 2:
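A minimal host-side sketch of what an explicit NaN condition in the comparison logic could look like; the function name is illustrative, not the repo's actual helper:

```python
import torch


def allclose_with_nan(tt_result, golden, atol=1e-2, rtol=1e-2):
    """Treat positions where BOTH tensors are NaN as matching; any mismatch in
    NaN placement (e.g. NaN vs. Inf) fails the check (sketch)."""
    a = tt_result.to(torch.float32)
    b = golden.to(torch.float32)

    # NaNs must occur in exactly the same positions in both tensors.
    if not torch.equal(torch.isnan(a), torch.isnan(b)):
        return False

    # Compare the remaining positions with the usual tolerance check.
    mask = ~torch.isnan(a)
    return torch.allclose(a[mask], b[mask], atol=atol, rtol=rtol)
```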
Also, a few ops depend on this issue as well: https://github.com/tenstorrent-metal/tt-metal/issues/6676
Status:
I think we can downgrade this to p1.
@jliangTT Would you grant Google Docs access to me and @hschoi4448?
jaehoon.jung@moreh.io hyungsuk.choi@moreh.io
please see this doc - https://docs.google.com/spreadsheets/d/1VV-EwGJn1EgBN3jX3tg4TcX_yDkm5HAO/edit#gid=66577367
(sorry, I had to make a copy because I could not get around the access control)
Will close this one for now; we can keep tracking the issue. Please re-open if you have any concerns.
OK. @hschoi4448 will double-check the result when he comes back from his vacation.
https://github.com/tenstorrent/tt-metal/issues/6583 still has a problem.
```python
# SPDX-FileCopyrightText: © 2023 Tenstorrent Inc.

import torch
import pytest
import tt_lib
from tests.tt_eager.python_api_testing.unit_testing.backward_ops.utility_funcs import data_gen_pt_tt, compare_results


def data_gen_pt_tt(input_shapes, device, required_grad=False, val=1):
    pt_tensor = (torch.ones(input_shapes, requires_grad=required_grad) * val).bfloat16()
    tt_tensor = (
        tt_lib.tensor.Tensor(pt_tensor, tt_lib.tensor.DataType.BFLOAT16).to(tt_lib.tensor.Layout.TILE).to(device)
    )
    return pt_tensor, tt_tensor


@pytest.mark.parametrize(
    "input_shapes",
    ((torch.Size([1, 1, 32, 32])),),
)
def test_bw_acosh(input_shapes, device):
    # val=0.5 is outside acosh's domain [1, inf), so the torch golden gradient is all nan.
    in_data, input_tensor = data_gen_pt_tt(input_shapes, device, True, val=0.5)
    grad_data, grad_tensor = data_gen_pt_tt(input_shapes, device, False, val=1)

    print("input_tensor", input_tensor)
    print("grad_tensor", grad_tensor)

    pyt_y = torch.acosh(in_data)

    tt_output_tensor_on_device = tt_lib.tensor.acosh_bw(grad_tensor, input_tensor)

    in_data.retain_grad()
    pyt_y.backward(gradient=grad_data)

    golden_tensor = [in_data.grad]

    # Comparison fails: the TT result is all inf while the golden is all nan (see output below).
    comp_pass = compare_results(tt_output_tensor_on_device, golden_tensor)

    print("tt_output_tensor_on_device", tt_output_tensor_on_device)
    print("golden_tensor", golden_tensor)
    assert comp_pass
```
2. output
```
input_tensor ttnn.Tensor([[[[ 0.50000,  0.50000,  ...,  0.50000,  0.50000],
    [ 0.50000,  0.50000,  ...,  0.50000,  0.50000],
    ...,
    [ 0.50000,  0.50000,  ...,  0.50000,  0.50000],
    [ 0.50000,  0.50000,  ...,  0.50000,  0.50000]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)
grad_tensor ttnn.Tensor([[[[ 1.00000,  1.00000,  ...,  1.00000,  1.00000],
    [ 1.00000,  1.00000,  ...,  1.00000,  1.00000],
    ...,
    [ 1.00000,  1.00000,  ...,  1.00000,  1.00000],
    [ 1.00000,  1.00000,  ...,  1.00000,  1.00000]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)
2024-04-16 01:45:38.796 | ERROR | tests.tt_eager.python_api_testing.sweep_tests.comparison_funcs:get_pcc:32 - One tensor is all nan, the other is not.
2024-04-16 01:45:38.796 | ERROR | tests.tt_eager.python_api_testing.sweep_tests.comparison_funcs:get_pcc:32 - One tensor is all nan, the other is not.
2024-04-16 01:45:38.797 | DEBUG | tests.tt_eager.python_api_testing.unit_testing.backward_ops.utility_funcs:compare_results:62 - False
2024-04-16 01:45:38.797 | DEBUG | tests.tt_eager.python_api_testing.unit_testing.backward_ops.utility_funcs:compare_results:63 - False
2024-04-16 01:45:38.797 | DEBUG | tests.tt_eager.python_api_testing.unit_testing.backward_ops.utility_funcs:compare_results:64 - Max ATOL Delta: nan, Max RTOL Delta: nan, PCC: 0.0, PCC check failed
tt_output_tensor_on_device [ttnn.Tensor([[[[inf, inf, ..., inf, inf],
    [inf, inf, ..., inf, inf],
    ...,
    [inf, inf, ..., inf, inf],
    [inf, inf, ..., inf, inf]]]], shape=Shape([1, 1, 32, 32]), dtype=DataType::BFLOAT16, layout=Layout::TILE)]
golden_tensor [tensor([[[[nan, nan, nan, ..., nan, nan, nan],
    [nan, nan, nan, ..., nan, nan, nan],
    [nan, nan, nan, ..., nan, nan, nan],
    ...,
    [nan, nan, nan, ..., nan, nan, nan],
    [nan, nan, nan, ..., nan, nan, nan],
    [nan, nan, nan, ..., nan, nan, nan]]]], dtype=torch.bfloat16)]
```
Could you please check if the issues below are still problematic?
@hschoi4448 All of the above tagged issues have hardware and performance limitations where handling/storing NaN/Inf is the problem:

* In WHB0, storing NaN is not possible
* Storing NaN/Inf in a ckernel op comes with a performance problem

If you check each issue, we have added our observations, and for a few ops we have even raised a PR with a fix. The PR is not merged yet due to pending approval from the code owner.

Please find @rtawfik01's comment below:

These checks also affect performance; are the users alright with a performance hit?

Also, for the tan op we cannot support a wider range than -1.45 to 1.45. To go beyond this we would have to do reduction operations with modulo operations, which are not available.
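For illustration only, this is the kind of modulo-based range reduction being referred to, written host-side in torch; the device kernels would need an equivalent of the `torch.remainder` step, which is what is reported as unavailable:

```python
import math
import torch


def tan_with_range_reduction(x):
    # tan has period pi, so fold x into [-pi/2, pi/2) before evaluating.
    # The torch.remainder call is the modulo step that is not available
    # in the device kernels.
    reduced = torch.remainder(x + math.pi / 2, math.pi) - math.pi / 2
    return torch.tan(reduced)
```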
Understood. If it's a hardware issue with limitations on performance and functionality, it seems that it's not something I can decide on, so I'll pass it on to my team. @razorback3
Updates:
Is your feature request related to a problem? Please describe.
I recently reviewed the backward ops and found several bugs.
I believe there are two main reasons why there were so many bugs in backward ops:
1. The `compare_result` function often returns a pass even when the correct value and the TT result differ significantly. This may result in many bugs going unnoticed.
   ex) https://github.com/tenstorrent-metal/tt-metal/issues/6598
2. The input data used in unit tests does not always reflect the characteristics of the ops. For instance, in the case of relu6, the gradient formula varies depending on whether the input falls within the range of 0 to 6. Therefore, to test all intervals effectively, the input data should include values around 0, 6, and nearby points, such as [-1, 0, 3, 6, 7]. However, input data is currently generated using `torch.randn`, which results in values mostly within [-1, 1], neglecting testing in the vicinity of 6 and its surrounding intervals (a sketch of a boundary-aware data generator follows this list).
   ex)
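A hedged sketch of the kind of data generator point 2 suggests, following the shape of `data_gen_pt_tt` above; the function name and the value-cycling scheme are illustrative, not the repo's actual API:

```python
import torch
import tt_lib


def data_gen_pt_tt_with_values(input_shapes, device, values, required_grad=False):
    """Fill the test tensor by cycling through op-specific boundary values
    (e.g. [-1, 0, 3, 6, 7] for relu6) instead of sampling torch.randn."""
    numel = torch.Size(input_shapes).numel()
    vals = torch.tensor(values, dtype=torch.float32)
    pt_tensor = vals.repeat(numel // len(values) + 1)[:numel].reshape(input_shapes)
    pt_tensor = pt_tensor.bfloat16().requires_grad_(required_grad)
    tt_tensor = (
        tt_lib.tensor.Tensor(pt_tensor, tt_lib.tensor.DataType.BFLOAT16)
        .to(tt_lib.tensor.Layout.TILE)
        .to(device)
    )
    return pt_tensor, tt_tensor
```

For relu6_bw, calling this with values=[-1, 0, 3, 6, 7] would exercise every branch of the gradient formula.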
I didn't run all unit tests during the review, and only checked the suspicious parts, so I believe there are actually more bugs. Improving unit tests seems to be a high priority to address recurring issues and find hidden bugs.