tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Precision issue: Exp2 #13002

Open umadevimcw opened 1 month ago

umadevimcw commented 1 month ago

Describe the bug

PCC is dropping due to precision loss in the logaddexp2 function. While debugging, we observed that exp2 returns zero for certain inputs, whereas Torch returns small nonzero values for the same inputs. This mismatch results in the PCC drop.

To Reproduce

Steps to reproduce the behavior:

Copy and paste the code below to reproduce the precision loss. The input values are fixed for debugging purposes, to showcase the precision loss.

# SPDX-FileCopyrightText: © 2023 Tenstorrent Inc.

# SPDX-License-Identifier: Apache-2.0

from loguru import logger
import random
import pytest
import torch
import ttnn

from tests.ttnn.utils_for_testing import assert_with_pcc
from tests.ttnn.python_api_testing.sweep_tests import ttnn_ops

def run_logaddexp2_tests(input_shape, dtype, dlayout, in_mem_config, output_mem_config, data_seed, device):
    torch.manual_seed(data_seed)

    x = torch.Tensor(size=input_shape[0]).uniform_(-100, 100).to(torch.bfloat16)
    y = torch.Tensor(size=input_shape[1]).uniform_(-100, 100).to(torch.bfloat16)

    try:
        # get ref result
        x.fill_(-69.50000)
        y.fill_(-81.00000) # hard coded this for debugging purposes

        print("Exp2 results of Torch....")

        torch.set_printoptions(sci_mode=False, precision=32)
        print(torch.exp2(x))
        print(torch.exp2(y))

        tt_x = ttnn_ops.exp2(
            x,
            device=device,
            dtype=dtype,
            layout=dlayout,
            input_mem_config=in_mem_config,
            output_mem_config=output_mem_config,
        )
        tt_y = ttnn_ops.exp2(
            y,
            device=device,
            dtype=dtype,
            layout=dlayout,
            input_mem_config=in_mem_config,
            output_mem_config=output_mem_config,
        )

        # # Replicated the logic used in TT 
        # ref_value = torch.logaddexp2(x, y)
        # test_tt_logic = torch.add(torch.exp2(x), torch.exp2(y)) # here result is 0.00000000000000000000119775752698
        # test_tt_logic = torch.log2(test_tt_logic) #here output is -69.5000000

        # tt_result = ttnn_ops.logaddexp2(
        #     x,
        #     y,
        #     device=device,
        #     dtype=dtype,
        #     layout=dlayout,
        #     input_mem_config=in_mem_config,
        #     output_mem_config=output_mem_config,
        # )

    except Exception:
        # Log the failure, then re-raise with the original traceback intact
        logger.warning("Operation execution crashed")
        raise

    # assert len(tt_result.shape) == len(ref_value.shape)
    # assert tt_result.shape == ref_value.shape
    # ref value is -69.500
    # tt_result is -inf
    print("Exp2 results of TT....")
    print(tt_x)
    print(tt_y)

test_sweep_args2 = [
    (
        [(19, 12), (19, 12)],
        [ttnn.bfloat16, ttnn.bfloat16],
        [ttnn.TILE_LAYOUT, ttnn.TILE_LAYOUT],
        [ttnn.DRAM_MEMORY_CONFIG, ttnn.DRAM_MEMORY_CONFIG],
        ttnn.DRAM_MEMORY_CONFIG,
        18261510,
    ),
]

@pytest.mark.parametrize(
    "input_shape, dtype, dlayout, in_mem_config, output_mem_config, data_seed",
    test_sweep_args2,
)
def test_eltwise_logaddexp2(input_shape, dtype, dlayout, in_mem_config, output_mem_config, data_seed, device):
    run_logaddexp2_tests(input_shape, dtype, dlayout, in_mem_config, output_mem_config, data_seed, device)
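
The commented-out block in the reproduction composes logaddexp2 as log2(exp2(x) + exp2(y)). Below is a minimal sketch in plain Python (no ttnn) of why that composition gives -inf once both exp2 terms come back as zero. `flushing_exp2` and its threshold are hypothetical stand-ins for the observed device behavior, not the actual kernel. The max-shifted variant is a common rearrangement shown only for comparison, not something proposed in this thread.

```python
import math

def exp2_exact(v):
    """Reference exp2 in double precision."""
    return 2.0 ** v

def flushing_exp2(v, threshold=2.0 ** -64):
    """Hypothetical stand-in: returns 0.0 for results below an assumed threshold."""
    result = 2.0 ** v
    return result if result >= threshold else 0.0

def logaddexp2_composed(x, y, exp2=exp2_exact):
    """log2(exp2(x) + exp2(y)), the composition from the commented-out block."""
    total = exp2(x) + exp2(y)
    # math.log2(0.0) raises in Python; torch.log2 returns -inf, which we mimic here
    return math.log2(total) if total > 0.0 else float("-inf")

def logaddexp2_shifted(x, y):
    """max(x, y) + log2(1 + 2^(-|x - y|)); never evaluates exp2 of a very negative input."""
    hi = max(x, y)
    return hi + math.log2(1.0 + 2.0 ** (-abs(x - y)))

x, y = -69.5, -81.0
print(logaddexp2_composed(x, y))                      # exact exp2
print(logaddexp2_composed(x, y, exp2=flushing_exp2))  # exp2 flushed to zero
print(logaddexp2_shifted(x, y))                       # shifted form
```

With the fixed inputs from the reproduction, the exact composition stays close to -69.5, while the flushed one collapses to -inf, matching the ref value and tt_result noted in the script comments.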

Expected behavior

Screenshots

[Image: Screenshot 2024-09-23 at 7 21 52 PM]

Please complete the following environment information:

Additional context

TT's exp2 op internally depends on the exp op.
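
The thread only says that exp2 depends on exp, not how. A minimal sketch of the usual identity, exp2(x) = exp(x * ln 2), in plain Python, assuming the device op is composed along these lines (`exp2_via_exp` is an illustrative name, not a ttnn function):

```python
import math

LN2 = math.log(2.0)

def exp2_via_exp(x):
    """exp2 expressed through exp using 2^x = e^(x * ln 2)."""
    return math.exp(x * LN2)

for value in (-69.5, -81.0):
    # Print the argument that exp actually receives, and both results
    print(value, value * LN2, exp2_via_exp(value), 2.0 ** value)
```

Under that assumption, the two inputs from the reproduction reach exp as roughly -48.2 and -56.1, so a zero from exp2 would trace back to how exp handles arguments in that range.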

rtawfik01 commented 1 month ago

@ttmtrajkovic I discussed this with @umadevimcw offline. The issue is that the exp2 implementation on device does not output the value 1.1977575e-21, which is representable in float16b. PyTorch with the float16b data format does represent it, and this mismatch causes precision failures downstream.
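
A quick way to check the representability claim without a device is to emulate the format in plain Python. This sketch assumes float16b has the bfloat16 layout (1 sign bit, 8 exponent bits, 7 mantissa bits) and uses round-to-nearest-even; `to_bfloat16` is an illustrative helper that does not handle NaN or infinity:

```python
import struct

def to_bfloat16(value):
    """Round a Python float to the nearest bfloat16 value, returned as a float."""
    # Reinterpret the float32 encoding as a 32-bit unsigned integer
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Round to nearest even on the 16 bits that bfloat16 drops
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(to_bfloat16(2.0 ** -69.5))  # exp2(-69.5) rounded to bfloat16
print(to_bfloat16(2.0 ** -81.0))  # exp2(-81.0) rounded to bfloat16
print(2.0 ** -126)                # smallest normal magnitude with an 8-bit exponent
```

Both results are nonzero and sit far above the smallest normal magnitude, and the first rounds to the 1.1977575e-21 quoted above.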

@umadevimcw @eyonland please let us know the priority of this issue, and whether it is only failing unit tests or is also failing on models because of the downstream precision issue.

@ttmtrajkovic can re-assign appropriately.

eyonland commented 1 month ago

This is a P1 priority. I'm not aware of any models failing on this at the moment. The related issue is #8634