tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

ttnn.mish fails due to trisc1 build failed in GS #9877

Closed punithsekar closed 3 months ago

punithsekar commented 3 months ago

Describe the bug: ttnn.mish fails with a trisc1 build failed error.

To reproduce: run the following code snippet.

import ttnn
import torch
import pytest

@pytest.mark.parametrize("device_params", [{"l1_small_size": 32768}], indirect=True)
def test(device):
    a = torch.randn((1, 1, 102400, 32), dtype=torch.float16)
    ttnn_input_tensor = ttnn.from_torch(
        a,
        dtype=ttnn.bfloat16,
        memory_config=ttnn.L1_MEMORY_CONFIG,
        device=device,
        layout=ttnn.TILE_LAYOUT,
    )
    output = ttnn.mish(ttnn_input_tensor)

Expected behavior: execution of the operation without any issues.
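As a sanity check independent of ttnn, mish's definition — mish(x) = x * tanh(softplus(x)) — can be evaluated on the host with a minimal scalar sketch (the function names and threshold choice here are illustrative, not ttnn code):

```python
import math

def softplus(x: float, threshold: float = 20.0) -> float:
    # Numerically stable softplus: for large x, log(1 + e^x) ~= x,
    # so fall back to the identity instead of overflowing exp().
    return x if x > threshold else math.log1p(math.exp(x))

def mish(x: float) -> float:
    # mish(x) = x * tanh(softplus(x))
    return x * math.tanh(softplus(x))

print(round(mish(1.0), 4))  # ~0.8651
```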

Screenshots

    def wrapper(*args, **kwargs):
>       return function(*args, **kwargs)
E       RuntimeError: TT_THROW @ ../tt_metal/impl/program/program.cpp:35: tt::exception
E       info:
E       Failed to generate binaries for eltwise_sfpu TT_THROW @ ../tt_metal/jit_build/build.cpp:396: tt::exception
E       info:
E       trisc1 build failed
E       backtrace:
E        --- /home/ubuntu/punith/tt-metal/build/lib/libtt_metal.so(+0xd0e9e) [0x7f1e785a2e9e]
E        --- tt::tt_metal::JitBuildState::compile_one(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, tt::tt_metal::JitBuildSettings const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) const

Please complete the following environment information:

eyonland commented 3 months ago

I am unable to reproduce this issue with the test attached.

pytest test_9877.py 2024-07-17 19:38:26.088 | DEBUG | ttnn::140 - Loading ttnn configuration overrides from environment variable TTNN_CONFIG_OVERRIDES 2024-07-17 19:38:26.089 | DEBUG | ttnn::143 - Initial ttnn.CONFIG: {'cache_path': PosixPath('/home/eyon/.cache/ttnn'), 'comparison_mode_pcc': 0.9999, 'enable_comparison_mode': False, 'enable_detailed_buffer_report': False, 'enable_detailed_tensor_report': False, 'enable_fast_runtime_mode': True, 'enable_graph_report': False, 'enable_logging': False, 'enable_model_cache': False, 'model_cache_path': PosixPath('/home/eyon/.cache/ttnn/models'), 'report_name': None, 'root_report_path': PosixPath('generated/ttnn/reports'), 'throw_exception_on_fallback': False, 'tmp_dir': PosixPath('/tmp/ttnn')} 2024-07-17 19:38:26.234 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.logical_xor be migrated to C++? 2024-07-17 19:38:26.235 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.xlogy be migrated to C++? 2024-07-17 19:38:26.235 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.maximum be migrated to C++? 2024-07-17 19:38:26.235 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.minimum be migrated to C++? 2024-07-17 19:38:26.235 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.atan2 be migrated to C++? 2024-07-17 19:38:26.235 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.hypot be migrated to C++? 2024-07-17 19:38:26.235 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.nextafter be migrated to C++? 2024-07-17 19:38:26.235 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.polyval be migrated to C++? 2024-07-17 19:38:26.235 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.isclose be migrated to C++? 2024-07-17 19:38:26.235 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.all_gather be migrated to C++? 
2024-07-17 19:38:26.235 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.pearson_correlation_coefficient be migrated to C++? 2024-07-17 19:38:26.239 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.conv2d be migrated to C++? 2024-07-17 19:38:26.240 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.reshape be migrated to C++? 2024-07-17 19:38:26.240 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.unsqueeze_to_4D be migrated to C++? 2024-07-17 19:38:26.240 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.squeeze be migrated to C++? 2024-07-17 19:38:26.240 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.from_torch be migrated to C++? 2024-07-17 19:38:26.240 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.to_torch be migrated to C++? 2024-07-17 19:38:26.240 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.to_device be migrated to C++? 2024-07-17 19:38:26.240 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.from_device be migrated to C++? 2024-07-17 19:38:26.240 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.allocate_tensor_on_device be migrated to C++? 2024-07-17 19:38:26.240 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.copy_host_to_device_tensor be migrated to C++? 2024-07-17 19:38:26.240 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.deallocate be migrated to C++? 2024-07-17 19:38:26.240 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.clone be migrated to C++? 2024-07-17 19:38:26.240 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.reallocate be migrated to C++? 2024-07-17 19:38:26.240 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.load_tensor be migrated to C++? 2024-07-17 19:38:26.240 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.dump_tensor be migrated to C++? 
2024-07-17 19:38:26.241 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.as_tensor be migrated to C++? 2024-07-17 19:38:26.241 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.arange be migrated to C++? 2024-07-17 19:38:26.241 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.mse_loss be migrated to C++? 2024-07-17 19:38:26.241 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.l1_loss be migrated to C++? 2024-07-17 19:38:26.242 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.matmul be migrated to C++? 2024-07-17 19:38:26.242 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.linear be migrated to C++? 2024-07-17 19:38:26.242 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.mac be migrated to C++? 2024-07-17 19:38:26.243 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.addcmul be migrated to C++? 2024-07-17 19:38:26.243 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.addcdiv be migrated to C++? 2024-07-17 19:38:26.243 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.lerp be migrated to C++? 2024-07-17 19:38:26.245 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.logit be migrated to C++? 2024-07-17 19:38:26.245 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.polygamma be migrated to C++? 2024-07-17 19:38:26.245 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.hardshrink be migrated to C++? 2024-07-17 19:38:26.245 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.celu be migrated to C++? 2024-07-17 19:38:26.245 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.softshrink be migrated to C++? 2024-07-17 19:38:26.245 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.clip be migrated to C++? 
2024-07-17 19:38:26.245 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.threshold be migrated to C++? 2024-07-17 19:38:26.245 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.glu be migrated to C++? 2024-07-17 19:38:26.245 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.reglu be migrated to C++? 2024-07-17 19:38:26.245 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.swiglu be migrated to C++? 2024-07-17 19:38:26.245 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.geglu be migrated to C++? 2024-07-17 19:38:26.249 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.matmul be migrated to C++? 2024-07-17 19:38:26.249 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.linear be migrated to C++? 2024-07-17 19:38:26.250 | WARNING | ttnn.decorators:operation_decorator:790 - Should ttnn.conv2d be migrated to C++? ============================================================= test session starts ============================================================== platform linux -- Python 3.8.10, pytest-7.2.2, pluggy-1.5.0 -- /home/eyon/git/tt-metal/python_env/bin/python3 cachedir: .pytest_cache rootdir: /home/eyon/git/tt-metal, configfile: pytest.ini plugins: split-0.8.2, xdist-3.6.1, timeout-2.2.0, anyio-4.4.0, dash-2.15.0 timeout: 300.0s timeout method: signal timeout func_only: False collected 1 item

test_9877.py::test[device_params0] ⠁ Initializing Chip
Detecting chips (found 8)
2024-07-17 19:38:26.694 | INFO | SiliconDriver - Detected 4 PCI devices : [0, 1, 2, 3] 2024-07-17 19:38:26.773 | INFO | SiliconDriver - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 0) 2024-07-17 19:38:26.774 | INFO | SiliconDriver - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 4) 2024-07-17 19:38:26.783 | INFO | SiliconDriver - Detected 4 PCI devices : [0, 1, 2, 3] 2024-07-17 19:38:26.810 | INFO | SiliconDriver - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 1) 2024-07-17 19:38:26.811 | INFO | SiliconDriver - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 5) 2024-07-17 19:38:26.819 | INFO | SiliconDriver - Detected 4 PCI devices : [0, 1, 2, 3] 2024-07-17 19:38:26.847 | INFO | SiliconDriver - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 2) 2024-07-17 19:38:26.848 | INFO | SiliconDriver - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 6) 2024-07-17 19:38:26.856 | INFO | SiliconDriver - Detected 4 PCI devices : [0, 1, 2, 3] 2024-07-17 19:38:26.884 | INFO | SiliconDriver - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 3) 2024-07-17 19:38:26.885 | INFO | SiliconDriver - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 7) Always | DEBUG | Initializing firmware Always | DEBUG | Waiting for firmware init complete Always | DEBUG | Firmware init complete Op | DEBUG | Started C++ ttnn operation: ttnn::to_layout Op | DEBUG | Finished C++ ttnn operation: ttnn::to_layout Op | DEBUG | Started C++ ttnn operation: ttnn::mish Op | DEBUG | Started C++ ttnn operation: ttnn::softplus Op | DEBUG | Launching Operation: "Unary &" (device) Op | DEBUG | Attributes: Op | DEBUG | op_chain = {UnaryWithParam(op_type=UnaryOpType::SOFTPLUS,param={1, 20})} Op | DEBUG | output_mem_config = MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt) Op | DEBUG | fp32_dest_acc_en = false Op | DEBUG | preserve_fp32_precision = false Op | DEBUG | output_dtype 
= DataType::BFLOAT16 Op | DEBUG | Input Tensors: Op | DEBUG | 0: Tensor(storage=DeviceStorage(memory_config=MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 1, 102400, 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE) Op | DEBUG | Op | DEBUG | Finished C++ ttnn operation: ttnn::softplus Op | DEBUG | Started C++ ttnn operation: ttnn::tanh Op | DEBUG | Launching Operation: "Unary &" (device) Op | DEBUG | Attributes: Op | DEBUG | op_chain = {UnaryWithParam(op_type=UnaryOpType::TANH,param={})} Op | DEBUG | output_mem_config = MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt) Op | DEBUG | fp32_dest_acc_en = false Op | DEBUG | preserve_fp32_precision = false Op | DEBUG | output_dtype = DataType::BFLOAT16 Op | DEBUG | Input Tensors: Op | DEBUG | 0: Tensor(storage=DeviceStorage(memory_config=MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 1, 102400, 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE) Op | DEBUG | Op | DEBUG | Finished C++ ttnn operation: ttnn::tanh Op | DEBUG | Started C++ ttnn operation: ttnn::multiply Op | DEBUG | Launching Operation: "BinaryDeviceOperation" (DeviceOperation) Op | DEBUG | Attributes: Op | DEBUG | binary_op_type = BinaryOpType::MUL Op | DEBUG | in_place = false Op | DEBUG | activations = std::nullopt Op | DEBUG | memory_config = MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt) Op | DEBUG | dtype = DataType::BFLOAT16 Op | DEBUG | compute_kernel_config = std::nullopt Op | DEBUG | Tensors Args: Op | DEBUG | 0: Tensor(storage=DeviceStorage(memory_config=MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 1, 102400, 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE) Op | DEBUG 
| 1: Tensor(storage=DeviceStorage(memory_config=MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::L1,shard_spec=std::nullopt)),shape=ttnn.Shape([1, 1, 102400, 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE) Op | DEBUG | Op | DEBUG | Finished C++ ttnn operation: ttnn::multiply Op | DEBUG | Finished C++ ttnn operation: ttnn::mish PASSED

==================================================================== PASSES ==================================================================== ============================================================== slowest durations =============================================================== 2.31s setup test_9877.py::test[device_params0] 2.27s call test_9877.py::test[device_params0] 0.01s teardown test_9877.py::test[device_params0] =========================================================== short test summary info ============================================================ PASSED test_9877.py::test[device_params0] ============================================================== 1 passed in 4.60s ===============================================================

mbahnasTT commented 3 months ago

@punithsekar can you please try from the main branch? the unit test as well as the whole model graph? Thanks! @dvartaniansTT fyi

mbahnasTT commented 3 months ago

Thanks @eyonland, let's check again and confirm here

dvartaniansTT commented 3 months ago

@punithsekar thanks for creating the issue. Let's please follow these guidelines for all issues moving forward:

Create a unit test for the failure and share the command to run the test.

Indicate which branch and which card you are using, for instance: E150 for GS, N150 for single-chip WH, N300 for dual-chip WH, ...

The git branch or tag.

Your build steps: for instance, built from source...

so for me it typically looks like this:

  1. git checkout [v0.50.0](https://github.com/tenstorrent/tt-metal/tree/v0.50.0)
  2. git submodule update --init --recursive
  3. build from source following the build instruction on v0.50.0
  4. pytest SOME-PYTHON-UNIT-TEST.

System info: Ubuntu 20, N150, sw commit/branch v0.50.0

then attach a screenshot of the error you are seeing.

eyonland commented 3 months ago

I see this code running

// mish[x] = x*tanh[softplus[x]]
// use transformation y = x*tanh[softplus[x]] by broadcast
// Ref: https://krutikabapat.github.io/Swish-Vs-Mish-Latest-Activation-Functions/
Tensor _mish(const Tensor& x, const std::optional<MemoryConfig>& output_mem_config) {
    std::vector<Tensor> output_tensors = {Tensor(operation::get_workers_for_op_output({x}))};
    operation::launch_op(
        [output_mem_config](
            const std::vector<Tensor>& input_tensors,
            const std::vector<std::optional<const Tensor>>& optional_input_tensors,
            const std::vector<std::optional<Tensor>>& optional_output_tensors) mutable -> std::vector<Tensor> {
            const auto& x = input_tensors.at(0);
            Tensor sp_x = ttnn::softplus(x, 1.0f, 20.0f, output_mem_config);
            Tensor tanh_x = ttnn::tanh(sp_x, output_mem_config);
            sp_x.deallocate();
            Tensor mish_x = ttnn::multiply(x, tanh_x, std::nullopt, output_mem_config);
            return {mish_x};
        },
        {x},
        output_tensors);
    return output_tensors.at(0);
}
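The composite above chains three device ops — softplus with params (1.0, 20.0), tanh, then an elementwise multiply — matching the op_chain visible in the DEBUG log. A host-side mirror of that decomposition (an illustrative sketch, not the actual kernel code):

```python
import math

def softplus(x: float, beta: float = 1.0, threshold: float = 20.0) -> float:
    # Same beta/threshold parameterization the composite requests.
    return x if beta * x > threshold else math.log1p(math.exp(beta * x)) / beta

def mish_composite(xs: list) -> list:
    sp = [softplus(x) for x in xs]           # ttnn::softplus
    th = [math.tanh(v) for v in sp]          # ttnn::tanh
    del sp                                   # analogous to sp_x.deallocate()
    return [x * t for x, t in zip(xs, th)]   # ttnn::multiply
```

Note that the intermediate softplus result is freed before the multiply, just as the C++ deallocates sp_x once tanh has consumed it.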

Created test_9877.py with...

import ttnn
import torch
import pytest

@pytest.mark.parametrize("device_params", [{"l1_small_size": 32768}], indirect=True)
def test(device):
    a = torch.randn((1, 1, 102400, 32), dtype=torch.float16)
    ttnn_input_tensor = ttnn.from_torch(
        a,
        dtype=ttnn.bfloat16,
        memory_config=ttnn.L1_MEMORY_CONFIG,
        device=device,
        layout=ttnn.TILE_LAYOUT,
    )
    output = ttnn.mish(ttnn_input_tensor)

And then I run this

gdb --args python -m pytest test_9877.py
b unary_composite_op.cpp:280
r

Thread 1 "python" hit Breakpoint 1, ttnn::operations::unary::_mish(tt::tt_metal::Tensor const&, std::__1::optional const&)::$_0::operator()(std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator > const&, std::__1::vector<std::__1::optional, std::__1::allocator<std::__1::optional > > const&, std::__1::vector<std::__1::optional, std::__1::allocator<std::__1::optional > > const&) (this=0x5fe2168, input_tensors=..., optional_input_tensors=..., optional_output_tensors=...) at ../ttnn/cpp/ttnn/operations/eltwise/unary/device/unary_composite_op.cpp:283 283 const auto& x = input_tensors.at(0); (gdb) bt

#0  ttnn::operations::unary::_mish(tt::tt_metal::Tensor const&, std::__1::optional<tt::tt_metal::MemoryConfig> const&)::$_0::operator()(std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor> > const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const> > > const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor> > > const&) (this=0x5fe2168, input_tensors=..., optional_input_tensors=..., optional_output_tensors=...)
    at ../ttnn/cpp/ttnn/operations/eltwise/unary/device/unary_composite_op.cpp:283
#1  0x00007fff8703d6f8 in std::__1::__invoke[abi:ue170006]<ttnn::operations::unary::_mish(tt::tt_metal::Tensor const&, std::__1::optional<tt::tt_metal::MemoryConfig> const&)::$_0&, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor> > const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const> > > const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor> > > const&> (__f=..., __args=..., __args=..., __args=...)
    at /usr/lib/llvm-17/bin/../include/c++/v1/__type_traits/invoke.h:340
dvartaniansTT commented 3 months ago

thanks a lot @eyonland !

I'd wait on @punithsekar to provide the details I asked for. He might be running from a specific branch, and I'm not sure which card he is using.

KalaivaniMCW commented 3 months ago

@mbahnasTT We couldn't reproduce this error on our Wormhole_B0 machine. cc @umadevimcw @eyonland

Screenshot 2024-07-17 at 11 24 45 PM
punithsekar commented 3 months ago

Hi @dvartaniansTT ,

Even on the latest main I face the same issue (commit 05ff4d77f7702610c58e9f18ed918f347c0dbfeb).

I use the following build commands:

git submodule foreach 'git lfs fetch --all && git lfs pull'
git submodule update --init --recursive
export ARCH_NAME=grayskull
export TT_METAL_HOME=$(pwd)
export PYTHONPATH=$(pwd)
export TT_METAL_ENV=dev

./build_metal.sh
./create_venv.sh
source python_env/bin/activate

pip install -r ./tests/end_to_end_tests/requirements.txt

I am using a GS E150 instance. I didn't use this op in the pipeline since it was causing issues; I created a new file, tested the op, and shared the snippet.

Steps to reproduce:

  1. git checkout punith/mish_unit_test
  2. Run the following command: pytest models/experimental/yolov4/reference/unittest.py

Additionally, I checked on a colleague's GS VM instance, and they hit the same failure. On a WH VM instance, the test works fine without any issues.

Thanks.

punithsekar commented 3 months ago

Hi @mbahnasTT @eyonland @dvartaniansTT , It appears that Mish is functioning properly on WH. The previously mentioned error is only encountered in GS.

KalaivaniMCW commented 3 months ago

Hi @punithsekar , Mish uses Softplus in its implementation, which is not available on Grayskull. So Mish will not be available on Grayskull either.

cc: @eyonland
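For context, the beta/threshold parameterization of softplus that mish requests (beta=1.0, threshold=20.0 in the C++ snippet earlier in the thread) guards against exp() overflow: past the threshold, softplus is linear to within float precision. A host-side sketch of that behavior (illustrative, not the SFPU kernel):

```python
import math

def softplus(x: float, beta: float = 1.0, threshold: float = 20.0) -> float:
    # Once beta*x exceeds the threshold, log1p(exp(beta*x))/beta equals x
    # to within double precision, so return x and skip exp() entirely.
    if beta * x > threshold:
        return x
    return math.log1p(math.exp(beta * x)) / beta

# Near the threshold the two branches agree:
# log1p(exp(19)) - 19 ~= 5.6e-9
```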

eyonland commented 3 months ago

@KalaivaniMCW, please update our documentation to reflect that mish is not supported on GS.

rdjogoTT commented 3 months ago

Hello all, the reason Softplus was originally not implemented for Grayskull is a HW limitation (GS has too few general-purpose registers in the SFPU to support it). However, an update to the implementation has since reduced the register pressure, so it may now be possible to support it on Grayskull. There is currently no active work to add softplus to GS.

eyonland commented 3 months ago

@jvasilje, given this works on WH, should this be implemented on GS? Also, any objection to downgrading to P1?

jvasilje commented 3 months ago

no need for GS