tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

line_all_gather issue with device async enabled #11089

Closed kpaigwar closed 2 months ago

kpaigwar commented 3 months ago

Description

After integrating the new API changes for ttnn.line_all_gather, I am seeing the error below when enabling device async mode on a Galaxy machine.

for i in device_mesh.get_device_ids():
    device = device_mesh.get_device(i)
    device.enable_async(True)
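For repro scripts that need to toggle async mode on and off, the loop above can be wrapped in a small helper. The following is only a sketch: `async_mode` is a hypothetical name, not part of ttnn, and `FakeDevice` / `FakeMesh` are stand-ins so the snippet runs without hardware. It assumes only the three calls shown above (`get_device_ids`, `get_device`, `enable_async`) and assumes async mode was off beforehand, so it always switches it off on exit.

```python
from contextlib import contextmanager


@contextmanager
def async_mode(device_mesh, enabled=True):
    """Hypothetical helper: set async mode on every device in the mesh.

    Assumption: async mode was off before entering, so it is turned off
    again on exit, even if the body raises.
    """
    devices = [device_mesh.get_device(i) for i in device_mesh.get_device_ids()]
    for device in devices:
        device.enable_async(enabled)
    try:
        yield devices
    finally:
        for device in devices:
            device.enable_async(False)


class FakeDevice:
    """Stand-in for a real device; only records the async flag."""

    def __init__(self):
        self.async_enabled = False

    def enable_async(self, flag):
        self.async_enabled = bool(flag)


class FakeMesh:
    """Stand-in for a device mesh exposing the two calls used above."""

    def __init__(self, num_devices):
        self.devices = {i: FakeDevice() for i in range(num_devices)}

    def get_device_ids(self):
        return list(self.devices)

    def get_device(self, device_id):
        return self.devices[device_id]
```

With a real mesh, the body of the `with async_mode(device_mesh):` block would hold the `ttnn.line_all_gather` call under test; the stand-in classes exist only to make the sketch self-contained.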

Error

Always | FATAL   | Device not found in any view
libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_ASSERT @ ../ttnn/cpp/ttnn/operations/ccl/line_all_gather/device/line_all_gather_op.cpp:162: selected_view != nullptr
info:
Device not found in any view

Repro

git checkout kpaigwar/repro_async_issue
pytest tests/ttnn/multichip_unit_tests/test_multidevice_TG.py::test_device_line_all_gather_8x4_data_async_issue

kpaigwar commented 3 months ago

fyi @SeanNijjar @cglagovichTT @uaydonat @djordje-tt

SeanNijjar commented 3 months ago

Hey @kpaigwar, which type of machine is this?

kpaigwar commented 3 months ago

Hey @SeanNijjar and @cfjchu, this is a Galaxy machine. I have added a unit test to reproduce the issue.

kpaigwar commented 3 months ago

pytest tests/ttnn/multichip_unit_tests/test_multidevice_TG.py::test_device_line_all_gather_8x4_data_async_issue

This runs two tests: the first passes with async_mode turned off, and the second runs into a seg fault.

tt-asaigal commented 3 months ago

Hey @kpaigwar, in case you're blocked, here's a branch with changes rebased on top of your branch that should work: https://github.com/tenstorrent/tt-metal/tree/asaigal/issue_11089. @cfjchu and I will properly uplift these changes to main.

kpaigwar commented 3 months ago

Thanks @tt-asaigal for the update

kpaigwar commented 3 months ago

@tt-asaigal and @cfjchu, the fix is working on our demo.

cfjchu commented 2 months ago

@kpaigwar, the fixes are now in main. Please retry and let us know if you hit any issues.

kpaigwar commented 2 months ago

@cfjchu, I tried from main and the issue has been resolved.

cfjchu commented 2 months ago

> @cfjchu, tried from main, issue has been resolved.

great to hear - thx!

uaydonat commented 2 months ago

close?

cfjchu commented 2 months ago

> close?

yes, it's closed