Assertion Failure in Cluster::set_tunnels_from_mmio_device() When Two WH_B0 Cards are Installed in Host System

dongjin-na commented 4 months ago

Describe the bug To improve development productivity, two WH_B0 cards were installed in the server so that two users can each use one WH_B0 card simultaneously for development. However, when running a program on this system, an assertion failure occurs in the Cluster::set_tunnels_from_mmio_device()function within the tt-metal runtime. (This function was merged in #9515). I would appreciate it if you could let me know if there are any settings I might have missed.

To Reproduce Steps to reproduce the behavior:

Install two WH_B0 cards in the host.
Build the latest version of tt-metal.
Execute any test program.

Expected behavior

Users should be able to create and use devices corresponding to each card.

Screenshots

Logs of the program execution and screenshots of tt-smi are attached.


(python_env) dongjin@ttdev01:~/tt-metal$ ./build/test/tt_eager/ops/test_bmm_op
             Device | INFO     | Opening user mode device driver
Detecting chips (found 4)
2024-07-16 22:04:46.074 | INFO     | SiliconDriver   - Detected 2 PCI devices : [0, 1]
2024-07-16 22:04:46.158 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 0)
2024-07-16 22:04:46.165 | INFO     | SiliconDriver   - Detected 2 PCI devices : [0, 1]
2024-07-16 22:04:46.169 | WARNING  | SiliconDriver   - NumHostMemChannels: 2 used for device_id: 0x401e less than target: 3. Workload will fail if it exceeds NumHostMemChannels. Increase Number of Hugepages.
2024-07-16 22:04:46.197 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 1)
2024-07-16 22:04:46.198 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 2)
2024-07-16 22:04:46.199 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 3)
             Always | FATAL    | Loop Exit Error.
libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_ASSERT @ /home/dongjin/tt-metal/tt_metal/llrt/tt_cluster.cpp:656: tunneled_device_hit || (it == device_ids.end())
info:
Loop Exit Error.
backtrace:
--- void tt::assert::tt_assert<char [17]>(char const*, int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, bool, char const*, char const (&) [17])
--- tt::Cluster::set_tunnels_from_mmio_device()
--- tt::Cluster::Cluster()
--- tt::Cluster::instance()
--- tt::DevicePool::DevicePool(std::__1::vector<int, std::__1::allocator<int>>, unsigned char, unsigned long, unsigned long, std::__1::vector<unsigned int, std::__1::allocator<unsigned int>> const&)
--- /home/dongjin/tt-metal/build/tt_metal/libtt_metal.so(+0x449919) [0x7fa4c4baa919]
--- tt::tt_metal::CreateDevice(int, unsigned char, unsigned long, unsigned long, std::__1::vector<unsigned int, std::__1::allocator<unsigned int>> const&)
--- ./build/test/tt_eager/ops/test_bmm_op(main+0x5b) [0x5636fc9295db]
--- /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fa4c3954083]
--- ./build/test/tt_eager/ops/test_bmm_op(+0x1f4be) [0x5636fc9294be]

Aborted (core dumped)

(python_env) dongjin@ttdev01:~/tt-metal/build/test/tt_metal$ TT_METAL_SLOW_DISPATCH_MODE=1 ./unit_tests_frequent Running main() from /home/dongjin/tt-metal/.cpmcache/googletest/96129d89f45386492ae46d6bb8c027bc3df5f949/googletest/src/gtest_main.cc [==========] Running 1 test from 1 test suite. [----------] Global test environment set-up. [----------] 1 test from Common [ RUN ] Common.AllCoresRunManyTimes Test | INFO | Running iteration #0 Device | INFO | Opening user mode device driver Detecting chips (found 4) 2024-07-16 22:05:43.962 | INFO | SiliconDriver - Detected 2 PCI devices : [0, 1] 2024-07-16 22:05:44.049 | INFO | SiliconDriver - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 0) 2024-07-16 22:05:44.056 | INFO | SiliconDriver - Detected 2 PCI devices : [0, 1] 2024-07-16 22:05:44.060 | WARNING | SiliconDriver - NumHostMemChannels: 2 used for device_id: 0x401e less than target: 3. Workload will fail if it exceeds NumHostMemChannels. Increase Number of Hugepages. 2024-07-16 22:05:44.088 | INFO | SiliconDriver - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 1) 2024-07-16 22:05:44.089 | INFO | SiliconDriver - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 2) 2024-07-16 22:05:44.090 | INFO | SiliconDriver - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 3) Always | FATAL | Loop Exit Error. unknown file: Failure C++ exception with description "TT_ASSERT @ /home/dongjin/tt-metal/tt_metal/llrt/tt_cluster.cpp:656: tunneled_device_hit || (it == device_ids.end()) info: Loop Exit Error. backtrace: --- void tt::assert::tt_assert<char [17]>(char const, int, std::1::basic_string<char, std::__1::char_traits, std::1::allocator> const&, bool, char const, char const (&) [17]) --- tt::Cluster::set_tunnels_from_mmio_device() --- tt::Cluster::Cluster() --- tt::Cluster::instance() --- tt::tt_metal::GetNumAvailableDevices() --- ./unit_tests_frequent(+0xbd44) [0x5589a093bd44] --- void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test, void (testing::Test::)(), char const) --- void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test, void (testing::Test::)(), char const) --- testing::Test::Run() --- testing::TestInfo::Run() --- testing::TestSuite::Run() --- testing::internal::UnitTestImpl::RunAllTests() --- bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl, bool (testing::internal::UnitTestImpl::)(), char const) --- bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl, bool (testing::internal::UnitTestImpl::)(), char const) --- testing::UnitTest::Run() --- RUN_ALL_TESTS() --- /home/dongjin/tt-metal/build/lib/libgtest_main.so.1.13.0(main+0x3d) [0x7fbc7018f18d] --- /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fbc6facb083] --- ./unit_tests_frequent(+0xaa5e) [0x5589a093aa5e] " thrown in the test body. [ FAILED ] Common.AllCoresRunManyTimes (258 ms) [----------] 1 test from Common (258 ms total)

[----------] Global test environment tear-down [==========] 1 test from 1 test suite ran. (258 ms total) [ PASSED ] 0 tests. [ FAILED ] 1 test, listed below: [ FAILED ] Common.AllCoresRunManyTimes

1 FAILED TEST

- tt-smi
<img width="1650" alt="Screenshot 2024-07-16 at 8 56 25 PM" src="https://github.com/user-attachments/assets/90113df1-1158-45b3-a4a6-2a090b7ad033">
<img width="1439" alt="Screenshot 2024-07-16 at 8 56 32 PM" src="https://github.com/user-attachments/assets/5c30baa3-e612-4bd8-ac7e-b0c9c0f8f93c">

<img width="1347" alt="Screenshot 2024-07-16 at 8 56 43 PM" src="https://github.com/user-attachments/assets/f429b7ca-fe46-4f3b-bdff-0a2258de6d3e">

**Please complete the following environment information:**
- OS: [e.g. Ubuntu 20.04] : Ubuntu 20.04
- Version of software (eg. commit) :

commit 354761a4cad252a45335854fa9466ff7b73904dc (HEAD -> main, origin/main, origin/HEAD) Author: Bharane AB bharane.amutha@multicorewareinc.com Date: Tue Jul 16 06:18:24 2024 +0000

#9911: rebase to latest main

namhyeong-kim commented 3 weeks ago

Now, it hangs on the same function Cluster::set_tunnels_from_mmio_device() https://github.com/tenstorrent/tt-metal/blob/7c65c017a2d42ebfa805f767ed4ba7c9ac9a28a2/tt_metal/llrt/tt_cluster.cpp#L685-L700

jvasilje commented 3 weeks ago

@pgkeller this one was recently switched to a P0 - could someone take a look?

jbaumanTT commented 3 weeks ago

I got an IRD reservation with multiple chips. It looks like get_devices_controlled_by_mmio_device is assigning both ethernet-accessible devices to one mmio device, so I'll continue investigating.

jbaumanTT commented 3 weeks ago

The cluster description from .umd/cluster_desc.yaml assigns the same coordinates and ethernet connections to multiple chips.

arch: {
   0: Wormhole,
   1: Wormhole,
   2: Wormhole,
   3: Wormhole,
}

chips: {
   0: [0,0,0,0],
   1: [0,0,0,0],
   2: [1,0,0,0],
   3: [1,0,0,0],
}

ethernet_connections: [
   [{chip: 0, chan: 8}, {chip: 2, chan: 0}],
   [{chip: 0, chan: 9}, {chip: 2, chan: 1}],
   [{chip: 1, chan: 8}, {chip: 3, chan: 0}],
   [{chip: 1, chan: 9}, {chip: 3, chan: 1}],
]

chips_with_mmio: [
   0: 2,
   1: 3,
]

# harvest_mask is the bit indicating which tensix row is harvested. So bit 0 = first tensix row; bit 1 = second tensix row etc...
harvesting: {
   0: {noc_translation: true, harvest_mask: 65},
   1: {noc_translation: true, harvest_mask: 65},
   2: {noc_translation: true, harvest_mask: 513},
   3: {noc_translation: true, harvest_mask: 17},
}

# This value will be null if the boardtype is unknown, should never happen in practice but to be defensive it would be useful to throw an error on this case.
boardtype: {
   0: n300,
   1: n300,
   2: n300,
   3: n300,

I'm going to let @nhuang-tt do the rest of the investigation, but let me know if there's any info you could use.

nhuang-tt commented 3 weeks ago

As a simple sanity test, ./build_Debug/programming_examples/hello_world_compute_kernel ran into the assertion confirming it is something happening during init.

The cluster_desc.yaml file is only being used for non (GS, BH, Simulator) products so the issue seems to only be affecting Wormhole.

If I use the same cluster descriptor creation method as non Wormhole by making the changes shown below in tt_metal/llrt/tt_cluster.cpp:170 then the sanity test passes. So we need to find out how/where the UMD file is generated or we could not use it.

if (this->arch_ == tt::ARCH::GRAYSKULL or this->arch_ == tt::ARCH::BLACKHOLE or this->target_type_ == TargetDevice::Simulator) {

to

if (true) {

pgkeller commented 2 weeks ago

@dongjin-na a couple of questions: 1) How is your system physically connected? Are there ethernet connections between the two cards or are they two disconnected cards? 2) Have you run tt-topology?

@TTDRosen from the tt-smi output posted above, do you know how a system can get in this state (the coords do not look valid)?

TTDRosen commented 2 weeks ago

What we are looking at is the default state of the coordinates (pre tt-topology invocation). From what I'm seeing chip 0 is enumerated as /dev/tenstorrent/2 and chip 1 is enumerated as /dev/tenstorrent/3. Which can happen either due to manual shenanigans or if there are 4 cards on the system and they've been divided up with cgroups.

dongjin-na commented 2 weeks ago

@pgkeller, thanks for the check.

We installed two disconnected cards on the main board.
We haven’t run tt-topology yet.

Would an Ethernet connection be necessary to properly configure the topology using tt-topology?

joelsmithTT commented 2 weeks ago

@TTDRosen, I suspect the 2, 3 thing is due to IRD. Here's the output of create-ethernet-map on a dual N300 system without Docker, where the KMD has enumerated the cards as /dev/tenstorrent/0 and /dev/tenstorrent/1

arch: {
   0: Wormhole,
   1: Wormhole,
   2: Wormhole,
   3: Wormhole,
}

chips: {
   0: [0,0,0,0],
   1: [0,0,0,0],
   2: [1,0,0,0],
   3: [1,0,0,0],
}

ethernet_connections: [
   [{chip: 0, chan: 8}, {chip: 2, chan: 0}],
   [{chip: 0, chan: 9}, {chip: 2, chan: 1}],
   [{chip: 1, chan: 8}, {chip: 3, chan: 0}],
   [{chip: 1, chan: 9}, {chip: 3, chan: 1}],
]

chips_with_mmio: [
   0: 0,
   1: 1,
]

# harvest_mask is the bit indicating which tensix row is harvested. So bit 0 = first tensix row; bit 1 = second tensix row etc...
harvesting: {
   0: {noc_translation: true, harvest_mask: 65},
   1: {noc_translation: true, harvest_mask: 3},
   2: {noc_translation: true, harvest_mask: 9},
   3: {noc_translation: true, harvest_mask: 257},
}

# This value will be null if the boardtype is unknown, should never happen in practice but to be defensive it would be useful to throw an error on this case.
boardtype: {
   0: n300,
   1: n300,
   2: n300,
   3: n300,
}

razorback3 commented 2 weeks ago

Excuse me for interrupting, but what does IRD stand for?

joelsmithTT commented 2 weeks ago

@razorback3, Interactive Run Docker (IRD) is a wrapper around Docker we use internally. It can provide a development environment in which a subset of the accelerator devices are visible (e.g. software sees only /dev/tenstorrent/2 and /dev/tenstorrent/3).

pgkeller commented 2 weeks ago

@pgkeller, thanks for the check.

We installed two disconnected cards on the main board.

We haven’t run tt-topology yet.

Would an Ethernet connection be necessary to properly configure the topology using tt-topology?

Ethernet connection should not be required, I was just checking to see how things are connected because the information from tt-smi didn't seem valid. It looks like your issue will be fixed by running tt-topology and then we'll fix UMD to catch and report this problem with something useful.

dongjin-na commented 2 weeks ago

@pgkeller, Thank you for your reply. We’ll proceed with running tt-topology and check it. However, there is an issue with the mainboard, so we will check it once the repair is complete.

joelsmithTT commented 2 weeks ago

Actionable item for UMD here is to FATAL error with "Misconfigured system, run tt-topology" if it encounters this situation.

zzigler-tt commented 1 week ago

@joelsmithTT @broskoTT Checking in on the progress, do we have an ETA on the resolution for this issue?

broskoTT commented 1 week ago

@zzigler-tt Apologies for the late reply. The UMD currently doesn't support multiple clusters of unconnected chips. If the two n300 cards were connected by ethernet, it would've been supported. The way we internally use systems with multiple cards, is that we create dockers where we map only a single card to each of the images. Then each guest system sees only a single card.

I didn't get from the whole thread what would be the preferred resolution here? Is it only required from UMD to throw an appropriate error in case of misconfigured system? In that case, we can do the modification fairly quickly. Or is it required to implement support for such systems? In that case, we can do it in ~1 week, in case no additional problems arise along the way.

razorback3 commented 1 week ago

@broskoTT

Is it only required from UMD to throw an appropriate error in case of misconfigured system? In that case, we can do the modification fairly quickly. Or is it required to implement support for such systems? In that case, we can do it in ~1 week, in case no additional problems arise along the way.

Our requirement was supporting such system (i.e. multiple cards without ethernet).

broskoTT commented 1 week ago

@razorback3 thanks. There are two different problems here though:

UMD won't work well if we try to open multiple unconnected cards. This is not a bug, but a feature request since this was never supported. If we implement this, it would allow tt_metal to use both cards from the single host process. Fixing this would include changes only in UMD.
If you want to support multiple host processes, each of them running only on a set of chips, that is supported in UMD, and a change in tt_metal is required to use this, around here: https://github.com/tenstorrent/tt-metal/blob/main/tt_metal/llrt/tt_cluster.cpp#L259.

It sounds like the request is to have two tt_metal runtimes, where each of them uses a single card? That is the second problem then, not the first one. UMD API already supports opening a subset of the cluster.

Even if UMD implemented the feature of supporting multiple unconnected cards, I don't think everything would work well if two tt_metal runtimes both open both cards. So this looks a change required in tt_metal. In this case, I agree with @joelsmithTT that it would save a lot of time if UMD could detect that the client (tt_metal) is trying to open multiple unconnected cards which is not supported currently, and throw error during initialization.

Not sure who should comment on this further from the metal side, @pgkeller ?

razorback3 commented 1 week ago

Our use-case falls in the second category. Multiple users are running their own process and each trying to use only one unconnected card.

razorback3 commented 1 week ago

As a side question, do you guys have any plan to implement a feature similar to CUDA_VISIBLE_DEVICES environment variable?

broskoTT commented 1 week ago

UMD should be supporting multiple unconnected cards (not thoroughly tested) when this gets merged https://github.com/tenstorrent/tt-umd/pull/306. Such problems as investigated in this issue wouldn't be seen anymore.

I will be moving the issue to tt_metal, since there is still an item on their part to properly allow users to open only a part of the the available system, which should already be available through UMD's constructor: https://github.com/tenstorrent/tt-metal/blob/main/tt_metal/llrt/tt_cluster.cpp#L259 .

I would expect problems to arise if that is not done, and if two processes both try opening all the available devices. I'm also not completely sure if everything will work well even when that is done, UMD might still mishandle something (hugepages for example). We haven't tested multiple hosts on same machine.

As a side question, do you guys have any plan to implement a feature similar to CUDA_VISIBLE_DEVICES environment variable?

I think this would be more of a question for tt_metal, and how would they implement selecting a subset of devices for the client. The UMD currently supports selecting a subset through the API. If there is a request to also be able to select this from ENV variable, we could do that easily.

pgkeller commented 1 week ago

As a side question, do you guys have any plan to implement a feature similar to CUDA_VISIBLE_DEVICES environment variable?

This was not on our roadmap and hasn't been a priority internally. If this is a high priority need on your part we'll add it. Thanks

tenstorrent / tt-metal

Assertion Failure in Cluster::set_tunnels_from_mmio_device() When Two WH_B0 Cards are Installed in Host System #15101