dongjin-na opened 4 months ago
Now it hangs in the same function, Cluster::set_tunnels_from_mmio_device():
https://github.com/tenstorrent/tt-metal/blob/7c65c017a2d42ebfa805f767ed4ba7c9ac9a28a2/tt_metal/llrt/tt_cluster.cpp#L685-L700
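For context, the TT_ASSERT that eventually fires here ("Loop Exit Error.", see the log at the bottom of this thread) guards a tunnel-discovery loop. A minimal, self-contained sketch of that loop's shape, with simplified types; this is an illustration, not the actual tt-metal implementation:

#include <map>
#include <set>
#include <stdexcept>
#include <vector>

using chip_id_t = int;

// Illustrative sketch (simplified types, not the real tt-metal code): every
// device the cluster descriptor assigns to an MMIO chip must be reachable,
// hop by hop, over ethernet. If get_devices_controlled_by_mmio_device() hands
// this loop a device that is not reachable from mmio_chip, an iteration makes
// no progress and the "Loop Exit Error" assert fires.
void set_tunnels_from_mmio_device_sketch(
    chip_id_t mmio_chip,
    std::set<chip_id_t> device_ids,  // devices assigned to this MMIO chip
    const std::map<chip_id_t, std::set<chip_id_t>>& eth_neighbors) {
    std::vector<chip_id_t> tunnel = {mmio_chip};
    device_ids.erase(mmio_chip);
    while (!device_ids.empty()) {
        bool tunneled_device_hit = false;
        for (auto it = device_ids.begin(); it != device_ids.end();) {
            if (eth_neighbors.at(tunnel.back()).count(*it)) {
                tunnel.push_back(*it);
                it = device_ids.erase(it);
                tunneled_device_hit = true;
            } else {
                ++it;
            }
        }
        if (!tunneled_device_hit) {
            // Mirrors TT_ASSERT(tunneled_device_hit || (it == device_ids.end()), ...)
            throw std::runtime_error("Loop Exit Error.");
        }
    }
}

With the bad descriptor below, one MMIO chip is assigned both remote chips but is only an ethernet neighbor of one of them, so the second iteration makes no progress.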
@pgkeller this one was recently switched to a P0 - could someone take a look?
I got an IRD reservation with multiple chips. It looks like get_devices_controlled_by_mmio_device is assigning both ethernet-accessible devices to one mmio device, so I'll continue investigating.
The cluster description from .umd/cluster_desc.yaml assigns the same coordinates and ethernet connections to multiple chips.
arch: {
0: Wormhole,
1: Wormhole,
2: Wormhole,
3: Wormhole,
}
chips: {
0: [0,0,0,0],
1: [0,0,0,0],
2: [1,0,0,0],
3: [1,0,0,0],
}
ethernet_connections: [
[{chip: 0, chan: 8}, {chip: 2, chan: 0}],
[{chip: 0, chan: 9}, {chip: 2, chan: 1}],
[{chip: 1, chan: 8}, {chip: 3, chan: 0}],
[{chip: 1, chan: 9}, {chip: 3, chan: 1}],
]
chips_with_mmio: [
0: 2,
1: 3,
]
# harvest_mask is a bitmask indicating which tensix rows are harvested: bit 0 = first tensix row, bit 1 = second tensix row, etc.
harvesting: {
0: {noc_translation: true, harvest_mask: 65},
1: {noc_translation: true, harvest_mask: 65},
2: {noc_translation: true, harvest_mask: 513},
3: {noc_translation: true, harvest_mask: 17},
}
# This value will be null if the boardtype is unknown. This should never happen in practice, but to be defensive it would be useful to throw an error in this case.
boardtype: {
0: n300,
1: n300,
2: n300,
3: n300,
}
I'm going to let @nhuang-tt do the rest of the investigation, but let me know if there's any info you could use.
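Tangentially, the harvest_mask comment in the dump above is easy to verify with a small decoder; this helper is illustrative only, not part of the codebase:

#include <cstdint>
#include <vector>

// Decode a harvest_mask bitfield into the harvested tensix row indices.
// E.g. 65 = 0b1000001 -> rows {0, 6}; 513 = 0b1000000001 -> rows {0, 9}.
std::vector<int> harvested_rows(uint32_t harvest_mask) {
    std::vector<int> rows;
    for (int row = 0; harvest_mask != 0; ++row, harvest_mask >>= 1) {
        if (harvest_mask & 1) rows.push_back(row);
    }
    return rows;
}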
As a simple sanity test, running ./build_Debug/programming_examples/hello_world_compute_kernel hit the same assertion, confirming that the failure happens during init.
The cluster_desc.yaml file is only used for architectures other than GS, BH, and Simulator, so the issue seems to affect only Wormhole. If I use the same cluster descriptor creation method as the non-Wormhole architectures, by making the change shown below at tt_metal/llrt/tt_cluster.cpp:170, the sanity test passes. So we need to find out how/where UMD generates this file, or stop using it.
if (this->arch_ == tt::ARCH::GRAYSKULL or this->arch_ == tt::ARCH::BLACKHOLE or this->target_type_ == TargetDevice::Simulator) {
to
if (true) {
@dongjin-na a couple of questions: 1) How is your system physically connected? Are there ethernet connections between the two cards or are they two disconnected cards? 2) Have you run tt-topology?
@TTDRosen from the tt-smi output posted above, do you know how a system can get in this state (the coords do not look valid)?
What we are looking at is the default state of the coordinates (pre-tt-topology invocation). From what I'm seeing, chip 0 is enumerated as /dev/tenstorrent/2 and chip 1 as /dev/tenstorrent/3. This can happen either due to manual shenanigans or if there are 4 cards in the system that have been divided up with cgroups.
@TTDRosen, I suspect the 2, 3 thing is due to IRD. Here's the output of create-ethernet-map on a dual N300 system without Docker, where the KMD has enumerated the cards as /dev/tenstorrent/0 and /dev/tenstorrent/1:
arch: {
0: Wormhole,
1: Wormhole,
2: Wormhole,
3: Wormhole,
}
chips: {
0: [0,0,0,0],
1: [0,0,0,0],
2: [1,0,0,0],
3: [1,0,0,0],
}
ethernet_connections: [
[{chip: 0, chan: 8}, {chip: 2, chan: 0}],
[{chip: 0, chan: 9}, {chip: 2, chan: 1}],
[{chip: 1, chan: 8}, {chip: 3, chan: 0}],
[{chip: 1, chan: 9}, {chip: 3, chan: 1}],
]
chips_with_mmio: [
0: 0,
1: 1,
]
# harvest_mask is a bitmask indicating which tensix rows are harvested: bit 0 = first tensix row, bit 1 = second tensix row, etc.
harvesting: {
0: {noc_translation: true, harvest_mask: 65},
1: {noc_translation: true, harvest_mask: 3},
2: {noc_translation: true, harvest_mask: 9},
3: {noc_translation: true, harvest_mask: 257},
}
# This value will be null if the boardtype is unknown. This should never happen in practice, but to be defensive it would be useful to throw an error in this case.
boardtype: {
0: n300,
1: n300,
2: n300,
3: n300,
}
Excuse me for interrupting, but what does IRD stand for?
@razorback3, Interactive Run Docker (IRD) is a wrapper around Docker we use internally. It can provide a development environment in which a subset of the accelerator devices are visible (e.g. software sees only /dev/tenstorrent/2 and /dev/tenstorrent/3).
@pgkeller, thanks for the check.
- We installed two disconnected cards on the main board.
- We haven’t run tt-topology yet.
Would an Ethernet connection be necessary to properly configure the topology using tt-topology?
An Ethernet connection should not be required; I was just checking to see how things are connected because the information from tt-smi didn't seem valid. It looks like your issue will be fixed by running tt-topology, and we'll then fix UMD to catch this problem and report it with a useful message.
@pgkeller, thank you for your reply. We'll proceed with running tt-topology and check. However, there is an issue with the mainboard, so we will verify once the repair is complete.
Actionable item for UMD here is to FATAL error with "Misconfigured system, run tt-topology" if it encounters this situation.
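A minimal sketch of what such a check could look like (hypothetical helper and types; the actual UMD change may differ): fail fast during initialization when a chip assigned to an MMIO device is not ethernet-reachable from it, instead of asserting deep inside tunnel discovery.

#include <map>
#include <queue>
#include <set>
#include <stdexcept>

using chip_id_t = int;

// Hedged sketch of the proposed UMD-side check (hypothetical names): BFS over
// the ethernet_connections graph from the MMIO chip, then verify that every
// chip the cluster descriptor assigns to it was actually reached.
void validate_mmio_grouping(
    chip_id_t mmio_chip,
    const std::set<chip_id_t>& controlled_chips,
    const std::map<chip_id_t, std::set<chip_id_t>>& eth_neighbors) {
    std::set<chip_id_t> reachable{mmio_chip};
    std::queue<chip_id_t> frontier;
    frontier.push(mmio_chip);
    while (!frontier.empty()) {
        chip_id_t cur = frontier.front();
        frontier.pop();
        auto it = eth_neighbors.find(cur);
        if (it == eth_neighbors.end()) continue;
        for (chip_id_t next : it->second) {
            if (reachable.insert(next).second) frontier.push(next);
        }
    }
    for (chip_id_t chip : controlled_chips) {
        if (!reachable.count(chip)) {
            throw std::runtime_error("Misconfigured system, run tt-topology");
        }
    }
}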
@joelsmithTT @broskoTT Checking in on the progress, do we have an ETA on the resolution for this issue?
@zzigler-tt Apologies for the late reply. UMD currently doesn't support multiple clusters of unconnected chips; if the two n300 cards were connected by ethernet, this would be supported. The way we use systems with multiple cards internally is to create Docker containers that each map only a single card, so each guest system sees only a single card.
It isn't clear to me from the thread what the preferred resolution would be. Is it only required that UMD throw an appropriate error for a misconfigured system? In that case, we can make the modification fairly quickly. Or is it required to implement support for such systems? In that case, we can do it in ~1 week, assuming no additional problems arise along the way.
@broskoTT
Is it only required that UMD throw an appropriate error for a misconfigured system? In that case, we can make the modification fairly quickly. Or is it required to implement support for such systems? In that case, we can do it in ~1 week, assuming no additional problems arise along the way.
Our requirement was support for such systems (i.e., multiple cards without ethernet).
@razorback3 thanks. There are two different problems here though: (1) UMD supporting multiple unconnected cards within a single cluster, and (2) two tt_metal runtimes each opening a single card. It sounds like the request is to have two tt_metal runtimes, where each of them uses a single card? That is the second problem then, not the first. The UMD API already supports opening a subset of the cluster.
Even if UMD implemented support for multiple unconnected cards, I don't think everything would work well if two tt_metal runtimes both open both cards. So this looks like a change required in tt_metal. In this case, I agree with @joelsmithTT that it would save a lot of time if UMD could detect that the client (tt_metal) is trying to open multiple unconnected cards, which is currently not supported, and throw an error during initialization.
Not sure who should comment on this further from the metal side, @pgkeller?
Our use-case falls in the second category: multiple users each running their own process, each trying to use only one unconnected card.
As a side question, do you guys have any plan to implement a feature similar to the CUDA_VISIBLE_DEVICES environment variable?
UMD should support multiple unconnected cards (not yet thoroughly tested) once https://github.com/tenstorrent/tt-umd/pull/306 is merged. Problems like the one investigated in this issue wouldn't be seen anymore.
I will be moving the issue to tt_metal, since there is still an item on their part to properly allow users to open only a part of the available system, which should already be available through UMD's constructor: https://github.com/tenstorrent/tt-metal/blob/main/tt_metal/llrt/tt_cluster.cpp#L259.
I would expect problems to arise if that is not done and two processes both try opening all the available devices. I'm also not completely sure that everything will work well even when that is done; UMD might still mishandle something (hugepages, for example). We haven't tested multiple host processes on the same machine.
As a side question, do you guys have any plan to implement a feature similar to the CUDA_VISIBLE_DEVICES environment variable?
I think this would be more of a question for tt_metal and how they would implement selecting a subset of devices for the client. UMD currently supports selecting a subset through the API. If there is a request to also be able to select this from an environment variable, we could do that easily.
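For illustration, a CUDA_VISIBLE_DEVICES-style filter could be as small as the sketch below. The variable name TT_VISIBLE_DEVICES and the surrounding plumbing are assumptions, not an existing tt_metal or UMD feature.

#include <cstdlib>
#include <set>
#include <sstream>
#include <string>

using chip_id_t = int;

// Parse e.g. TT_VISIBLE_DEVICES="2,3" into {2, 3}. An unset variable returns
// an empty set, used here as a sentinel for "all devices visible". Assumes a
// well-formed comma-separated list of integers.
std::set<chip_id_t> visible_devices_from_env() {
    std::set<chip_id_t> ids;
    const char* env = std::getenv("TT_VISIBLE_DEVICES");
    if (env == nullptr) return ids;
    std::stringstream ss(env);
    std::string token;
    while (std::getline(ss, token, ',')) {
        ids.insert(std::stoi(token));
    }
    return ids;
}

The client could then intersect this set with the enumerated devices before passing target device ids to the UMD constructor mentioned above.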
As a side question, do you guys have any plan to implement a feature similar to the CUDA_VISIBLE_DEVICES environment variable?
This was not on our roadmap and hasn't been a priority internally. If this is a high-priority need on your part, we'll add it. Thanks
Describe the bug
To improve development productivity, two WH_B0 cards were installed in the server so that two users can each use one WH_B0 card simultaneously for development. However, when running a program on this system, an assertion failure occurs in the Cluster::set_tunnels_from_mmio_device() function within the tt-metal runtime. (This function was merged in #9515.) I would appreciate it if you could let me know if there are any settings I might have missed.

To Reproduce
Steps to reproduce the behavior:

Expected behavior

Screenshots
Aborted (core dumped)
(python_env) dongjin@ttdev01:~/tt-metal/build/test/tt_metal$ TT_METAL_SLOW_DISPATCH_MODE=1 ./unit_tests_frequent
Running main() from /home/dongjin/tt-metal/.cpmcache/googletest/96129d89f45386492ae46d6bb8c027bc3df5f949/googletest/src/gtest_main.cc
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from Common
[ RUN      ] Common.AllCoresRunManyTimes
Test | INFO | Running iteration #0
Device | INFO | Opening user mode device driver
Detecting chips (found 4)
2024-07-16 22:05:43.962 | INFO | SiliconDriver - Detected 2 PCI devices : [0, 1]
2024-07-16 22:05:44.049 | INFO | SiliconDriver - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 0)
2024-07-16 22:05:44.056 | INFO | SiliconDriver - Detected 2 PCI devices : [0, 1]
2024-07-16 22:05:44.060 | WARNING | SiliconDriver - NumHostMemChannels: 2 used for device_id: 0x401e less than target: 3. Workload will fail if it exceeds NumHostMemChannels. Increase Number of Hugepages.
2024-07-16 22:05:44.088 | INFO | SiliconDriver - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 1)
2024-07-16 22:05:44.089 | INFO | SiliconDriver - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 2)
2024-07-16 22:05:44.090 | INFO | SiliconDriver - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 3)
Always | FATAL | Loop Exit Error.
unknown file: Failure
C++ exception with description "TT_ASSERT @ /home/dongjin/tt-metal/tt_metal/llrt/tt_cluster.cpp:656: tunneled_device_hit || (it == device_ids.end())
info:
Loop Exit Error.
backtrace:
--- void tt::assert::tt_assert<char [17]>(char const*, int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, bool, char const*, char const (&) [17])
--- tt::Cluster::set_tunnels_from_mmio_device()
--- tt::Cluster::Cluster()
--- tt::Cluster::instance()
--- tt::tt_metal::GetNumAvailableDevices()
--- ./unit_tests_frequent(+0xbd44) [0x5589a093bd44]
--- void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)
--- void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)
--- testing::Test::Run()
--- testing::TestInfo::Run()
--- testing::TestSuite::Run()
--- testing::internal::UnitTestImpl::RunAllTests()
--- bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)
--- bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)
--- testing::UnitTest::Run()
--- RUN_ALL_TESTS()
--- /home/dongjin/tt-metal/build/lib/libgtest_main.so.1.13.0(main+0x3d) [0x7fbc7018f18d]
--- /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fbc6facb083]
--- ./unit_tests_frequent(+0xaa5e) [0x5589a093aa5e]
" thrown in the test body.
[ FAILED ] Common.AllCoresRunManyTimes (258 ms)
[----------] 1 test from Common (258 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (258 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Common.AllCoresRunManyTimes
1 FAILED TEST
commit 354761a4cad252a45335854fa9466ff7b73904dc (HEAD -> main, origin/main, origin/HEAD)
Author: Bharane AB <bharane.amutha@multicorewareinc.com>
Date:   Tue Jul 16 06:18:24 2024 +0000