Expected behavior
MLC-LLM should load the sharded model across all 4 AMD Instinct MI100 GPUs and start inferring.
The issue is confirmed only with the xGMI bridge enabled; adding amdgpu.use_xgmi_p2p=0 to the grub config makes the issue stop with no other changes, though this reverts to PCIe P2P only.
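Concretely, the workaround looks roughly like this on a stock Ubuntu 22.04 grub setup (a sketch only; the "quiet splash" defaults are placeholders, keep whatever kernel parameters are already present):

# /etc/default/grub -- append the parameter to the kernel command line,
# then run `sudo update-grub` and reboot
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.use_xgmi_p2p=0"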
Here is the output when attempting to run with NCCL_DEBUG=INFO: screenlog.txt
Actual behavior
/src/extlibs/rccl/build/hipify/src/transport/p2p.cc:287 NCCL WARN Cuda failure 'invalid argument'
terminate called after throwing an instance of 'tvm::runtime::InternalError'
what(): [02:18:19] /workspace/tvm/src/runtime/disco/nccl/nccl.cc:196: rcclErrror: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
Stack trace:
0: _ZN3tvm7runtime6deta
1: tvm::runtime::nccl::InitCCLPerWorker(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
2: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>::AssignTypedLambda<void (*)(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>(void (*)(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
3: tvm::runtime::DiscoWorker::Impl::CallPacked(tvm::runtime::DiscoWorker*, long, tvm::runtime::PackedFunc, tvm::runtime::TVMArgs const&)
4: tvm::runtime::DiscoWorker::Impl::MainLoop(tvm::runtime::DiscoWorker*)
5: 0x00007ff61c0dc252
6: start_thread
at ./nptl/pthread_create.c:442
7: 0x00007ff64cd2665f
at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
8: 0xffffffffffffffff
Environment
Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): ROCm 6.0
Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu 22.04
Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): 4x AMD Instinct MI100
How you installed MLC-LLM (conda, source): conda
How you installed TVM-Unity (pip, source): pip
Python version (e.g. 3.10): 3.10.12
TVM Unity Hash Tag: unity.txt
Steps to reproduce
Install MLC-LLM
Run the following Python code to start loading/inferring:
# imports for the MLC-LLM Python API (mlc_chat package)
from mlc_chat import ChatConfig, ChatModule
from mlc_chat.callback import StreamToStdout

cm = ChatModule(model="goliath-120b-q4f16_1", chat_config=ChatConfig(
    max_gen_len=4096,
    conv_template="LM",
    temperature=0.75,
    repetition_penalty=1.1,
    top_p=0.9,
    tensor_parallel_shards=4,  # shard the model across the 4 MI100s
    context_window_size=4096,
))
output = cm.generate(
    prompt="What is the meaning of life?",
    progress_callback=StreamToStdout(callback_interval=2),
)
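For reference, the screenlog above was captured with NCCL_DEBUG=INFO set in the environment. As a minimal sketch (this wiring is illustrative, not part of the original run), the variable can also be exported from the repro script itself, as long as it is set before the ChatModule is constructed so the Disco workers see it when they initialize RCCL:

import os

# RCCL reads the NCCL_* environment variables at communicator-init time, so set
# this before constructing the ChatModule (or export it in the shell instead).
os.environ["NCCL_DEBUG"] = "INFO"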