Expected behavior
MLC-LLM should load the sharded model across all 4 AMD Instinct MI100 GPUs and start inferring.
The issue is confirmed only with the xGMI bridge enabled; adding amdgpu.use_xgmi_p2p=0 to the grub config makes the issue stop with no other changes, though this reverts to PCIe P2P only.
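Concretely, the workaround looks roughly like this on a stock Ubuntu 22.04 grub setup (a sketch only; the "quiet splash" defaults are placeholders, keep whatever kernel parameters are already present):

# /etc/default/grub -- append the parameter to the kernel command line,
# then run `sudo update-grub` and reboot
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.use_xgmi_p2p=0"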
Here is the output when attempting to run with NCCL_DEBUG=INFO: screenlog.txt
Actual behavior
/src/extlibs/rccl/build/hipify/src/transport/p2p.cc:287 NCCL WARN Cuda failure 'invalid argument'
terminate called after throwing an instance of 'tvm::runtime::InternalError'
what(): [02:18:19] /workspace/tvm/src/runtime/disco/nccl/nccl.cc:196: rcclErrror: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
Stack trace:
0: _ZN3tvm7runtime6deta
1: tvm::runtime::nccl::InitCCLPerWorker(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
2: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>::AssignTypedLambda<void (*)(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>(void (*)(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
3: tvm::runtime::DiscoWorker::Impl::CallPacked(tvm::runtime::DiscoWorker*, long, tvm::runtime::PackedFunc, tvm::runtime::TVMArgs const&)
4: tvm::runtime::DiscoWorker::Impl::MainLoop(tvm::runtime::DiscoWorker*)
5: 0x00007ff61c0dc252
6: start_thread
at ./nptl/pthread_create.c:442
7: 0x00007ff64cd2665f
at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
8: 0xffffffffffffffff
Environment
Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): ROCm 6.0
Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu 22.04
Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): 4x AMD Instinct MI100
How you installed MLC-LLM (conda, source): conda
How you installed TVM-Unity (pip, source): pip
Python version (e.g. 3.10): 3.10.12
TVM Unity Hash Tag: unity.txt
Steps to reproduce
Install MLC-LLM
Run the following Python code to start loading/inferring:
# imports for the MLC-LLM Python API (mlc_chat package)
from mlc_chat import ChatConfig, ChatModule
from mlc_chat.callback import StreamToStdout

cm = ChatModule(model="goliath-120b-q4f16_1", chat_config=ChatConfig(
    max_gen_len=4096,
    conv_template="LM",
    temperature=0.75,
    repetition_penalty=1.1,
    top_p=0.9,
    tensor_parallel_shards=4,  # shard the model across the 4 MI100s
    context_window_size=4096,
))
output = cm.generate(
    prompt="What is the meaning of life?",
    progress_callback=StreamToStdout(callback_interval=2),
)
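For reference, the screenlog above was captured with NCCL_DEBUG=INFO set in the environment. As a minimal sketch (this wiring is illustrative, not part of the original run), the variable can also be exported from the repro script itself, as long as it is set before the ChatModule is constructed so the Disco workers see it when they initialize RCCL:

import os

# RCCL reads the NCCL_* environment variables at communicator-init time, so set
# this before constructing the ChatModule (or export it in the shell instead).
os.environ["NCCL_DEBUG"] = "INFO"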