[Bug] unhandled cuda error with ROCm 5.7

JackBinary commented 5 months ago

## 🐛 Bug

When I try to run Mixtral 8x7B at q4f16_1 split across two 16 GB GPUs, I get an unhandled cuda error from RCCL.

## To Reproduce

Steps to reproduce the behavior:

1. Install MLC-LLM
2. Convert the Mixtral weights
3. Compile Mixtral for ROCm with --tensor-parallel-shard 2
4. Run the following code with NCCL_DEBUG=INFO:


from mlc_llm import ChatModule
cm = ChatModule(model="./dist/Mixtral-8x7B-Instruct-v0.1-MLC",
                model_lib_path="./dist/libs/Mixtral-8x7B-Instruct-v0.1-q4f16_1-rocm.so", device="rocm")

while True:
    prompt = input("> ")
    cm.generate(prompt)


Produces the following log:

[15:41:12] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:12] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:12] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:12] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:12] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:12] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:12] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:12] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:12] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[2024-04-18 15:41:13] INFO auto_device.py:76: Found device: rocm:0
[2024-04-18 15:41:13] INFO auto_device.py:76: Found device: rocm:1
[2024-04-18 15:41:13] INFO chat_module.py:379: Using model folder: /home/user/mlc/dist/Mixtral-8x7B-Instruct-v0.1-MLC
[2024-04-18 15:41:13] INFO chat_module.py:380: Using mlc chat config: /home/user/mlc/dist/Mixtral-8x7B-Instruct-v0.1-MLC/mlc-chat-config.json
[2024-04-18 15:41:13] INFO chat_module.py:529: Using library model: ./dist/libs/Mixtral-8x7B-Instruct-v0.1-q4f16_1-rocm.so
[15:41:13] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:13] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:13] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:13] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:13] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:13] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:13] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:13] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:13] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[2024-04-18 15:41:13] INFO model_metadata.py:96: Total memory usage: 15417.07 MB (Parameters: 12599.13 MB. KVCache: 0.00 MB. Temporary buffer: 2817.94 MB)
[2024-04-18 15:41:13] INFO model_metadata.py:105: To reduce memory usage, tweak prefill_chunk_size, context_window_size and sliding_window_size
localhost:7497:7497 [0] NCCL INFO Bootstrap : Using enp12s0:192.168.1.236<0>
localhost:7497:7497 [0] NCCL INFO NET/Plugin : Plugin load (librccl-net.so) returned 0 : librccl-net.so: cannot open shared object file: No such file or directory
localhost:7497:7497 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
localhost:7497:7497 [0] NCCL INFO Kernel version: 6.5.0-27-generic
localhost:7497:7578 [0] NCCL INFO ROCr version 1.1
localhost:7497:7578 [0] NCCL INFO Dmabuf feature disabled without NCCL_ENABLE_DMABUF_SUPPORT=1
RCCL version 2.17.1+hip5.7 HEAD:3d014cc+
[15:41:14] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:14] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:14] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:14] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:14] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:14] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:14] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:14] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:14] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
terminate called after throwing an instance of 'tvm::runtime::InternalError'
  what(): [15:41:14] /workspace/tvm/src/runtime/disco/nccl/nccl.cc:87: rcclErrror: unhandled cuda error
Stack trace:
  0: _ZN3tvm7runtime6deta
  1: tvm::runtime::nccl::InitCCLPerWorker(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
  2: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>::AssignTypedLambda<void (*)(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>(void (*)(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  3: tvm::runtime::DiscoWorker::Impl::CallPacked(tvm::runtime::DiscoWorker*, long, tvm::runtime::PackedFunc, tvm::runtime::TVMArgs const&)
  4: tvm::runtime::DiscoWorker::Impl::MainLoop(tvm::runtime::DiscoWorker*)
  5: execute_native_thread_routine at ../../../../../libstdc++-v3/src/c++11/thread.cc:104
  6: start_thread at ./nptl/pthread_create.c:442
  7: 0x000076773652684f at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
  8: 0xffffffffffffffff

localhost:7579:7579 [1] NCCL INFO ROCr version 1.1
localhost:7579:7579 [1] NCCL INFO Dmabuf feature disabled without NCCL_ENABLE_DMABUF_SUPPORT=1
localhost:7579:7579 [1] NCCL INFO Bootstrap : Using enp12s0:192.168.1.236<0>
localhost:7579:7579 [1] NCCL INFO NET/Plugin : Plugin load (librccl-net.so) returned 2 : librccl-net.so: cannot open shared object file: No such file or directory
localhost:7579:7579 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
localhost:7579:7579 [1] NCCL INFO Kernel version: 6.5.0-27-generic
localhost:7579:7579 [1] NCCL INFO Failed to open libibverbs.so[.1]
localhost:7579:7579 [1] NCCL INFO NET/Socket : Using [0]enp12s0:192.168.1.236<0>
localhost:7579:7579 [1] NCCL INFO Using network Socket
localhost:7579:7579 [1] NCCL INFO rocm_smi_lib: version 5.0.0.0
localhost:7579:7579 [1] NCCL INFO Setting affinity for GPU 1 to ffffff
localhost:7579:7579 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 comm 0x3726740 nRanks 02 busId 6000
localhost:7579:7579 [1] NCCL INFO P2P Chunksize set to 131072
localhost:7579:7579 [1] NCCL INFO Channel 00/0 : 1[6000] -> 0[3000] via P2P/IPC comm 0x3726740 nRanks 02
localhost:7579:7579 [1] NCCL INFO Channel 01/0 : 1[6000] -> 0[3000] via P2P/IPC comm 0x3726740 nRanks 02

localhost:7579:7579 [1] /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/transport/p2p.cc:228 NCCL WARN Cuda failure 'invalid device pointer'
localhost:7579:7579 [1] NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/transport/p2p.cc:342 -> 1
localhost:7579:7579 [1] NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/transport.cc:160 -> 1
localhost:7579:7579 [1] NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/init.cc:1269 -> 1
localhost:7579:7579 [1] NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/init.cc:1500 -> 1
localhost:7579:7579 [1] NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/init.cc:1701 -> 1
localhost:7579:7579 [1] NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/init.cc:1738 -> 1
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/user/miniconda3/envs/mlc-prebuilt/lib/python3.11/site-packages/mlc_llm/cli/worker.py", line 51, in <module>
    main()
  File "/home/user/miniconda3/envs/mlc-prebuilt/lib/python3.11/site-packages/mlc_llm/cli/worker.py", line 46, in main
    worker_func(worker_id, num_workers, reader, writer)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 277, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/home/user/miniconda3/envs/mlc-prebuilt/lib/python3.11/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  6: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (int, int, long, long)>::AssignTypedLambda<void (*)(int, int, long, long)>(void (*)(int, int, long, long), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  5: tvm::runtime::WorkerProcess(int, int, long, long)
  4: tvm::runtime::DiscoWorker::Impl::MainLoop(tvm::runtime::DiscoWorker*)
  3: tvm::runtime::DiscoWorker::Impl::CallPacked(tvm::runtime::DiscoWorker*, long, tvm::runtime::PackedFunc, tvm::runtime::TVMArgs const&)
  2: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>::AssignTypedLambda<void (*)(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>(void (*)(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  1: tvm::runtime::nccl::InitCCLPerWorker(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
  0: _ZN3tvm7runtime6deta
  File "/workspace/tvm/src/runtime/disco/nccl/nccl.cc", line 87
rcclErrror: unhandled cuda error
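
The entry that matters is easy to miss among the repeated warnings: rank 1 reports `Cuda failure 'invalid device pointer'` from `p2p.cc:228`, and the `-> 1` entries that follow show the error code being passed up through `init.cc`. Below is a minimal Python sketch for pulling such entries out of a saved log. It assumes the log has one entry per line; the function name and the regular expression are illustrative and not part of MLC-LLM or RCCL.

```python
import re

# Matches RCCL/NCCL warning entries of the form
#   "<host>:<pid>:<tid> [<rank>] <source file>:<line> NCCL WARN <message>"
# (pattern inferred from the log above; an assumption, not a documented format).
NCCL_WARN = re.compile(
    r"\[(?P<rank>\d+)\]\s+(?P<src>\S+):(?P<line>\d+)\s+NCCL WARN\s+(?P<msg>.*)"
)


def find_nccl_warnings(log_text):
    """Return (rank, source_file, line_number, message) for each NCCL WARN entry."""
    hits = []
    for entry in log_text.splitlines():
        match = NCCL_WARN.search(entry)
        if match:
            hits.append(
                (
                    int(match["rank"]),
                    match["src"],
                    int(match["line"]),
                    match["msg"].strip(),
                )
            )
    return hits


# Two entries copied from the log above: one WARN and one INFO.
sample_log = "\n".join(
    [
        "localhost:7579:7579 [1] /long_pathname_so_that_rpms_can_package_the_debug_info"
        "/src/extlibs/rccl/build/hipify/src/transport/p2p.cc:228 NCCL WARN "
        "Cuda failure 'invalid device pointer'",
        "localhost:7579:7579 [1] NCCL INFO P2P Chunksize set to 131072",
    ]
)

for rank, src, line, msg in find_nccl_warnings(sample_log):
    print(f"rank {rank}: {msg} ({src.rsplit('/', 1)[-1]}:{line})")
```

Only the WARN entry is reported; INFO entries are skipped because they lack the `<file>:<line> NCCL WARN` shape.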



## Expected behavior

The model loads and I can start a chat with Mixtral 8x7B for testing.

## Environment

 - Platform: ROCm 5.7
 - Operating system: Ubuntu 22.04.4
 - Device: 2x Radeon Instinct MI25 16GB
 - How you installed MLC-LLM: conda
 - How you installed TVM-Unity: pip
 - Python version (e.g. 3.10): 3.11
 - TVM Unity Hash Tag: 0c81069ea42f393f6ff24efc47f15bc2316cfb10
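
For anyone trying to reproduce this on a similar setup, part of the list above can be gathered with a short script. This is a sketch only: the helper names are made up for illustration, and the two package names passed at the bottom are assumptions about how the ROCm 5.7 wheels are named, so substitute whatever `pip list` shows in your environment.

```python
import platform
from importlib import metadata


def installed_version(package):
    """Return the installed version of a distribution, or a placeholder if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_report(packages):
    """Format OS, Python, and package versions in the style of the issue template."""
    lines = [
        f" - Operating system: {platform.platform()}",
        f" - Python version: {platform.python_version()}",
    ]
    lines += [f" - {name}: {installed_version(name)}" for name in packages]
    return "\n".join(lines)


# Package names below are assumptions; replace them with the wheels you installed.
print(environment_report(["mlc-llm-nightly-rocm57", "mlc-ai-nightly-rocm57"]))
```

The ROCm version, GPU model, and TVM commit hash still have to be filled in by hand, since they are not visible to Python's package metadata.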

## Additional context

Fresh install of Ubuntu 22.04.4 with kernel 6.5.0-27-generic.

TNT3530 commented 5 months ago

Good to know that it also happens on non-bridged GPUs: https://github.com/mlc-ai/relax/issues/317

Sing-Li commented 5 months ago

Thanks for reporting. Let's hope we get the upstream fix soon 🤞 Also waiting on https://github.com/mlc-ai/mlc-llm/issues/2144

MasterJH5574 commented 1 month ago

Just want to put a note here that we've bumped the ROCm support to 6.1/6.2 and you are welcome to try out the prebuilt mlc packages at https://llm.mlc.ai/docs/install/mlc_llm.html#option-1-prebuilt-package

TNT3530 commented 1 month ago

> Just want to put a note here that we've bumped the ROCm support to 6.1/6.2 and you are welcome to try out the prebuilt mlc packages at https://llm.mlc.ai/docs/install/mlc_llm.html#option-1-prebuilt-package

Can confirm that the latest mlc 0.15.dev544 fixes it.

mlc-ai / mlc-llm

[Bug] unhandled cuda error with ROCm 5.7 #2160
