Closed chymian closed 3 months ago
the model-config for CodeLlama-34b-Instruct-hf-q4f16_1 has a bug:
It's actually a pretty bad design rather than a bug: it lets this model share the same model lib as the base model CodeLlama-34b-hf. We will get rid of it in the next release.
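To make the design issue concrete: the Instruct model's config entry points at a library compiled under the base model's name. Below is a minimal sketch of flagging such entries, assuming each entry is a dict with the `model_lib` and `local_id` keys that appear in the config fragment reported in this issue. The helper name `shares_base_lib` is hypothetical and not part of MLC-LLM.

```python
def shares_base_lib(entry: dict) -> bool:
    """Return True when an entry reuses a library compiled under a different model id."""
    return entry["model_lib"] != entry["local_id"]


# Values taken from the config fragment quoted in this issue.
entry = {
    "model_lib": "CodeLlama-34b-hf-q4f16_1",
    "local_id": "CodeLlama-34b-Instruct-hf-q4f16_1",
}
print(shares_base_lib(entry))
```

A check like this only reports that the library is shared; it does not say whether sharing is harmful for a given pair of models.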
loading the module while starting the benchmark or rest-api hangs the machine
This is definitely unexpected from my end. I will try to reproduce it and get back to you later today. BTW, could you set the environment variable NCCL_DEBUG=INFO and paste its full output in this thread? The additional context would be very helpful for us.
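One way to collect that output is sketched below. It runs a command with NCCL_DEBUG=INFO set and writes stdout and stderr into a single log file, so that crash messages are captured along with the NCCL lines. The helper name `run_with_nccl_debug` and the stand-in demo command are assumptions for illustration; the real benchmark command would be substituted in.

```python
import os
import subprocess
import sys
import tempfile


def run_with_nccl_debug(cmd, log_path):
    """Run cmd with NCCL_DEBUG=INFO and write combined stdout/stderr to log_path."""
    env = dict(os.environ, NCCL_DEBUG="INFO")
    with open(log_path, "w") as log:
        # stderr is merged into stdout so NCCL messages and crashes land in one file.
        result = subprocess.run(cmd, env=env, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


if __name__ == "__main__":
    # Stand-in command that only echoes the variable; replace it with the real invocation.
    demo_cmd = [sys.executable, "-c", "import os; print(os.environ['NCCL_DEBUG'])"]
    demo_log = os.path.join(tempfile.mkdtemp(), "nccl_debug_demo.log")
    print(run_with_nccl_debug(demo_cmd, demo_log), demo_log)
```

The return code is passed through, which helps tell a clean exit apart from a crash.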
I did recompile the lib (the Docker image is 2 days old).
MODEL_NAME=CodeLlama-34b-Instruct-hf
NCCL_VERSION=2.17.1-1
NCCL_DEBUG=INFO
NUM_SHARDS=4
python -m mlc_chat.cli.benchmark --model ${PATH_TEST}/params --device "cuda" --prompt "What is the meaning of life?" --generate-length 256 | tee nccl_debug_benchmark.log 2>&1
[12:27:11] /workspace/tvm/src/runtime/disco/worker.cc:65: [Worker #0] [PID 1223345 TID 0x00007f4d42a36640] started
[12:27:11] /workspace/tvm/src/runtime/disco/nccl/nccl.cc:165: Initializing nccl with devices: [0, 1, 2, 3]
utopia:1223345:1223345 [0] NCCL INFO Bootstrap : Using br0:192.168.178.17<0>
utopia:1223345:1223345 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
utopia:1223345:1223345 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
utopia:1223345:1223501 [0] NCCL INFO cudaDriverVersion 12030
NCCL version 2.18.3+cuda12.1
[12:27:11] /workspace/tvm/src/runtime/disco/worker.cc:65: [Worker #3] [PID 1223504 TID 0x00007f126d0a7740] started
[12:27:11] /workspace/tvm/src/runtime/disco/worker.cc:65: [Worker #2] [PID 1223503 TID 0x00007f5d4bb3e740] started
[12:27:11] /workspace/tvm/src/runtime/disco/worker.cc:65: [Worker #1] [PID 1223502 TID 0x00007f5d2f198740] started
I searched the configured Ubuntu repos for the file libnccl-net.so: not found.
I was still unable to reproduce this issue :((
I created a new benchmark Docker image today and ran the test.
time python -m mlc_chat.cli.benchmark --model /tmp/test//params --device cuda --prompt What is the meaning of life? --generate-length 256
utopia:350839:350839 [0] NCCL INFO Bootstrap : Using br0:192.168.178.17<0>
utopia:350839:350839 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
utopia:350839:350839 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
utopia:350839:350843 [0] NCCL INFO cudaDriverVersion 12030
NCCL version 2.18.3+cuda12.1
utopia:350845:350845 [2] NCCL INFO cudaDriverVersion 12030
utopia:350845:350845 [2] NCCL INFO Bootstrap : Using br0:192.168.178.17<0>
utopia:350845:350845 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
utopia:350845:350845 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation
utopia:350846:350846 [3] NCCL INFO cudaDriverVersion 12030
utopia:350846:350846 [3] NCCL INFO Bootstrap : Using br0:192.168.178.17<0>
utopia:350846:350846 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
utopia:350846:350846 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation
utopia:350844:350844 [1] NCCL INFO cudaDriverVersion 12030
utopia:350844:350844 [1] NCCL INFO Bootstrap : Using br0:192.168.178.17<0>
utopia:350844:350844 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
utopia:350844:350844 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
utopia:350839:350843 [0] NCCL INFO Failed to open libibverbs.so[.1]
utopia:350839:350843 [0] NCCL INFO NET/Socket : Using [0]br0:192.168.178.17<0> [1]zt4mrrjgxa:10.11.1.17<0> [2]br-ba518786015b:172.18.0.1<0> [3]br-d3180bd8c5d7:172.27.0.1<0> [4]veth2bbc0c6:fe80::7cb4:48ff:fe43:8f5b%veth2bbc0c6<0>
utopia:350839:350843 [0] NCCL INFO Using network Socket
utopia:350844:350844 [1] NCCL INFO Failed to open libibverbs.so[.1]
utopia:350844:350844 [1] NCCL INFO NET/Socket : Using [0]br0:192.168.178.17<0> [1]zt4mrrjgxa:10.11.1.17<0> [2]br-ba518786015b:172.18.0.1<0> [3]br-d3180bd8c5d7:172.27.0.1<0> [4]veth2bbc0c6:fe80::7cb4:48ff:fe43:8f5b%veth2bbc0c6<0>
utopia:350844:350844 [1] NCCL INFO Using network Socket
utopia:350846:350846 [3] NCCL INFO Failed to open libibverbs.so[.1]
utopia:350846:350846 [3] NCCL INFO NET/Socket : Using [0]br0:192.168.178.17<0> [1]zt4mrrjgxa:10.11.1.17<0> [2]br-ba518786015b:172.18.0.1<0> [3]br-d3180bd8c5d7:172.27.0.1<0> [4]veth2bbc0c6:fe80::7cb4:48ff:fe43:8f5b%veth2bbc0c6<0>
utopia:350846:350846 [3] NCCL INFO Using network Socket
utopia:350845:350845 [2] NCCL INFO Failed to open libibverbs.so[.1]
utopia:350845:350845 [2] NCCL INFO NET/Socket : Using [0]br0:192.168.178.17<0> [1]zt4mrrjgxa:10.11.1.17<0> [2]br-ba518786015b:172.18.0.1<0> [3]br-d3180bd8c5d7:172.27.0.1<0> [4]veth2bbc0c6:fe80::7cb4:48ff:fe43:8f5b%veth2bbc0c6<0>
utopia:350845:350845 [2] NCCL INFO Using network Socket
utopia:350839:350843 [0] NCCL INFO comm 0x7f8798929f30 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x4d6b1616b168de55 - Init START
utopia:350844:350844 [1] NCCL INFO comm 0x558c95a2ff50 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 5000 commId 0x4d6b1616b168de55 - Init START
utopia:350846:350846 [3] NCCL INFO comm 0x563a7c661f20 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 9000 commId 0x4d6b1616b168de55 - Init START
utopia:350845:350845 [2] NCCL INFO comm 0x563171d63fe0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 8000 commId 0x4d6b1616b168de55 - Init START
utopia:350844:350844 [1] NCCL INFO NVLS multicast support is not available on dev 1
utopia:350844:350844 [1] graph/search.cc:960 NCCL WARN Could not find a path for pattern 4, falling back to simple order
utopia:350844:350844 [1] graph/search.cc:960 NCCL WARN Could not find a path for pattern 1, falling back to simple order
utopia:350846:350846 [3] NCCL INFO NVLS multicast support is not available on dev 3
utopia:350846:350846 [3] graph/search.cc:960 NCCL WARN Could not find a path for pattern 4, falling back to simple order
utopia:350846:350846 [3] graph/search.cc:960 NCCL WARN Could not find a path for pattern 1, falling back to simple order
utopia:350845:350845 [2] NCCL INFO NVLS multicast support is not available on dev 2
utopia:350845:350845 [2] graph/search.cc:960 NCCL WARN Could not find a path for pattern 4, falling back to simple order
utopia:350845:350845 [2] graph/search.cc:960 NCCL WARN Could not find a path for pattern 1, falling back to simple order
utopia:350839:350843 [0] NCCL INFO NVLS multicast support is not available on dev 0
utopia:350839:350843 [0] graph/search.cc:960 NCCL WARN Could not find a path for pattern 4, falling back to simple order
utopia:350839:350843 [0] graph/search.cc:960 NCCL WARN Could not find a path for pattern 1, falling back to simple order
utopia:350839:350843 [0] NCCL INFO Channel 00/02 : 0 1 2 3
utopia:350846:350846 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
utopia:350845:350845 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
utopia:350846:350846 [3] NCCL INFO P2P Chunksize set to 131072
utopia:350844:350844 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
utopia:350845:350845 [2] NCCL INFO P2P Chunksize set to 131072
utopia:350844:350844 [1] NCCL INFO P2P Chunksize set to 131072
utopia:350839:350843 [0] NCCL INFO Channel 01/02 : 0 1 2 3
utopia:350839:350843 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
utopia:350839:350843 [0] NCCL INFO P2P Chunksize set to 131072
utopia:350844:350844 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/IPC
utopia:350839:350843 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
utopia:350839:350843 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
utopia:350845:350845 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/IPC
utopia:350845:350845 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/IPC
utopia:350844:350844 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/IPC
utopia:350846:350846 [3] NCCL INFO Channel 00 : 3[3] -> 0[0] via SHM/direct/direct
utopia:350846:350846 [3] NCCL INFO Channel 01 : 3[3] -> 0[0] via SHM/direct/direct
utopia:350844:350844 [1] NCCL INFO Connected all rings
utopia:350845:350845 [2] NCCL INFO Connected all rings
utopia:350846:350846 [3] NCCL INFO Connected all rings
utopia:350846:350846 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/IPC
utopia:350839:350843 [0] NCCL INFO Connected all rings
utopia:350844:350844 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
utopia:350846:350846 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/IPC
utopia:350844:350844 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
utopia:350839:350843 [0] NCCL INFO Connected all trees
utopia:350839:350843 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
utopia:350839:350843 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
utopia:350845:350845 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/IPC
utopia:350845:350845 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/IPC
utopia:350846:350846 [3] NCCL INFO Connected all trees
utopia:350845:350845 [2] NCCL INFO Connected all trees
utopia:350845:350845 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
utopia:350845:350845 [2] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
utopia:350846:350846 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
utopia:350846:350846 [3] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
utopia:350844:350844 [1] NCCL INFO Connected all trees
utopia:350844:350844 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
utopia:350844:350844 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
Bus error (core dumped)
real 0m3.010s
user 0m1.543s
sys 0m1.187s
OK, the bus error is actually pretty common. It happens because NCCL allocates a large chunk of shared memory (shm). You can set the shm size explicitly, as done here: https://github.com/mlc-ai/llm-perf-bench/blob/main/docker/bash.sh#L49
Mildly increasing it to somewhere around 1GB should work.
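Before changing anything, it can help to confirm how much shared memory the container actually has. The sketch below reads the size of the `/dev/shm` mount and compares it with a threshold. The 1 GiB default mirrors the "around 1GB" suggestion above and is an assumption, not a documented NCCL requirement; the function names are hypothetical.

```python
import os
import shutil

GIB = 1024 ** 3


def shm_total_bytes(path="/dev/shm"):
    """Total size in bytes of the filesystem mounted at path."""
    return shutil.disk_usage(path).total


def shm_is_sufficient(total_bytes, required_bytes=GIB):
    """True when the shared-memory mount is at least required_bytes large."""
    return total_bytes >= required_bytes


if __name__ == "__main__":
    if os.path.exists("/dev/shm"):
        total = shm_total_bytes()
        print(f"/dev/shm: {total / GIB:.2f} GiB, sufficient: {shm_is_sufficient(total)}")
    else:
        print("/dev/shm not found on this system")
```

Run inside the container, since the size of `/dev/shm` there can differ from the host (with Docker it is controlled by the `--shm-size` option).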
I bumped it incrementally up to 24G but still get the same error. I tried with images newly created yesterday and today.
BTW, there is a regression with Mistral (which I was running as a reference): see 1316
I ran into a similar issue recently, with NCCL preventing the disco worker from starting. It resulted in bizarre errors like "name 'open' is not defined", which is curious since that is a built-in Python function...
The first issue I had was that I had not built MLC-LLM relax with NCCL enabled in the cmake config. I also had to install NCCL using apt (libnccl-dev and libnccl2).
However, when building relax, I received some new errors like: undefined reference to cudaGetDeviceProperties_v2 and undefined reference to cudaLaunchKernelExC in ncclLaunchKernel. As it turns out, those functions weren't added until CUDA 12.2.
@chymian I would try upgrading to CUDA 12.2 and if that doesn't work, building relax from source.
However, when building relax, I received some new errors like: undefined reference to cudaGetDeviceProperties_v2 and undefined reference to cudaLaunchKernelExC in ncclLaunchKernel. As it turns out, those functions weren't added until CUDA 12.2.
@ereish64 I have run into this issue before as well. Usually it's because the NCCL version does not match the CUDA runtime.
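The version numbers involved can be read straight from the NCCL_DEBUG=INFO output. The sketch below pulls out the NCCL version, the CUDA version NCCL was built for, and the CUDA driver version from log text shaped like the lines pasted in this thread. The function name `parse_nccl_log` is hypothetical, and the patterns assume that log format.

```python
import re


def parse_nccl_log(text):
    """Extract NCCL version, the CUDA version NCCL was built for, and the CUDA driver version."""
    info = {}
    m = re.search(r"NCCL version (\d+)\.(\d+)\.(\d+)\+cuda(\d+)\.(\d+)", text)
    if m:
        nums = tuple(int(g) for g in m.groups())
        info["nccl"] = nums[:3]
        info["nccl_built_for_cuda"] = nums[3:]
    m = re.search(r"cudaDriverVersion (\d+)", text)
    if m:
        raw = int(m.group(1))
        # CUDA encodes versions as 1000 * major + 10 * minor.
        info["cuda_driver"] = (raw // 1000, (raw % 1000) // 10)
    return info


# Two lines copied from the log earlier in this thread.
log = """
utopia:350839:350843 [0] NCCL INFO cudaDriverVersion 12030
NCCL version 2.18.3+cuda12.1
"""
print(parse_nccl_log(log))
```

For the log in this thread this reports NCCL 2.18.3 built for CUDA 12.1 with a 12.3 driver, while the environment sets NCCL_VERSION=2.17.1-1. The sketch only surfaces the numbers; it does not decide whether a given combination is compatible.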
Closing this issue due to inactivity. Please open new ones for anything to report or discuss.
🐛 Bug
The system hangs while loading CodeLlama-34b-Instruct-hf onto multiple GPUs (4 x 3070 8GB).
This is a follow-up from the Discord chat: https://discord.com/channels/1108078063931097149/1108079873999773746/1173238767117668353 @junrushao & @Sing-Li
To Reproduce
I was using the benchmark repo:
https://github.com/mlc-ai/llm-perf-bench
to build a CUDA Docker image, following exactly the steps in the README.
Steps to reproduce the behavior:
IMPORTANT: see Additional Context
python -m mlc_chat.cli.benchmark \
  --model ${PATH_TEST}/params \
  --device "cuda" \
  --prompt "What is the meaning of life?" \
  --generate-length 256
INFO: Started server process [787201]
INFO: Waiting for application startup.
[12:06:28] /workspace/tvm/src/runtime/disco/worker.cc:65: [Worker #0] [PID 787201 TID 0x00007f4c3eb44640] started
[12:06:28] /workspace/tvm/src/runtime/disco/nccl/nccl.cc:165: Initializing nccl with devices: [0, 1, 2, 3]
[12:06:28] /workspace/tvm/src/runtime/disco/worker.cc:65: [Worker #1] [PID 787261 TID 0x00007f6dffc7b740] started
[12:06:28] /workspace/tvm/src/runtime/disco/worker.cc:65: [Worker #3] [PID 787263 TID 0x00007fcf25442740] started
[12:06:29] /workspace/tvm/src/runtime/disco/worker.cc:65: [Worker #2] [PID 787262 TID 0x00007f14a0dbd740] started
{ "model_lib": "CodeLlama-34b-hf-q4f16_1", "local_id": "CodeLlama-34b-Instruct-hf-q4f16_1",