mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] loading module to multiGPU hangs #1239

Closed: chymian closed this issue 3 months ago

chymian commented 7 months ago

🐛 Bug

The system hangs while loading CodeLlama-34b-Instruct-hf onto multiple GPUs (4 x RTX 3070 8 GB).

This is a follow-up to the Discord chat https://discord.com/channels/1108078063931097149/1108079873999773746/1173238767117668353 (cc @junrushao & @Sing-Li).

To Reproduce

I used the benchmark repo https://github.com/mlc-ai/llm-perf-bench to build a CUDA Docker image and followed the steps in the README exactly.

Steps to reproduce the behavior:

  1. Same as in the README.
  2. Changed only the model name and the number of shards:
    MODEL_NAME=CodeLlama-34b-Instruct-hf
    QUANTIZATION=q4f16_1
    NUM_SHARDS=4
  3. Compilation was successful.

IMPORTANT: see Additional Context.

  4. Loading the module while starting the benchmark or the REST API hangs the machine:

    # PYTHONPATH=python python -m mlc_chat.rest --model $MODEL --device cuda --host 0.0.0.0

    python -m mlc_chat.cli.benchmark \
        --model ${PATH_TEST}/params \
        --device "cuda" \
        --prompt "What is the meaning of life?" \
        --generate-length 256

    INFO: Started server process [787201]
    INFO: Waiting for application startup.
    [12:06:28] /workspace/tvm/src/runtime/disco/worker.cc:65: [Worker #0] [PID 787201 TID 0x00007f4c3eb44640] started
    [12:06:28] /workspace/tvm/src/runtime/disco/nccl/nccl.cc:165: Initializing nccl with devices: [0, 1, 2, 3]
    [12:06:28] /workspace/tvm/src/runtime/disco/worker.cc:65: [Worker #1] [PID 787261 TID 0x00007f6dffc7b740] started
    [12:06:28] /workspace/tvm/src/runtime/disco/worker.cc:65: [Worker #3] [PID 787263 TID 0x00007fcf25442740] started
    [12:06:29] /workspace/tvm/src/runtime/disco/worker.cc:65: [Worker #2] [PID 787262 TID 0x00007f14a0dbd740] started

- `nvidia-smi` shows 100% utilization, and VRAM usage climbs to 395-435 MB on all 4 GPUs and then stops there.
- CPU: all 4 cores at 100%.

I have to manually SIGKILL all processes.
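For reference, a minimal sketch of doing that, assuming the hung disco workers still match `mlc_chat` in their command line:

    # Force-kill every process whose command line matches 'mlc_chat' (disco workers included).
    pkill -9 -f mlc_chat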

The behavior is the same inside the Docker image as on bare metal.

## Expected behavior

The model should load onto the available GPUs.

## Environment

 - Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): CUDA 12.1
 - Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu 22.04
 - Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): 4 x RTX 3070 8 GB
 - How you installed MLC-LLM (`conda`, source): benchmark-docker-image
 - How you installed TVM-Unity (`pip`, source): benchmark-docker-image
 - Python version (e.g. 3.10): benchmark-docker-image
 - GPU driver version (if applicable): 545.23.06 
 - CUDA/cuDNN version (if applicable): 12.1 
 - TVM Unity Hash Tag (`python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))"`, applicable if you compile models):
 - Any other relevant information:

## Additional context
__IMPORTANT:__

The model config for CodeLlama-34b-Instruct-hf-q4f16_1 has a bug:

{ "model_lib": "CodeLlama-34b-hf-q4f16_1", "local_id": "CodeLlama-34b-Instruct-hf-q4f16_1",


Compilation produces a lib named after `model_name` (CodeLlama-34b-Instruct-hf-q4f16_1-cuda.so), but at runtime `model_lib` (CodeLlama-34b-hf-q4f16_1-cuda.so) is looked up, so the newly compiled lib is never found.

Workaround: change the `model_lib` line in the config file to include `Instruct-`:

    "model_lib": "CodeLlama-34b-Instruct-hf-q4f16_1",

EDIT:
added additional context
junrushao commented 7 months ago

> the model-config for CodeLlama-34b-Instruct-hf-q4f16_1 has a bug

It's actually a pretty bad design decision rather than a bug: it allows this model to use the same model lib as the base model CodeLlama-34b-hf. We will get rid of it in the next release.

> loading the module while starting the benchmark or rest-api hangs the machine

This is definitely unexpected on my end. I will try to reproduce and get back to you later today. BTW, could you set the environment variable `NCCL_DEBUG=INFO` and paste its full output in this thread? It would be very helpful for getting additional context.
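For example, something along these lines (a sketch reusing the REST command from above; the same variable applies to the benchmark CLI):

    # Rerun with NCCL debug logging enabled and capture the output.
    NCCL_DEBUG=INFO PYTHONPATH=python python -m mlc_chat.rest --model $MODEL --device cuda --host 0.0.0.0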

chymian commented 7 months ago

I recompiled the lib (Docker image 2 days old) with:

    MODEL_NAME=CodeLlama-34b-Instruct-hf
    NCCL_VERSION=2.17.1-1
    NCCL_DEBUG=INFO
    NUM_SHARDS=4

python -m mlc_chat.cli.benchmark --model ${PATH_TEST}/params --device "cuda" --prompt "What is the meaning of life?" --generate-length 256 | tee nccl_debug_benchmark.log 2>&1
[12:27:11] /workspace/tvm/src/runtime/disco/worker.cc:65: [Worker #0] [PID 1223345 TID 0x00007f4d42a36640] started
[12:27:11] /workspace/tvm/src/runtime/disco/nccl/nccl.cc:165: Initializing nccl with devices: [0, 1, 2, 3]
utopia:1223345:1223345 [0] NCCL INFO Bootstrap : Using br0:192.168.178.17<0>
utopia:1223345:1223345 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
utopia:1223345:1223345 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
utopia:1223345:1223501 [0] NCCL INFO cudaDriverVersion 12030
NCCL version 2.18.3+cuda12.1
[12:27:11] /workspace/tvm/src/runtime/disco/worker.cc:65: [Worker #3] [PID 1223504 TID 0x00007f126d0a7740] started
[12:27:11] /workspace/tvm/src/runtime/disco/worker.cc:65: [Worker #2] [PID 1223503 TID 0x00007f5d4bb3e740] started

[12:27:11] /workspace/tvm/src/runtime/disco/worker.cc:65: [Worker #1] [PID 1223502 TID 0x00007f5d2f198740] started

I searched the configured Ubuntu repos for the file libnccl-net.so, but did not find it.

junrushao commented 7 months ago

I was still unable to reproduce this issue :((

chymian commented 7 months ago

I created a new benchmark Docker image today and ran the test.

time python -m mlc_chat.cli.benchmark --model /tmp/test//params --device cuda --prompt "What is the meaning of life?" --generate-length 256
utopia:350839:350839 [0] NCCL INFO Bootstrap : Using br0:192.168.178.17<0>
utopia:350839:350839 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
utopia:350839:350839 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
utopia:350839:350843 [0] NCCL INFO cudaDriverVersion 12030
NCCL version 2.18.3+cuda12.1
utopia:350845:350845 [2] NCCL INFO cudaDriverVersion 12030
utopia:350845:350845 [2] NCCL INFO Bootstrap : Using br0:192.168.178.17<0>
utopia:350845:350845 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
utopia:350845:350845 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation
utopia:350846:350846 [3] NCCL INFO cudaDriverVersion 12030
utopia:350846:350846 [3] NCCL INFO Bootstrap : Using br0:192.168.178.17<0>
utopia:350846:350846 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
utopia:350846:350846 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation
utopia:350844:350844 [1] NCCL INFO cudaDriverVersion 12030
utopia:350844:350844 [1] NCCL INFO Bootstrap : Using br0:192.168.178.17<0>
utopia:350844:350844 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
utopia:350844:350844 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
utopia:350839:350843 [0] NCCL INFO Failed to open libibverbs.so[.1]
utopia:350839:350843 [0] NCCL INFO NET/Socket : Using [0]br0:192.168.178.17<0> [1]zt4mrrjgxa:10.11.1.17<0> [2]br-ba518786015b:172.18.0.1<0> [3]br-d3180bd8c5d7:172.27.0.1<0> [4]veth2bbc0c6:fe80::7cb4:48ff:fe43:8f5b%veth2bbc0c6<0>
utopia:350839:350843 [0] NCCL INFO Using network Socket
utopia:350844:350844 [1] NCCL INFO Failed to open libibverbs.so[.1]
utopia:350844:350844 [1] NCCL INFO NET/Socket : Using [0]br0:192.168.178.17<0> [1]zt4mrrjgxa:10.11.1.17<0> [2]br-ba518786015b:172.18.0.1<0> [3]br-d3180bd8c5d7:172.27.0.1<0> [4]veth2bbc0c6:fe80::7cb4:48ff:fe43:8f5b%veth2bbc0c6<0>
utopia:350844:350844 [1] NCCL INFO Using network Socket
utopia:350846:350846 [3] NCCL INFO Failed to open libibverbs.so[.1]
utopia:350846:350846 [3] NCCL INFO NET/Socket : Using [0]br0:192.168.178.17<0> [1]zt4mrrjgxa:10.11.1.17<0> [2]br-ba518786015b:172.18.0.1<0> [3]br-d3180bd8c5d7:172.27.0.1<0> [4]veth2bbc0c6:fe80::7cb4:48ff:fe43:8f5b%veth2bbc0c6<0>
utopia:350846:350846 [3] NCCL INFO Using network Socket
utopia:350845:350845 [2] NCCL INFO Failed to open libibverbs.so[.1]
utopia:350845:350845 [2] NCCL INFO NET/Socket : Using [0]br0:192.168.178.17<0> [1]zt4mrrjgxa:10.11.1.17<0> [2]br-ba518786015b:172.18.0.1<0> [3]br-d3180bd8c5d7:172.27.0.1<0> [4]veth2bbc0c6:fe80::7cb4:48ff:fe43:8f5b%veth2bbc0c6<0>
utopia:350845:350845 [2] NCCL INFO Using network Socket
utopia:350839:350843 [0] NCCL INFO comm 0x7f8798929f30 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x4d6b1616b168de55 - Init START
utopia:350844:350844 [1] NCCL INFO comm 0x558c95a2ff50 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 5000 commId 0x4d6b1616b168de55 - Init START
utopia:350846:350846 [3] NCCL INFO comm 0x563a7c661f20 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 9000 commId 0x4d6b1616b168de55 - Init START
utopia:350845:350845 [2] NCCL INFO comm 0x563171d63fe0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 8000 commId 0x4d6b1616b168de55 - Init START
utopia:350844:350844 [1] NCCL INFO NVLS multicast support is not available on dev 1

utopia:350844:350844 [1] graph/search.cc:960 NCCL WARN Could not find a path for pattern 4, falling back to simple order

utopia:350844:350844 [1] graph/search.cc:960 NCCL WARN Could not find a path for pattern 1, falling back to simple order
utopia:350846:350846 [3] NCCL INFO NVLS multicast support is not available on dev 3

utopia:350846:350846 [3] graph/search.cc:960 NCCL WARN Could not find a path for pattern 4, falling back to simple order

utopia:350846:350846 [3] graph/search.cc:960 NCCL WARN Could not find a path for pattern 1, falling back to simple order
utopia:350845:350845 [2] NCCL INFO NVLS multicast support is not available on dev 2

utopia:350845:350845 [2] graph/search.cc:960 NCCL WARN Could not find a path for pattern 4, falling back to simple order

utopia:350845:350845 [2] graph/search.cc:960 NCCL WARN Could not find a path for pattern 1, falling back to simple order
utopia:350839:350843 [0] NCCL INFO NVLS multicast support is not available on dev 0

utopia:350839:350843 [0] graph/search.cc:960 NCCL WARN Could not find a path for pattern 4, falling back to simple order

utopia:350839:350843 [0] graph/search.cc:960 NCCL WARN Could not find a path for pattern 1, falling back to simple order
utopia:350839:350843 [0] NCCL INFO Channel 00/02 :    0   1   2   3
utopia:350846:350846 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
utopia:350845:350845 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
utopia:350846:350846 [3] NCCL INFO P2P Chunksize set to 131072
utopia:350844:350844 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
utopia:350845:350845 [2] NCCL INFO P2P Chunksize set to 131072
utopia:350844:350844 [1] NCCL INFO P2P Chunksize set to 131072
utopia:350839:350843 [0] NCCL INFO Channel 01/02 :    0   1   2   3
utopia:350839:350843 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
utopia:350839:350843 [0] NCCL INFO P2P Chunksize set to 131072
utopia:350844:350844 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/IPC
utopia:350839:350843 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
utopia:350839:350843 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
utopia:350845:350845 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/IPC
utopia:350845:350845 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/IPC
utopia:350844:350844 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/IPC
utopia:350846:350846 [3] NCCL INFO Channel 00 : 3[3] -> 0[0] via SHM/direct/direct
utopia:350846:350846 [3] NCCL INFO Channel 01 : 3[3] -> 0[0] via SHM/direct/direct
utopia:350844:350844 [1] NCCL INFO Connected all rings
utopia:350845:350845 [2] NCCL INFO Connected all rings
utopia:350846:350846 [3] NCCL INFO Connected all rings
utopia:350846:350846 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/IPC
utopia:350839:350843 [0] NCCL INFO Connected all rings
utopia:350844:350844 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
utopia:350846:350846 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/IPC
utopia:350844:350844 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
utopia:350839:350843 [0] NCCL INFO Connected all trees
utopia:350839:350843 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
utopia:350839:350843 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
utopia:350845:350845 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/IPC
utopia:350845:350845 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/IPC
utopia:350846:350846 [3] NCCL INFO Connected all trees
utopia:350845:350845 [2] NCCL INFO Connected all trees
utopia:350845:350845 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
utopia:350845:350845 [2] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
utopia:350846:350846 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
utopia:350846:350846 [3] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
utopia:350844:350844 [1] NCCL INFO Connected all trees
utopia:350844:350844 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
utopia:350844:350844 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
Bus error (core dumped)

real    0m3.010s
user    0m1.543s
sys     0m1.187s
junrushao commented 7 months ago

OK, the bus error issue is actually pretty common. It happens because NCCL allocates a large chunk of shm (shared memory). You can explicitly specify the size, as is done here: https://github.com/mlc-ai/llm-perf-bench/blob/main/docker/bash.sh#L49

junrushao commented 7 months ago

Mildly increasing it to somewhere around 1 GB should work.
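For anyone launching the container by hand, a sketch of what that looks like (the image name is a placeholder; `--shm-size` mirrors the flag set in llm-perf-bench's `docker/bash.sh`):

    # --shm-size controls the /dev/shm allocation that NCCL relies on inside the container.
    docker run --gpus all --shm-size=1g -it <llm-perf-bench-image> /bin/bash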

chymian commented 7 months ago

I bumped it incrementally up to 24 G but still get the same error. I tried with newly created images yesterday and today.

BTW, there is a regression in Mistral (which I was running as a reference): see #1316.

ereish64 commented 7 months ago

I ran into a similar issue recently with NCCL preventing the disco worker from starting. It would result in bizarre errors like `name 'open' is not defined`, which is curious since that is a built-in Python function...

The first issue I had was that I had not built MLC-LLM relax with NCCL enabled in the cmake config; I had to install NCCL via apt (libnccl-dev and libnccl2).

However, when building relax, I received some new errors like `undefined reference to cudaGetDeviceProperties_v2` and `undefined reference to cudaLaunchKernelExC` in `ncclLaunchKernel`. As it turns out, those functions weren't added until CUDA 12.2.

@chymian I would try upgrading to CUDA 12.2 and, if that doesn't work, building relax from source.
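A rough sketch of the from-source build with NCCL enabled; the `USE_CUDA`/`USE_NCCL` flags and repo layout follow the standard TVM cmake flow and are assumptions here, not verified instructions:

    # Assumes the CUDA 12.2 toolkit plus libnccl-dev/libnccl2 are already installed.
    git clone --recursive https://github.com/mlc-ai/relax.git
    cd relax && mkdir -p build && cd build
    cp ../cmake/config.cmake .
    echo "set(USE_CUDA ON)" >> config.cmake   # CUDA backend
    echo "set(USE_NCCL ON)" >> config.cmake   # NCCL for the multi-GPU disco workers
    cmake .. && make -j"$(nproc)"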

junrushao commented 7 months ago

> However, when building relax, I received some new errors like `undefined reference to cudaGetDeviceProperties_v2` and `undefined reference to cudaLaunchKernelExC` in `ncclLaunchKernel`. As it turns out, those functions weren't added until CUDA 12.2.

@ereish64 I have run into this issue before as well. Usually it's because the NCCL version mismatches the CUDA runtime.

MasterJH5574 commented 3 months ago

Closing this issue due to inactivity. Please open new ones for anything to report or discuss.