Closed BobH233 closed 1 week ago
Hi @BobH233, thank you for reporting. It is using multiple GPUs. The error is that the model does not fit within the memory budget given by the default gpu_memory_utilization value:
```
TVMError: Check failed: (output_res.IsOk()) is false: Insufficient GPU memory error: The available single GPU memory is 20584.716 MB, which is less than the sum of model weight size (5977.188 MB) and temporary buffer size (15013.765 MB).
1. You can set a larger "gpu_memory_utilization" value.
2. If the model weight size is too large, please enable tensor parallelism by passing `--tensor-parallel-shards $NGPU` to `mlc_llm gen_config` or use quantization.
3. If the temporary buffer size is too large, please use a smaller `--prefill-chunk-size` in `mlc_llm gen_config`.
```
Could you try a larger gpu_memory_utilization? The default value is 0.85. For example, you can try:
```python
from mlc_llm import MLCEngine
from mlc_llm.serve import EngineConfig

# Raise gpu_memory_utilization above the 0.85 default so the model weights and
# temporary buffers fit within the per-GPU memory budget.
engine = MLCEngine(
    model="/mnt/bit/sjr/qwen-MLC",
    model_lib="/mnt/bit/sjr/qwen-MLC/libs/cuda.so",
    engine_config=EngineConfig(gpu_memory_utilization=0.88),
)
engine.chat.completions.create(
    messages=[{"role": "user", "content": "hello"}]
)
```
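If raising gpu_memory_utilization alone is not enough, the second and third suggestions in the error message can be applied when regenerating the config and recompiling the library. The commands below are only a sketch, not taken from this thread: the source model path, quantization, and prefill chunk size are placeholders, and any other flags (for example the conversation template) should be kept as in the user's own 2_generate_mlc_config.sh and 3_compile_model.sh.

```bash
# Sketch only: shard the weights across all 8 GPUs (suggestion 2 in the error
# message) and use a smaller prefill chunk size (suggestion 3), then rebuild
# the model library against the regenerated config. The quantization and
# prefill-chunk-size values are placeholders.
NGPU=8
mlc_llm gen_config /path/to/Liberated-Qwen1.5-14B \
    --quantization q4f16_1 \
    --tensor-parallel-shards $NGPU \
    --prefill-chunk-size 2048 \
    -o /mnt/bit/sjr/qwen-MLC
mlc_llm compile /mnt/bit/sjr/qwen-MLC/mlc-chat-config.json \
    --device cuda \
    -o /mnt/bit/sjr/qwen-MLC/libs/cuda.so
```

After regenerating and recompiling, the MLCEngine snippet above can point at the updated config and library unchanged; with the weights sharded across NGPU devices, the per-GPU weight size in the memory check shrinks accordingly.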
Thank you for your explanation, it works!
Thanks! Glad that it works out.
❓ General Questions
I have 8 × RTX 4090 GPUs on my device. I am trying to run the model Liberated-Qwen1.5-14B with mlc-llm, and this is how I convert the model and compile it:
1_convert_instruct_to_mlc.sh
2_generate_mlc_config.sh
3_compile_model.sh
chat.py
And as issue https://github.com/mlc-ai/mlc-llm/issues/2562 mentioned, MLC only uses cuda:0 and shows an OutOfMemory error.
Logs are as follows:
generate config log
compile log
chat.py log
I noticed that even though I set --overrides "tensor_parallel_shards=8", the chat.py log still says:
[2024-11-13 21:10:18] INFO auto_device.py:35: Using device: cuda:0
It only uses the cuda:0 device.