mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Question] Cannot serve model using multi GPU #3027

Open ro99 opened 3 days ago

ro99 commented 3 days ago

❓ General Questions

I am trying to serve a model using 4 GPUs but I keep getting the following error:

TVMError: Check failed: (output_res.IsOk()) is false: Insufficient GPU memory error: The available single GPU memory is 23669.327 MB, which is less than the sum of model weight size (17851.479 MB) and temporary buffer size (15275.630 MB).
1. You can set a larger "gpu_memory_utilization" value.
2. If the model weight size is too large, please enable tensor parallelism by passing `--tensor-parallel-shards $NGPU` to `mlc_llm gen_config` or use quantization.
3. If the temporary buffer size is too large, please use a smaller `--prefill-chunk-size` in `mlc_llm gen_config`.

The model is Qwen2.5-Coder-32B-Instruct, which is ~61 GB in size. I believe it should fit:

Thu Nov 14 17:56:38 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:03:00.0  On |                  N/A |
|  0%   46C    P8             39W /  280W |     172MiB /  24576MiB |      9%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  |   00000000:04:00.0 Off |                  N/A |
|  0%   38C    P8             29W /  280W |      15MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        On  |   00000000:83:00.0 Off |                  N/A |
|  0%   39C    P8             30W /  280W |      15MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        On  |   00000000:84:00.0 Off |                  N/A |
|  0%   34C    P8             24W /  280W |      15MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1334      G   /usr/lib/xorg/Xorg                            161MiB |
|    1   N/A  N/A      1334      G   /usr/lib/xorg/Xorg                              4MiB |
|    2   N/A  N/A      1334      G   /usr/lib/xorg/Xorg                              4MiB |
|    3   N/A  N/A      1334      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+

I have tried many different override combinations, but no luck. The last one I tried is the following:

mlc_llm serve ./qwen-MLC --model-lib ./qwen-MLC/libs/cuda.so --device cuda --overrides "max_num_sequence=32;max_total_seq_length=2048;gpu_memory_utilization=0.98;tensor_parallel_shards=4" --host 0.0.0.0 --port 5005

I am running this on Debian 12. The quantization used is q0f16, with the --tensor-parallel-shards 4 option.
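
For reference, a rough sketch of the gen_config / compile steps behind ./qwen-MLC and cuda.so (the model path and the --conv-template value below are illustrative placeholders, not the exact commands I ran):

# Sketch only: source model path and conv template are placeholders
mlc_llm gen_config ./Qwen2.5-Coder-32B-Instruct \
    --quantization q0f16 \
    --conv-template qwen2 \
    --tensor-parallel-shards 4 \
    -o ./qwen-MLC
mlc_llm compile ./qwen-MLC/mlc-chat-config.json --device cuda -o ./qwen-MLC/libs/cuda.so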

MasterJH5574 commented 3 days ago

Hi @ro99, thanks for asking. May I ask which prefill chunk size you used? As the error message suggests, the prefill chunk size might be too large. So one possibility is to try

--overrides "prefill_chunk_size=2048;tensor_parallel_shards=4"
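
Going by the numbers in the error, each GPU would need roughly 17851 MB (weight shard) + 15276 MB (temporary buffers) ≈ 33.1 GB against ~23.7 GB available, so shrinking the temporary buffer via the prefill chunk size is the natural first step. For example, the full serve command would then look something like this (paths copied from your original command):

mlc_llm serve ./qwen-MLC --model-lib ./qwen-MLC/libs/cuda.so --device cuda --overrides "prefill_chunk_size=2048;tensor_parallel_shards=4" --host 0.0.0.0 --port 5005

If the runtime override does not take effect, regenerating the config with --prefill-chunk-size 2048 passed to mlc_llm gen_config (option 3 in the error message) and recompiling the model library would be worth trying.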

ro99 commented 2 days ago

Hi @MasterJH5574, thank you for checking this. I tried different values for prefill_chunk_size, but no luck; same error.

I will test with other models and let you know how it goes.