Closed BobH233 closed 1 week ago
Hi @BobH233, thank you for reporting. It is using multiple GPUs. The error is that the model does not fit within the memory budget given by the default gpu_memory_utilization value:
```
TVMError: Check failed: (output_res.IsOk()) is false: Insufficient GPU memory error: The available single GPU memory is 20584.716 MB, which is less than the sum of model weight size (5977.188 MB) and temporary buffer size (15013.765 MB).
1. You can set a larger "gpu_memory_utilization" value.
2. If the model weight size is too large, please enable tensor parallelism by passing `--tensor-parallel-shards $NGPU` to `mlc_llm gen_config` or use quantization.
3. If the temporary buffer size is too large, please use a smaller `--prefill-chunk-size` in `mlc_llm gen_config`.
```
Could you try a larger gpu_memory_utilization? The default value is 0.85. For example, you can try:
```python
from mlc_llm import MLCEngine
from mlc_llm.serve import EngineConfig

# Raise gpu_memory_utilization above the 0.85 default so the model weights and
# temporary buffers fit within the per-GPU memory budget.
engine = MLCEngine(
    model="/mnt/bit/sjr/qwen-MLC",
    model_lib="/mnt/bit/sjr/qwen-MLC/libs/cuda.so",
    engine_config=EngineConfig(gpu_memory_utilization=0.88),
)
engine.chat.completions.create(
    messages=[{"role": "user", "content": "hello"}]
)
```
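If raising gpu_memory_utilization alone is not enough, the second and third suggestions in the error message can be applied when regenerating the config and recompiling the library. The commands below are only a sketch, not taken from this thread: the source model path, quantization, and prefill chunk size are placeholders, and any other flags (for example the conversation template) should be kept as in the user's own 2_generate_mlc_config.sh and 3_compile_model.sh.

```bash
# Sketch only: shard the weights across all 8 GPUs (suggestion 2 in the error
# message) and use a smaller prefill chunk size (suggestion 3), then rebuild
# the model library against the regenerated config. The quantization and
# prefill-chunk-size values are placeholders.
NGPU=8
mlc_llm gen_config /path/to/Liberated-Qwen1.5-14B \
    --quantization q4f16_1 \
    --tensor-parallel-shards $NGPU \
    --prefill-chunk-size 2048 \
    -o /mnt/bit/sjr/qwen-MLC
mlc_llm compile /mnt/bit/sjr/qwen-MLC/mlc-chat-config.json \
    --device cuda \
    -o /mnt/bit/sjr/qwen-MLC/libs/cuda.so
```

After regenerating and recompiling, the MLCEngine snippet above can point at the updated config and library unchanged; with the weights sharded across NGPU devices, the per-GPU weight size in the memory check shrinks accordingly.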
Thank you for your explanation, it works!
Thanks! Glad that it works out.
❓ General Questions
I have 8 × RTX 4090 GPUs on my device. I am trying to run the model Liberated-Qwen1.5-14B with mlc-llm, and this is how I convert the model and compile it:
1_convert_instruct_to_mlc.sh
2_generate_mlc_config.sh
3_compile_model.sh
chat.py
And as issue https://github.com/mlc-ai/mlc-llm/issues/2562 mentioned, MLC only uses cuda:0 and shows an OutOfMemory error.
Logs are as follows:
generate config log
compile log
chat.py log
I noticed that even though I set --overrides "tensor_parallel_shards=8", the chat.py log still says:
[2024-11-13 21:10:18] INFO auto_device.py:35: Using device: cuda:0
It only uses the cuda:0 device.