vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ with tensor parallelism using 2 A10 GPUs #2395

Open PhaneendraGunda opened 6 months ago

PhaneendraGunda commented 6 months ago

Hi,

I was able to run the TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ model on 2 A10 GPUs on AWS SageMaker, using an ml.g5.12xlarge instance type. Here is the command I used:

python3 -m vllm.entrypoints.openai.api_server --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ --dtype float16 --max-model-len 1024 --gpu-memory-utilization 0.95 --tensor-parallel-size 2

But when I try to run the same model in an AWS EKS (Kubernetes) pod with 2 A10 GPUs, it fails with the following error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 21.99 GiB of which 45.00 MiB is free. Process 40317 has 21.92 GiB memory in use. Of the allocated memory 20.02 GiB is allocated by PyTorch, and 260.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
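
(Side note for anyone debugging a similar setup: a quick way to confirm how much memory each GPU in the pod actually has free before vLLM starts. This is only a minimal sketch and assumes a CUDA-enabled PyTorch install inside the pod.)

import torch

# Print free/total memory per visible GPU, in GiB, before launching vLLM.
for i in range(torch.cuda.device_count()):
    free_bytes, total_bytes = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free_bytes / 2**30:.2f} GiB free / {total_bytes / 2**30:.2f} GiB total")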

After debugging, I realized that the model only runs when forced into eager mode. Here is the working command:

python3 -m vllm.entrypoints.openai.api_server --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ --dtype float16 --max-model-len 1024 --gpu-memory-utilization 0.95 --tensor-parallel-size 2 --enforce-eager
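
(For reference, the same configuration through vLLM's offline Python API would look roughly like this; a sketch only, with the parameters mirroring the flags above.)

from vllm import LLM, SamplingParams

# Mirrors the CLI flags above; enforce_eager=True corresponds to --enforce-eager.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    dtype="float16",
    max_model_len=1024,
    gpu_memory_utilization=0.95,
    tensor_parallel_size=2,
    enforce_eager=True,
)

outputs = llm.generate(["Hello, how are you?"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)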

But I don't see any difference between the two setups. In one, the model runs in CUDA graph mode; in the other, it only runs in eager mode (which will reduce performance in a production setup).

Am I missing anything here?

arkohut commented 6 months ago

https://github.com/vllm-project/vllm/issues/2413

This may be helpful.

PhaneendraGunda commented 6 months ago

> #2413
> This may be helpful.

Thanks @arkohut. Yes, I was able to run with eager mode, but it affects the inference latency compared to CUDA graphs. I am getting the same CUDA error even when I try with 4 GPUs, which is strange.
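
(One thing that may be worth trying before settling for eager mode, purely as a sketch: leave a little more headroom for CUDA graph capture by lowering gpu_memory_utilization, and follow the allocator hint from the OOM message. The 0.90 and 128 MiB values below are arbitrary starting points, not a confirmed fix.)

import os

# Follow the hint from the OOM message: configure the caching allocator before
# torch/vllm are imported (the same variable can be set in the pod spec instead).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

from vllm import LLM

# enforce_eager is left at its default (False) so CUDA graphs stay enabled;
# gpu_memory_utilization is lowered slightly to leave headroom for graph capture.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    dtype="float16",
    max_model_len=1024,
    gpu_memory_utilization=0.90,
    tensor_parallel_size=2,
)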

Praj-23 commented 4 months ago

Hi @PhaneendraGunda, can you share the complete script you used for deploying the model? Thanks in advance.

Abhishekghosh1998 commented 4 months ago

> I am getting the same CUDA error even when I try with 4 GPUs, which is strange.

@PhaneendraGunda just for clarification, what is the "same CUDA error" here? Do you face the same OOM issue even when using 4 GPUs?