PhaneendraGunda opened this issue 6 months ago
https://github.com/vllm-project/vllm/issues/2413
This may be helpful.
Thanks @arkohut. Yes, I was able to run with --enforce-eager, but it affects inference latency compared to CUDA graph mode. I am getting the same CUDA error even when I try with 4 GPUs, which is strange.
Hi @PhaneendraGunda, can you share the complete script you used to deploy the model? Thanks in advance.
I am getting the same CUDA error even when I am trying with 4 GPUs, which is strange.
@PhaneendraGunda just for clarification, what is the "same CUDA error" here? Do you face the same OOM issue even when using 4 GPUs?
Hi,
I was able to run the TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ model on 2 A10 GPUs on AWS SageMaker, using an ml.g5.12xlarge instance. Here is the command I used:
python3 -m vllm.entrypoints.openai.api_server --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ --dtype float16 --max-model-len 1024 --gpu-memory-utilization 0.95 --tensor-parallel-size 2
But when I try to run the same model in an AWS EKS (Kubernetes) pod with 2 A10 GPUs, it fails with the following error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 21.99 GiB of which 45.00 MiB is free. Process 40317 has 21.92 GiB memory in use. Of the allocated memory 20.02 GiB is allocated by PyTorch, and 260.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
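As an aside, the error message above suggests tuning `max_split_size_mb` to reduce fragmentation. A minimal sketch of applying that hint from Python, for anyone using the offline `vllm.LLM` entry point instead of the server CLI (the value 128 is illustrative, not taken from this thread):

```python
import os

# Allocator hint from the OOM message; it must be set before torch/CUDA
# initializes (i.e. before importing torch or vllm).
# The 128 MiB split size is an illustrative guess, not a recommended value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```

With the OpenAI-compatible server, the equivalent is `export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128` in the shell before launching `python3 -m vllm.entrypoints.openai.api_server`. Note this only mitigates fragmentation; it cannot help if the memory is genuinely exhausted.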
After debugging, I realized it only works when the model runs in eager mode. Here is the working command:
python3 -m vllm.entrypoints.openai.api_server --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ --dtype float16 --max-model-len 1024 --gpu-memory-utilization 0.95 --tensor-parallel-size 2 --enforce-eager
But I don't see any difference between the two environments. In one scenario the model runs in CUDA graph mode, while in the other it runs only in eager mode (which will reduce performance in a production setup).
Am I missing anything here?
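One possible explanation (a hedged sketch, not a confirmed diagnosis): CUDA graph capture allocates memory on top of the budget that --gpu-memory-utilization reserves for weights and KV cache, so at 0.95 only about 1.1 GiB of the A10's ~22 GiB is left over for capture and everything else on the device. The graph-capture figure below is an assumption for illustration:

```python
def headroom_gib(total_gib: float, gpu_memory_utilization: float,
                 graph_capture_gib: float) -> float:
    """Simplified accounting (not vLLM's exact bookkeeping): memory left
    outside vLLM's budget, minus what CUDA graph capture is assumed to
    need on top of that budget."""
    reserved = total_gib * (1.0 - gpu_memory_utilization)
    return reserved - graph_capture_gib

# A10 from the error report: ~22 GiB total; 1.5 GiB capture cost is assumed.
print(headroom_gib(22.0, 0.95, 1.5))  # negative -> capture likely OOMs
print(headroom_gib(22.0, 0.90, 1.5))  # positive -> capture fits
```

If this is what is happening, lowering --gpu-memory-utilization (e.g. to 0.90) may let CUDA graphs be captured without needing --enforce-eager. It is also worth running nvidia-smi inside the EKS pod before launching: the error shows only 45 MiB free on GPU 0, so memory held by another process or a previous crashed run could account for the difference from SageMaker.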