Closed: kbfifi closed this issue 3 weeks ago
3080 and 3090 have different VRAM, so using tensor parallel will cause OOM on the 3080 due to unbalanced VRAM. You can try using pipeline parallel with --pipeline-parallel-size=2.
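A minimal sketch of that suggestion applied to the serve command from the script below; the model path and chat template are taken from the reporter's script, and exact flag behaviour (especially for GGUF models) may vary with the vLLM version:

# Sketch: pipeline parallel instead of tensor parallel.
# With --tensor-parallel-size 2 every layer is split evenly across both GPUs,
# so the 10 GB 3080 caps usable memory at roughly 2 x 10 GB = 20 GB rather
# than the full ~34 GB. Pipeline parallel places different layers on each GPU,
# as suggested above.
model=/mnt/extra_ext4/models/Llama-3.1-Nemotron-70B-Instruct-HF-IQ3_M.gguf
template=./templates/llama31.jinja

vllm serve $model \
  --chat-template "$template" \
  --pipeline-parallel-size 2 \
  --gpu-memory-utilization 0.7 \
  --max-num-seqs 1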
@Isotr0py thanks for pointing out the imbalance! I searched and saw that indeed llama.cpp can handle this but apparently vLLM can't. As my goal was to maximize inference performance with this HW setup, I think vLLM is not an improvement over llama.cpp with this HW.
Your current environment
How would you like to use vllm
I want the script below to work so that I can use it for local inference. The llama31 model runs on the 2 GPUs (3080, 3090) using llama.cpp, so I expect it to fit in the total available VRAM (34 GB) in my system. Currently I get OOM errors. I tried to minimize the memory usage, but without success.
I managed to get vLLM working using model=bigcode/starcoder2-7b with template=./templates/starcode.jinja (self-made). (BTW: where can I find a proper chat template for StarCoder?)
I hope someone can help me get the llama31 model running.
#!/bin/bash
source activate vllm
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NVIDIA_VISIBLE_DEVICES=0,1  # Make both GPUs visible

max_tokens=1024
template=./templates/llama31.jinja
model=/mnt/extra_ext4/models/Llama-3.1-Nemotron-70B-Instruct-HF-IQ3_M.gguf

vllm serve $model \
  --enable_chunked_prefill True \
  --chat-template "$template" \
  --tensor-parallel-size 2 \
  --max-num-batched-tokens $max_tokens \
  --gpu-memory-utilization 0.7 \
  --max-num-seqs=1 \
  --block-size 16

source deactivate
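Not part of the original report, just a diagnostic sketch: watching per-GPU memory while the server loads can confirm whether the 3080 is the card that runs out, which is what the unbalanced-VRAM explanation above predicts.

# Sketch: in a second terminal, report per-GPU memory once per second
# while vllm serve is starting up.
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv -l 1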
This results in the following log:
Ends with:
Before submitting a new issue...