bjornjee opened 7 months ago
Update: resolved the issue above by installing vLLM from source for the specific compute capability of my GPU architecture. The GPU architecture / compute capability matrix can be found here: https://en.wikipedia.org/wiki/CUDA#GPUs_supported
```bash
# Build vLLM from source for the target compute capability
# (7.5 = Turing, e.g. Tesla T4).
arr=( 7.5 )
# Join with ';' -- TORCH_CUDA_ARCH_LIST is semicolon- or space-separated
# (':' is not a recognized separator).
TORCH_CUDA_ARCH_LIST=$( IFS=';'; printf '%s' "${arr[*]}" )
export TORCH_CUDA_ARCH_LIST
cd vllm
pip install .  # the exported TORCH_CUDA_ARCH_LIST is picked up by the build
```
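(Aside: the array/join form is there so several architectures can be targeted in one build; PyTorch's extension tooling accepts a semicolon- or space-separated `TORCH_CUDA_ARCH_LIST`, e.g. `7.5;8.0`.)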
However, multi-LoRA support in vLLM uses Punica to apply the LoRA weights, and Punica itself requires a GPU with compute capability >= 8.0.
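For reference, a quick way to check whether a given GPU clears that bar (a sketch using PyTorch's public API):

```python
# Check the current GPU against Punica's compute-capability floor (8.0).
import torch

major, minor = torch.cuda.get_device_capability()
print(f"compute capability: {major}.{minor}")
if (major, minor) < (8, 0):
    print("Punica LoRA kernels unsupported here (e.g. a T4 reports 7.5)")
```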
Maybe we should support multi-LoRA without Punica and update the docs?
@bjornjee were you able to get multi-LoRA working on the T4? (Your comment above makes it sound like the Punica dependency doesn't work on the T4, but I may be misinterpreting.)
I wasn't able to get it working on the T4.
Got the error above while trying to run this code
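(The exact snippet isn't included above; the following is a hypothetical reconstruction of the kind of multi-LoRA call that hits this, based on vLLM 0.3.0's documented `LLM`/`LoRARequest` API. The model name and adapter path are placeholders, not from the report.)

```python
# Hypothetical repro sketch -- model and adapter path are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# enable_lora=True activates the multi-LoRA (Punica) code path.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
outputs = llm.generate(
    ["Write a SQL query listing all users."],
    SamplingParams(max_tokens=64),
    lora_request=LoRARequest("sql_adapter", 1, "/path/to/sql_lora"),
)
print(outputs[0].outputs[0].text)
```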
Machine specification
Steps to recreate
```bash
pip install vllm==0.3.0
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121 --force-reinstall
pip install xformers==0.0.23.post1 -f https://download.pytorch.org/whl/cu121 --force-reinstall
```
Sanity checks
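(The original checks aren't reproduced; below is a minimal sketch of what one might verify, assuming the pinned versions from the steps above.)

```python
# Minimal sanity checks: pinned versions installed and CUDA visible.
import torch
import torchvision
import xformers
import vllm

print(vllm.__version__)         # expect 0.3.0
print(torch.__version__)        # expect 2.1.2+cu121
print(torchvision.__version__)  # expect 0.16.2+cu121
print(xformers.__version__)     # expect 0.0.23.post1
print(torch.cuda.is_available(), torch.version.cuda)  # expect True 12.1
```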
Expected behavior
To be able to use `vllm.LLMEngine` with multi-LoRA on a T4.