vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Error when initializing LLMEngine with multi lora using vllm==0.3.0 - RuntimeError: CUDA error: no kernel image is available for execution on the device #2810

Open bjornjee opened 7 months ago

bjornjee commented 7 months ago

Got the error above while trying to run the following code:

from vllm import EngineArgs, LLMEngine

def initialize_engine() -> LLMEngine:
    """Initialize the LLMEngine."""
    # max_loras: controls the number of LoRAs that can be used in the same
    #   batch. Larger numbers will cause higher memory usage, as each LoRA
    #   slot requires its own preallocated tensor.
    # max_lora_rank: controls the maximum supported rank of all LoRAs. Larger
    #   numbers will cause higher memory usage. If you know that all LoRAs will
    #   use the same rank, it is recommended to set this as low as possible.
    # max_cpu_loras: controls the size of the CPU LoRA cache.
    engine_args = EngineArgs(model="meta-llama/Llama-2-7b-chat-hf",
                             revision="c1b0db933684edbfe29a06fa47eb19cc48025e93",
                             enable_lora=True,
                             max_loras=8,
                             max_lora_rank=8,
                             max_cpu_loras=8,
                             max_num_seqs=256)
    return LLMEngine.from_engine_args(engine_args)

if __name__ == '__main__':
    engine = initialize_engine()
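
For reference, once the engine initializes, I submit prompts with per-request LoRA adapters roughly like the sketch below (the adapter name, integer ID, and local path are placeholders, and the exact add_request signature may differ across vLLM versions):

from vllm import SamplingParams
from vllm.lora.request import LoRARequest

def run_with_lora(engine: LLMEngine) -> None:
    """Sketch: submit one prompt that uses a (placeholder) local LoRA adapter."""
    sampling_params = SamplingParams(temperature=0.0, max_tokens=64)
    # lora_int_id must be unique per adapter; the path is a placeholder.
    lora_request = LoRARequest("sql-lora", 1, "/path/to/sql-lora-adapter")
    engine.add_request("request-0",
                       "Write a SQL query that counts users per country.",
                       sampling_params,
                       lora_request=lora_request)
    # Drain the engine and print finished outputs.
    while engine.has_unfinished_requests():
        for request_output in engine.step():
            if request_output.finished:
                print(request_output.outputs[0].text)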

Machine specification

uname -vmpo
#56~20.04.1-Ubuntu SMP Tue Nov 28 15:43:31 UTC 2023 x86_64 x86_64 GNU/Linux

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

nvidia-smi
Thu Feb  8 02:14:14 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
| N/A   33C    P8              14W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

pip --version
pip 21.2.4 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)

python3 --version
Python 3.10.6

Steps to recreate

  1. pip install vllm==0.3.0
  2. pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121 --force-reinstall
  3. pip install xformers==0.0.23.post1 -f https://download.pytorch.org/whl/cu121 --force-reinstall

Sanity checks

python3 -c "import torch; print(f'is_avail: {torch.cuda.is_available()}, version.cuda: {torch.version.cuda}');"
is_avail: True, version.cuda: 12.1
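
It may also be worth confirming the GPU compute capability (a T4 should report (7, 5)), since the prebuilt wheels only ship kernels for a fixed set of architectures:

python3 -c "import torch; print(f'compute capability: {torch.cuda.get_device_capability()}');"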

Expected behavior

To be able to use vllm.LLMEngine

bjornjee commented 7 months ago

Update: resolved the issue above by installing vLLM from source for the specific compute capability of my GPU architecture. The GPU architecture to compute capability matrix can be found here: https://en.wikipedia.org/wiki/CUDA#GPUs_supported

arr=( 7.5 )
# Join with ';' (TORCH_CUDA_ARCH_LIST expects space- or semicolon-separated values).
TORCH_CUDA_ARCH_LIST=$( IFS=';'; printf '%s' "${arr[*]}" )
export TORCH_CUDA_ARCH_LIST
cd vllm
pip install .  # picks up the exported TORCH_CUDA_ARCH_LIST

However, multi-LoRA support in vLLM uses the Punica kernels to apply the LoRA weights, and Punica itself requires a GPU with compute capability >= 8.0.

Maybe we should support multi-LoRA without Punica, or at least update the docs?
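
In the meantime, a rough way to fail fast on unsupported GPUs (just a plain torch check mirroring the Punica requirement above, not a vLLM API):

import torch

# Punica kernels (used by vLLM's multi-LoRA path) require compute capability >= 8.0,
# so bail out early on older GPUs such as the T4 (7.5).
major, minor = torch.cuda.get_device_capability()
if (major, minor) < (8, 0):
    raise RuntimeError(
        f"multi-LoRA needs compute capability >= 8.0, got {major}.{minor}")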

asifrkhan commented 6 months ago

@bjornjee were you able to get multi-LoRA working on the T4? (Your comment above makes it sound like the Punica dependency doesn't work on the T4, but I may be misinterpreting.)

bjornjee commented 5 months ago

I wasn't able to get it working on the T4.