Maximgitman opened this issue 3 months ago. Status: Open.
Provided that the model architecture is the same as the 9B model, the 2B model should already be supported. Just try running vLLM with that model.
Thank you for the response. I tried it today and got the following error:
import os
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
# Setting the environment variable as suggested
os.environ['VLLM_ATTENTION_BACKEND'] = 'FLASHINFER'
llm = LLM(model="google/gemma-2-2b-it", enable_lora=True)
ValueError: Please use Flashinfer backend for models with logits_soft_cap (i.e., Gemma-2). Otherwise, the output might be wrong. Set Flashinfer backend by export VLLM_ATTENTION_BACKEND=FLASHINFER.
vLLM version 0.5.3
Same error for me using the latest Docker image.
Can you show the full log?
I tried using export VLLM_ATTENTION_BACKEND=FLASHINFER to solve this error, but another error occurred:
[rank0]: File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 155, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 441, in from_engine_args
[rank0]: engine = cls(
[rank0]: ^^^^
[rank0]: File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 265, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 364, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks())
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 94, in determine_num_available_blocks
[rank0]: return self.driver_worker.determine_num_available_blocks()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
[rank0]: self.model_runner.profile_run()
[rank0]: File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 896, in profile_run
[rank0]: self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]: File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1272, in execute_model
[rank0]: BatchDecodeWithPagedKVCacheWrapper(
[rank0]: TypeError: 'NoneType' object is not callable
This means you don't have FlashInfer installed.
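If it helps, here is a quick, standard-library-only way to confirm whether FlashInfer is importable; run it with the same Python that launches vLLM (a minimal check, nothing vLLM-specific):
import importlib.util
# FlashInfer must be importable in the environment vLLM runs in.
if importlib.util.find_spec("flashinfer") is None:
    print("flashinfer is not importable; install a wheel matching your torch/CUDA build")
else:
    print("flashinfer is importable")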
Thanks. I was able to run it by setting the environment variable in the docker command as follows:
volume_hf=$HF_HOME
docker run --runtime nvidia --gpus all \
  -v $volume_hf:/root/.cache/huggingface \
  -e VLLM_ATTENTION_BACKEND=FLASHINFER \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model "google/gemma-2-2b-it" \
  --max-model-len 4096 \
  --max-num-seqs 8 \
  --served-model-name model
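For reference, a minimal sketch of querying the server started above, assuming it is reachable at localhost:8000 and that the requests package is available; the model is addressed by the served name "model":
import requests
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "model",  # matches --served-model-name above
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
    },
)
# The OpenAI-compatible server returns a list of choices, each with a chat message.
print(response.json()["choices"][0]["message"]["content"])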
Yes, maybe that's the reason, but I encountered this error even though I installed the flashinfer package using the pip install flashinfer command:
>>> pip show flashinfer
Name: flashinfer
Version: 1.0.0
Summary: White Hat Researcher
Home-page: UNKNOWN
Author: Pastaga
Author-email: UNKNOWN
License: MIT
Location: /root/miniconda3/envs/vllm/lib/python3.11/site-packages
Requires:
Required-by:
>>> python -c "import flashinfer"
Traceback (most recent call last):
File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'flashinfer'
Make sure you're installing FlashInfer for the correct version of PyTorch/CUDA: https://github.com/flashinfer-ai/flashinfer?tab=readme-ov-file#installation
Also, vLLM requires FlashInfer v0.0.8 (refer to the part about Gemma 2 in https://github.com/vllm-project/vllm/releases/tag/v0.5.1)
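As a sanity check (assuming the wheel registers itself under the distribution name flashinfer), you can query the installed version with just the standard library; the unrelated PyPI package shown above reports 1.0.0, while the real wheel should report the version you installed:
from importlib import metadata  # Python 3.8+ standard library
try:
    # Expect this to match the wheel you installed (e.g. 0.0.8).
    print("flashinfer version:", metadata.version("flashinfer"))
except metadata.PackageNotFoundError:
    print("No distribution named 'flashinfer' is installed in this environment.")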
Yes, you are right: it works with v0.0.8 but not with v0.1.3, which throws the following error:
[rank0]: File "/home/ubuntu/clone-brain/.venv/lib/python3.11/site-packages/flashinfer/prefill.py", line 791, in begin_forward
[rank0]: self._wrapper.begin_forward(
[rank0]: RuntimeError: CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1) failed. 1 vs 257
Using vLLM 0.5.3.post1
Here are the steps that helped me run fine-tuned gemma-2-2b-it with vLLM. Maybe it will help someone.
Install and Verify vLLM
Make sure the vLLM version is 0.5.3.
!pip install vllm==0.5.3
import vllm
print(vllm.__version__)
0.5.3
Install FlashInfer
Follow the instructions here to install FlashInfer. Check your torch version and CUDA compatibility:
import torch
print(torch.__version__) # Should print: 2.3.1+cu121
print(torch.version.cuda) # Should print: 12.1
Based on the documentation, vLLM requires FlashInfer v0.0.8 for Gemma 2 (refer to the part about Gemma 2 in the vLLM releases and the FlashInfer documentation).
!pip install flashinfer==0.0.8 -i https://flashinfer.ai/whl/cu121/torch2.3/
Update VLLM Backend Variable in Environment
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
Test vLLM
import random  # used below to pick a random test prompt
from vllm import LLM, SamplingParams
llm = LLM(model="google/gemma-2-2b-it")  # replace with the path or HF ID of your fine-tuned model
sampling_params = SamplingParams(
    temperature=0.8,
    max_tokens=512,
    top_p=0.95,
    top_k=1,
)
# test_data is the evaluation dataset (not shown here); pick one random example as a prompt.
prompts = [
    test_data[random.randint(0, test_data.shape[0] - 1)]["text"],
]
outputs = llm.generate(prompts, sampling_params)
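To inspect the results, a short follow-up: each RequestOutput returned by llm.generate() carries the prompt and its generated completions.
for output in outputs:
    print("Prompt:", output.prompt)
    for completion in output.outputs:
        print("Completion:", completion.text)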
Hardware: NVIDIA A100-SXM4-40GB
Should it be faster?
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
The model to consider: https://huggingface.co/google/gemma-2-2b-it
The closest model vLLM already supports: https://huggingface.co/google/gemma-2-9b-it
What's your difficulty of supporting the model you want? Faster inference.