vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Unable to run `gemma-2-2b-it` #7060

Open · Maximgitman opened this issue 3 months ago

Maximgitman commented 3 months ago

The model to consider.

https://huggingface.co/google/gemma-2-2b-it

The closest model vllm already supports.

https://huggingface.co/google/gemma-2-9b-it

What's your difficulty of supporting the model you want?

Faster inference

DarkLight1337 commented 3 months ago

Provided that the model architecture is the same as the 9B model, the 2B model should already be supported. Just try running vLLM with that model.

Maximgitman commented 3 months ago

Provided that the model architecture is the same as the 9B model, the 2B model should already be supported. Just try running vLLM with that model.

Thank you for the response. I tried it today and got the following error:

import os
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Setting the environment variable as suggested
os.environ['VLLM_ATTENTION_BACKEND'] = 'FLASHINFER'

llm = LLM(model="google/gemma-2-2b-it", enable_lora=True)
ValueError: Please use Flashinfer backend for models with logits_soft_cap (i.e., Gemma-2). Otherwise, the output might be wrong. Set Flashinfer backend by export VLLM_ATTENTION_BACKEND=FLASHINFER.

vLLM version 0.5.3

arunpatala commented 3 months ago

Same error for me using the latest Docker image.

DarkLight1337 commented 3 months ago


Can you show the full log?

Zbaoli commented 3 months ago

I tried using `export VLLM_ATTENTION_BACKEND=FLASHINFER` to solve this error, but another error occurred:

[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 155, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 441, in from_engine_args
[rank0]:     engine = cls(
[rank0]:              ^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 265, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 364, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 94, in determine_num_available_blocks
[rank0]:     return self.driver_worker.determine_num_available_blocks()
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 896, in profile_run
[rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1272, in execute_model
[rank0]:     BatchDecodeWithPagedKVCacheWrapper(
[rank0]: TypeError: 'NoneType' object is not callable

DarkLight1337 commented 3 months ago

[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1272, in execute_model
[rank0]:     BatchDecodeWithPagedKVCacheWrapper(
[rank0]: TypeError: 'NoneType' object is not callable

This means you don't have FlashInfer installed.

arunpatala commented 3 months ago

Thanks. I was able to run it by setting the environment variable in the docker command as follows:

volume_hf=$HF_HOME

docker run --runtime nvidia --gpus all \
    -v $volume_hf:/root/.cache/huggingface \
    -e VLLM_ATTENTION_BACKEND=FLASHINFER \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model "google/gemma-2-2b-it" \
    --max-model-len 4096 \
    --max-num-seqs 8 \
    --served-model-name model
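
Not part of the original comment, but a minimal sketch of querying the resulting OpenAI-compatible server once it is up (assuming it is reachable at localhost:8000; the prompt and sampling parameters are placeholders):

    import requests

    # The server above was started with --served-model-name model,
    # so "model" is the name the API expects.
    resp = requests.post(
        "http://localhost:8000/v1/completions",
        json={
            "model": "model",
            "prompt": "Why is the sky blue?",  # placeholder prompt
            "max_tokens": 64,
            "temperature": 0.8,
        },
    )
    print(resp.json()["choices"][0]["text"])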

Zbaoli commented 3 months ago

[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1272, in execute_model
[rank0]:     BatchDecodeWithPagedKVCacheWrapper(
[rank0]: TypeError: 'NoneType' object is not callable

This means you don't have FlashInfer installed.

Yes, maybe that's the reason, but I still hit this error even though I installed the flashinfer package with the `pip install flashinfer` command:

$ pip show flashinfer
Name: flashinfer
Version: 1.0.0
Summary: White Hat Researcher
Home-page: UNKNOWN
Author: Pastaga
Author-email: UNKNOWN
License: MIT
Location: /root/miniconda3/envs/vllm/lib/python3.11/site-packages
Requires:
Required-by:
$ python -c "import flashinfer"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'flashinfer'

DarkLight1337 commented 3 months ago

Make sure you're installing FlashInfer for the correct version of PyTorch/CUDA: https://github.com/flashinfer-ai/flashinfer?tab=readme-ov-file#installation

DarkLight1337 commented 3 months ago

Also, vLLM requires FlashInfer v0.0.8 (refer to the part about Gemma 2 in https://github.com/vllm-project/vllm/releases/tag/v0.5.1)
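
For example, with torch 2.3 and CUDA 12.1 (the combination mentioned later in this thread), the install would look roughly like:

    pip install flashinfer==0.0.8 -i https://flashinfer.ai/whl/cu121/torch2.3/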

hornikmatej commented 3 months ago

Also, vLLM requires FlashInfer v0.0.8 (refer to the part about Gemma 2 in https://github.com/vllm-project/vllm/releases/tag/v0.5.1)

Yes, you are right, it works with v0.0.8 but not with v0.1.3, which throws the following error:

[rank0]:   File "/home/ubuntu/clone-brain/.venv/lib/python3.11/site-packages/flashinfer/prefill.py", line 791, in begin_forward
[rank0]:     self._wrapper.begin_forward(
[rank0]: RuntimeError: CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1) failed. 1 vs 257

Using vLLM 0.5.3.post1

Maximgitman commented 3 months ago

Here are the steps that helped me run a fine-tuned gemma-2-2b-it with vLLM. Maybe it will help someone.

  1. Install and verify vLLM. Make sure the vLLM version is 0.5.3:

    !pip install vllm==0.5.3
    import vllm
    print(vllm.__version__)  # 0.5.3

  2. Install FlashInfer. Follow the instructions here to install FlashInfer, checking your torch version and CUDA compatibility first:

    import torch
    print(torch.__version__)  # Should print: 2.3.1+cu121
    print(torch.version.cuda) # Should print: 12.1

    Based on the documentation, vLLM requires FlashInfer v0.0.8 for Gemma 2 (refer to the part about Gemma 2 in the vLLM releases and the FlashInfer documentation):

    !pip install flashinfer==0.0.8 -i https://flashinfer.ai/whl/cu121/torch2.3/

  3. Set the vLLM attention backend environment variable:

    import os
    os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

  4. Test vLLM:

    import random

    from vllm import LLM, SamplingParams

    # Use the model ID or the path to your fine-tuned checkpoint
    llm = LLM(model="google/gemma-2-2b-it", trust_remote_code=True)

    sampling_params = SamplingParams(temperature=0.8, max_tokens=512, top_p=0.95, top_k=1)

    # test_data is your own evaluation dataset
    prompts = [test_data[random.randint(0, test_data.shape[0] - 1)]["text"]]

    outputs = llm.generate(prompts, sampling_params)

Expected output:

Processed prompts: 100%|██████████| 1/1 [00:01<00:00, 1.24s/it, est. speed input: 991.44 toks/s, output: 87.79 toks/s]

Hardware: NVIDIA A100-SXM4-40GB

Should it be faster?
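
Not in the original write-up, but a small sketch of reading the generated text back from `outputs` (assuming the standard `RequestOutput` objects that `llm.generate` returns):

    # Each RequestOutput carries the prompt and its generated completions
    for output in outputs:
        print("Prompt:   ", output.prompt)
        print("Generated:", output.outputs[0].text)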
github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!