vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Misc]: Support for Shieldgemma model #7084

Closed: sudarshan-kamath closed this issue 2 months ago

sudarshan-kamath commented 2 months ago

I am trying to run the ShieldGemma model.

The architecture is Gemma2ForCausalLM, which should already be supported. The config file specifies transformers version 4.42.4.

I have the following installed:

pip list | grep "vllm\|flash"
flash-attn                        2.0.4
flashinfer                        0.1.3+cu124torch2.4
vllm                              0.5.3.post1
vllm-flash-attn                   2.5.9.post1

I also have Transformers 4.43.3 installed.

Checking the config file, it appears that it specifies hidden_activation instead of hidden_act. After changing this manually in config.json, I get an error telling me to use the FlashInfer backend.
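
For reference, a minimal sketch of that manual patch (the snapshot path is the one from the engine logs below; adjust it to your own Hugging Face cache location):

import json
from pathlib import Path

# Snapshot path taken from the engine logs below; adjust to your local cache.
config_path = Path(
    "/modelcache/models--google--shieldgemma-2b/snapshots/"
    "091a5128690e57ca6a30f6fbec4a766d8b77e48d/config.json"
)

cfg = json.loads(config_path.read_text())
# The ShieldGemma config ships "hidden_activation", while vLLM's Gemma2 code reads
# "hidden_act" (see the AttributeError further down this thread), so copy the value
# across. Renaming the key, as I did, should work as well.
if "hidden_act" not in cfg:
    cfg["hidden_act"] = cfg["hidden_activation"]
    config_path.write_text(json.dumps(cfg, indent=2))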

After setting VLLM_ATTENTION_BACKEND=FLASHINFER, the following error occurs:

INFO 08-02 17:46:35 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d', speculative_config=None, tokenizer='/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-02 17:46:36 selector.py:80] Using Flashinfer backend.
INFO 08-02 17:46:36 model_runner.py:680] Starting to load model /modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d...
INFO 08-02 17:46:36 selector.py:80] Using Flashinfer backend.
2024-08-02 17:46:37 | ERROR | stderr | Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
2024-08-02 17:46:37 | ERROR | stderr | 
2024-08-02 17:46:38 | ERROR | stderr | Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.41s/it]
2024-08-02 17:46:38 | ERROR | stderr | 
2024-08-02 17:46:38 | ERROR | stderr | Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.31it/s]
2024-08-02 17:46:38 | ERROR | stderr | 
2024-08-02 17:46:38 | ERROR | stderr | Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.17it/s]
2024-08-02 17:46:38 | ERROR | stderr | 
2024-08-02 17:46:38 | ERROR | stderr | 
INFO 08-02 17:46:38 model_runner.py:692] Loading model weights took 4.9975 GB
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: Traceback (most recent call last):
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/project/vllm_worker.py", line 236, in <module>
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     engine = AsyncLLMEngine.from_engine_args(engine_args)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     engine = cls(
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     self.engine = self._init_engine(*args, **kwargs)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     return engine_class(*args, **kwargs)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 265, in __init__
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     self._initialize_kv_caches()
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 364, in _initialize_kv_caches
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     self.model_executor.determine_num_available_blocks())
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 94, in determine_num_available_blocks
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     return self.driver_worker.determine_num_available_blocks()
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     return func(*args, **kwargs)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     self.model_runner.profile_run()
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     return func(*args, **kwargs)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 896, in profile_run
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     return func(*args, **kwargs)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1272, in execute_model
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     BatchDecodeWithPagedKVCacheWrapper(
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: TypeError: 'NoneType' object is not callable
DarkLight1337 commented 2 months ago

The error means that you don't have FlashInfer installed. Please follow the steps shared here.
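
A quick sanity check (just a sketch) is to import the package in the same environment that runs vLLM; the TypeError: 'NoneType' object is not callable above is consistent with that import failing, so the wrapper class is never assigned:

import flashinfer  # raises ImportError if no compatible wheel is installed for this Python/CUDA/torch combination
print("flashinfer loaded from", flashinfer.__file__)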

sudarshan-kamath commented 2 months ago

Thanks. There was a problem with the FlashInfer installation, so I reinstalled it with:

pip install flashinfer==0.1.3 -i https://flashinfer.ai/whl/cu121/torch2.3/

Now, I have a different error:

WARNING 08-06 12:10:50 utils.py:569] Gemma 2 uses sliding window attention for every odd layer, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096).
INFO 08-06 12:10:50 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d', speculative_config=None, tokenizer='/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d, use_v2_block_manager=True, enable_prefix_caching=False)
INFO 08-06 12:10:51 selector.py:80] Using Flashinfer backend.
INFO 08-06 12:10:51 model_runner.py:680] Starting to load model /modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d...
INFO 08-06 12:10:51 selector.py:80] Using Flashinfer backend.
2024-08-06 12:10:51 | ERROR | stderr | Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
2024-08-06 12:10:51 | ERROR | stderr | 
2024-08-06 12:10:53 | ERROR | stderr | Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.43s/it]
2024-08-06 12:10:53 | ERROR | stderr | 
2024-08-06 12:10:53 | ERROR | stderr | Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.32it/s]
2024-08-06 12:10:53 | ERROR | stderr | 
2024-08-06 12:10:53 | ERROR | stderr | Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.17it/s]
2024-08-06 12:10:53 | ERROR | stderr | 
2024-08-06 12:10:53 | ERROR | stderr | 
INFO 08-06 12:10:53 model_runner.py:692] Loading model weights took 4.9975 GB
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: Traceback (most recent call last):
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/project/vllm_worker.py", line 236, in <module>
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     engine = AsyncLLMEngine.from_engine_args(engine_args)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     engine = cls(
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     self.engine = self._init_engine(*args, **kwargs)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     return engine_class(*args, **kwargs)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 265, in __init__
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     self._initialize_kv_caches()
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 364, in _initialize_kv_caches
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     self.model_executor.determine_num_available_blocks())
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 94, in determine_num_available_blocks
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     return self.driver_worker.determine_num_available_blocks()
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     return func(*args, **kwargs)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     self.model_runner.profile_run()
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     return func(*args, **kwargs)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 896, in profile_run
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     return func(*args, **kwargs)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1292, in execute_model
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     model_input.attn_metadata.begin_forward()
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flashinfer.py", line 146, in begin_forward
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     self.prefill_wrapper.begin_forward(
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/flashinfer/prefill.py", line 791, in begin_forward
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     self._wrapper.begin_forward(
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: RuntimeError: CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1) failed. 1 vs 257

UPDATE: This looks similar to the error in https://github.com/vllm-project/vllm/issues/7070

This works. Please use:

pip install flashinfer==0.1.2 -i https://flashinfer.ai/whl/cu121/torch2.3

Thanks @DarkLight1337

sudarshan-kamath commented 2 months ago

Thanks @DarkLight1337

JerryGamble1 commented 2 months ago

Is there more context on the change the OP made regarding hidden_act versus hidden_activation? I am seeing the following error as well:

AttributeError: 'Gemma2Config' object has no attribute 'hidden_act'

sudarshan-kamath commented 2 months ago

@JerryGamble1 Once the weights are downloaded, change "hidden_activation" to "hidden_act" in config.json. The weights are usually stored in the Hugging Face cache directory.

https://huggingface.co/google/shieldgemma-2b/blob/main/config.json

If you use the command huggingface-cli download model_name, it should download the model and then print the location where the weights are stored.
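
If it helps, here is a small sketch for locating the cached snapshot programmatically (this assumes the huggingface_hub package; snapshot_download reuses files already in the cache and returns the local directory):

from huggingface_hub import snapshot_download

# Returns the local snapshot directory; files already in the cache are not re-downloaded.
path = snapshot_download("google/shieldgemma-2b")
print(path)  # the config.json to edit lives in this directory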

JerryGamble1 commented 2 months ago

We've moved on from trying to get this to work on vLLM for now, so no need to respond, but just FYI...

After modifying the config file I was able to load the model into vLLM, but every request generates a bad request error with this log message:

INFO: 172.17.0.2:57780 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request ERROR 08-14 11:02:07 serving_chat.py:112] Error in applying chat template from request: 'guideline' is undefined
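
For anyone who hits this later: the error suggests ShieldGemma's chat template expects a guideline variable that the chat completions endpoint does not supply, hence 'guideline' is undefined. A rough workaround sketch (assuming the rendering behavior shown on the model card) is to build the prompt with Transformers and send the raw text to the completions endpoint instead:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/shieldgemma-2b")

# Hypothetical policy text; replace with the guideline the model should enforce.
guideline = '"No Dangerous Content": The prompt shall not seek instructions for harming oneself or others.'
chat = [{"role": "user", "content": "How do I hotwire a car?"}]

# apply_chat_template forwards extra keyword arguments to the template renderer,
# which is where the template's `guideline` variable comes from.
prompt = tok.apply_chat_template(chat, guideline=guideline, tokenize=False)

# Send `prompt` to vLLM's /v1/completions endpoint instead of /v1/chat/completions.
print(prompt)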