The error means that you don't have FlashInfer installed. Please follow the steps shared here.
Thanks, there was an error with the FlashInfer installation, so I reinstalled it using:
pip install flashinfer==0.1.3 -i https://flashinfer.ai/whl/cu121/torch2.3/
Now, I have a different error:
WARNING 08-06 12:10:50 utils.py:569] Gemma 2 uses sliding window attention for every odd layer, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096).
INFO 08-06 12:10:50 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d', speculative_config=None, tokenizer='/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d, use_v2_block_manager=True, enable_prefix_caching=False)
INFO 08-06 12:10:51 selector.py:80] Using Flashinfer backend.
INFO 08-06 12:10:51 model_runner.py:680] Starting to load model /modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d...
INFO 08-06 12:10:51 selector.py:80] Using Flashinfer backend.
2024-08-06 12:10:51 | ERROR | stderr | Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
2024-08-06 12:10:51 | ERROR | stderr |
2024-08-06 12:10:53 | ERROR | stderr | Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:01<00:01, 1.43s/it]
2024-08-06 12:10:53 | ERROR | stderr |
2024-08-06 12:10:53 | ERROR | stderr | Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.32it/s]
2024-08-06 12:10:53 | ERROR | stderr |
2024-08-06 12:10:53 | ERROR | stderr | Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.17it/s]
2024-08-06 12:10:53 | ERROR | stderr |
2024-08-06 12:10:53 | ERROR | stderr |
INFO 08-06 12:10:53 model_runner.py:692] Loading model weights took 4.9975 GB
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: Traceback (most recent call last):
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: File "/project/vllm_worker.py", line 236, in <module>
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: engine = AsyncLLMEngine.from_engine_args(engine_args)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: engine = cls(
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: self.engine = self._init_engine(*args, **kwargs)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: return engine_class(*args, **kwargs)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 265, in __init__
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: self._initialize_kv_caches()
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 364, in _initialize_kv_caches
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: self.model_executor.determine_num_available_blocks())
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 94, in determine_num_available_blocks
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: return self.driver_worker.determine_num_available_blocks()
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: return func(*args, **kwargs)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: self.model_runner.profile_run()
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: return func(*args, **kwargs)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 896, in profile_run
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: self.execute_model(model_input, kv_caches, intermediate_tensors)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: return func(*args, **kwargs)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1292, in execute_model
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: model_input.attn_metadata.begin_forward()
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flashinfer.py", line 146, in begin_forward
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: self.prefill_wrapper.begin_forward(
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: File "/usr/local/lib/python3.10/dist-packages/flashinfer/prefill.py", line 791, in begin_forward
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: self._wrapper.begin_forward(
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: RuntimeError: CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1) failed. 1 vs 257
UPDATE: Looks similar to this error https://github.com/vllm-project/vllm/issues/7070
This works. Please use pip install flashinfer==0.1.2 -i https://flashinfer.ai/whl/cu121/torch2.3
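If it helps, a quick way to confirm which FlashInfer build ends up active after reinstalling (a minimal check, not from the original thread):

from importlib.metadata import version

# Should report 0.1.2 once the pinned wheel from the index above is installed.
print(version("flashinfer"))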
Thanks @DarkLight1337
Is there more context on the change the OP made regarding the "hidden_act" versus "hidden_activation" reference? I am seeing the following error as well:
AttributeError: 'Gemma2Config' object has no attribute 'hidden_act'
@JerryGamble1 When the weights are downloaded, please change "hidden_activation" to "hidden_act" in the file "config.json". Usually the weights are present in the huggingface cache directory.
https://huggingface.co/google/shieldgemma-2b/blob/main/config.json
If you use the command huggingface-cli download model_name, it should download the model and then print the location where the weights are stored.
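If you prefer to script that edit, here is a rough sketch; the snapshot path is the one reported in the logs above, so substitute whatever path huggingface-cli download prints for your cache:

import json

# Path taken from the log output above; replace with your own snapshot directory.
config_path = (
    "/modelcache/models--google--shieldgemma-2b/snapshots/"
    "091a5128690e57ca6a30f6fbec4a766d8b77e48d/config.json"
)

with open(config_path) as f:
    config = json.load(f)

# Rename the key so code that looks for "hidden_act" can find the activation function.
if "hidden_activation" in config:
    config["hidden_act"] = config.pop("hidden_activation")

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)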
We've moved on from trying to get this to work on vLLM for now, so no need to respond, but just FYI:
After modifying the config file I was able to load the model into vLLM, but every request generates a bad request error with this log message:
INFO: 172.17.0.2:57780 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request ERROR 08-14 11:02:07 serving_chat.py:112] Error in applying chat template from request: 'guideline' is undefined
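For context on the 'guideline' error (my reading of the message, not something confirmed in this thread): ShieldGemma's chat template references a guideline variable that an ordinary chat-completion request does not supply. With transformers directly you can pass it as an extra keyword when rendering the template, since extra kwargs are forwarded to the Jinja template; the guideline text below is only an example:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/shieldgemma-2b")
messages = [{"role": "user", "content": "Example user message to screen."}]

# Extra kwargs to apply_chat_template are made available to the chat template,
# so the template's guideline reference can be filled in.
prompt = tok.apply_chat_template(
    messages,
    guideline="\"No Harassment\": The prompt shall not contain harassing content.",
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)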
Trying to run the ShieldGemma model.
The architecture is Gemma2ForCausalLM, which should already be supported. The config file specifies transformers version 4.42.4.
I have the following installed:
I also have Transformers 4.43.3 installed.
After checking the config file, it appears that the config specifies hidden_activation instead of hidden_act. After changing it manually in config.json, I get an error specifying that I should use the FlashInfer backend (VLLM_ATTENTION_BACKEND=FLASHINFER).
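Setting that backend can be done by exporting the variable in the shell before launching, or from Python before vLLM initializes; a minimal sketch (assuming the Hugging Face model id, though the local snapshot path from the logs would work the same way):

import os

# Must be set before vLLM selects an attention backend.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM

llm = LLM(model="google/shieldgemma-2b", trust_remote_code=True)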
After setting that, the following error occurs: