[Gemma 2 27B]: Update docker hub image to support gemma-2-27B-it

The model to consider.

I am trying to run the vLLM Docker image with gemma-2-27B-it, but I get an "architecture not recognized" error.

error: ValueError: The checkpoint you are trying to load has model type gemma2 but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
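A quick way to confirm the cause is to ask the transformers build inside the image whether it knows the `gemma2` model type. The sketch below is only a diagnostic, not part of vLLM: `model_type_supported` is a hypothetical helper name, and it assumes the `CONFIG_MAPPING` registry that the traceback in this report indexes supports the `in` operator.

```python
from typing import Container, Optional


def model_type_supported(model_type: str,
                         mapping: Optional[Container[str]] = None) -> bool:
    """Return True if `model_type` is a key of the given config registry.

    When no registry is passed, fall back to the one used by
    AutoConfig.from_pretrained (the lookup that raises KeyError in the logs).
    """
    if mapping is None:
        # Imported lazily so the helper can also be exercised without transformers.
        from transformers.models.auto.configuration_auto import CONFIG_MAPPING
        mapping = CONFIG_MAPPING
    return model_type in mapping
```

Calling `model_type_supported("gemma2")` with the image's `python3` should return `False` on a transformers build that predates Gemma 2, which would match the `KeyError: 'gemma2'` in the logs.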

Entire command with logs:

docker run --runtime nvidia --gpus all -v ~/Vipul/nltk_data:/home/user/nltk_data --env "HUGGING_FACE_HUB_TOKEN=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" -p 8514:8514 --ipc=host --env "CUDA_VISIBLE_DEVICES=1" --entrypoint "python3" vllm/vllm-openai:latest -m vllm.entrypoints.openai.api_server --model "mlx-community/gemma-2-9b-it-8bit" --gpu-memory-utilization 0.96 --port 8514 --trust-remote-code --tensor-parallel-size 1 --use-v2-block-manager

INFO 07-02 15:11:13 api_server.py:177] vLLM API server version 0.5.0.post1
INFO 07-02 15:11:13 api_server.py:178] args: Namespace(host=None, port=8514, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='mlx-community/gemma-2-9b-it-8bit', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.96, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 951, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 653, in __getitem__
    raise KeyError(key)
KeyError: 'gemma2'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 196, in <module>
    engine = AsyncLLMEngine.from_engine_args(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 371, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/arg_utils.py", line 630, in create_engine_config
    model_config = ModelConfig(
  File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 137, in __init__
    self.hf_config = get_config(self.model, trust_remote_code, revision,
  File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/config.py", line 48, in get_config
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/config.py", line 33, in get_config
    config = AutoConfig.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 953, in from_pretrained
    raise ValueError(
ValueError: The checkpoint you are trying to load has model type gemma2 but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

The closest model vllm already supports.

No response

What's your difficulty of supporting the model you want?

To fix the "gemma2 not recognized" error, I think the vLLM Docker image needs to be rebuilt with an updated transformers package and pushed to Docker Hub. Could you please do that? Anyway, thanks for creating an awesome framework.
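To check whether a rebuilt image actually ships a new enough transformers, something like the following could be run inside the container. It is a minimal sketch under one assumption that should be verified against the transformers release notes: that Gemma 2 was first recognized in version 4.42.0. The names `MIN_TRANSFORMERS`, `parse_version`, `meets_minimum`, and `installed_transformers_ok` are hypothetical and not part of vLLM or transformers.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# Assumption (verify against the release notes): first transformers release
# that recognizes the `gemma2` model type.
MIN_TRANSFORMERS = (4, 42, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Keep the leading numeric components, padded to three parts."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    parts = tuple(int(p) for p in match.group().split("."))
    # Pad so that "4.42" compares equal to (4, 42, 0).
    return parts + (0,) * (3 - len(parts))


def meets_minimum(version: str,
                  minimum: Tuple[int, ...] = MIN_TRANSFORMERS) -> bool:
    """Return True if `version` is at least `minimum`."""
    return parse_version(version) >= minimum


def installed_transformers_ok() -> Optional[bool]:
    """Check the installed transformers; None means it is not installed."""
    try:
        return meets_minimum(metadata.version("transformers"))
    except metadata.PackageNotFoundError:
        return None
```

If `installed_transformers_ok()` returns `False` in the published image, that would support the request to rebuild it with a newer transformers pin.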
