vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Gemma 2 27B]: Update docker hub image to support gemma-2-27B-it #6071

Open vipulgote1999 opened 2 days ago

vipulgote1999 commented 2 days ago

The model to consider.

I am trying to run the vLLM Docker image for gemma-2-27B-it, but I am getting an "architecture not recognized" error.

Error:

```
ValueError: The checkpoint you are trying to load has model type `gemma2` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
```

Entire command with logs:

```
docker run --runtime nvidia --gpus all \
    -v ~/Vipul/nltk_data:/home/user/nltk_data \
    --env "HUGGING_FACE_HUB_TOKEN=<redacted>" \
    -p 8514:8514 --ipc=host \
    --env "CUDA_VISIBLE_DEVICES=1" \
    --entrypoint "python3" \
    vllm/vllm-openai:latest \
    -m vllm.entrypoints.openai.api_server \
    --model "mlx-community/gemma-2-9b-it-8bit" \
    --gpu-memory-utilization 0.96 \
    --port 8514 \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --use-v2-block-manager
```

```
INFO 07-02 15:11:13 api_server.py:177] vLLM API server version 0.5.0.post1
INFO 07-02 15:11:13 api_server.py:178] args: Namespace(host=None, port=8514, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='mlx-community/gemma-2-9b-it-8bit', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.96, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 951, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 653, in __getitem__
    raise KeyError(key)
KeyError: 'gemma2'
```
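For reference, the transformers version bundled in the image can be checked directly; this is a diagnostic sketch (the `latest` tag here is whatever Docker Hub currently serves, vLLM 0.5.0.post1 at the time of this report). Support for the `gemma2` model type was added in transformers 4.42.0, so any older version fails with exactly this `KeyError`:

```
# Print the transformers version baked into the published image
docker run --rm --entrypoint python3 vllm/vllm-openai:latest \
    -c "import transformers; print(transformers.__version__)"
```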

During handling of the above exception, another exception occurred:

```
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 196, in <module>
    engine = AsyncLLMEngine.from_engine_args(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 371, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/arg_utils.py", line 630, in create_engine_config
    model_config = ModelConfig(
  File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 137, in __init__
    self.hf_config = get_config(self.model, trust_remote_code, revision,
  File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/config.py", line 48, in get_config
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/config.py", line 33, in get_config
    config = AutoConfig.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 953, in from_pretrained
    raise ValueError(
ValueError: The checkpoint you are trying to load has model type `gemma2` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
```

The closest model vllm already supports.

No response

What's your difficulty of supporting the model you want?

To fix the "gemma2 not recognized" error, I think the vLLM Docker image needs to be rebuilt with an updated transformers package and pushed to Docker Hub. Could you please do that? Anyway, thanks for creating this awesome framework.
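Until a refreshed image is published, one possible local stopgap is to extend the current image with a Gemma 2-aware transformers. This is only a sketch: it clears the config-parsing `KeyError` shown above (which comes from transformers itself), but the vLLM version inside the base image must also ship its own Gemma 2 implementation for inference to actually work. The `vllm-openai-gemma2` tag is illustrative, not an official image:

```
# Hypothetical stopgap: rebuild on top of the published image with a
# transformers release that knows the gemma2 model type (>= 4.42.0).
docker build -t vllm-openai-gemma2 - <<'EOF'
FROM vllm/vllm-openai:latest
RUN pip install --no-cache-dir "transformers>=4.42.0"
EOF
```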

mgoin commented 2 days ago

Hi @vipulgote1999, this will be resolved with the next release this week, when the Docker image will be updated. See #5806.
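Once that release lands, refreshing the local image and rerunning the original command should be enough to pick up the fix (assuming the `latest` tag is rebuilt as part of the release):

```
# Pull the rebuilt image after the new vLLM release is published
docker pull vllm/vllm-openai:latest
```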

vipulgote1999 commented 1 day ago

Thanks @mgoin for resolving the query.