The model to consider.
I am trying to run the vLLM Docker image for gemma-2-27B-it, but I am getting an "architecture not recognized" error.
Error:
ValueError: The checkpoint you are trying to load has model type gemma2 but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
Entire command with logs:
docker run --runtime nvidia --gpus all -v ~/Vipul/nltk_data:/home/user/nltk_data --env "HUGGING_FACE_HUB_TOKEN=<HF_TOKEN>" -p 8514:8514 --ipc=host --env "CUDA_VISIBLE_DEVICES=1" --entrypoint "python3" vllm/vllm-openai:latest -m vllm.entrypoints.openai.api_server --model "mlx-community/gemma-2-9b-it-8bit" --gpu-memory-utilization 0.96 --port 8514 --trust-remote-code --tensor-parallel-size 1 --use-v2-block-manager
INFO 07-02 15:11:13 api_server.py:177] vLLM API server version 0.5.0.post1
INFO 07-02 15:11:13 api_server.py:178] args: Namespace(host=None, port=8514, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='mlx-community/gemma-2-9b-it-8bit', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.96, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 951, in from_pretrained
config_class = CONFIG_MAPPING[config_dict["model_type"]]
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 653, in __getitem__
raise KeyError(key)
KeyError: 'gemma2'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 196, in <module>
engine = AsyncLLMEngine.from_engine_args(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 371, in from_engine_args
engine_config = engine_args.create_engine_config()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/arg_utils.py", line 630, in create_engine_config
model_config = ModelConfig(
File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 137, in __init__
self.hf_config = get_config(self.model, trust_remote_code, revision,
File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/config.py", line 48, in get_config
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/config.py", line 33, in get_config
config = AutoConfig.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 953, in from_pretrained
raise ValueError(
ValueError: The checkpoint you are trying to load has model type gemma2 but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
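My reading of the two chained exceptions above, as a simplified sketch: the model type from the checkpoint's config is looked up in a registry of known architectures, the lookup fails with a KeyError, and that is re-raised as the ValueError. The names KNOWN_MODEL_TYPES and resolve_config_class below are hypothetical stand-ins for illustration, not the actual Transformers source.

```python
# Hypothetical stand-in for the registry of model types that an older
# Transformers release knows about (note: no "gemma2" entry).
KNOWN_MODEL_TYPES = {"gemma": "GemmaConfig", "llama": "LlamaConfig"}


def resolve_config_class(config_dict: dict) -> str:
    """Look up the config class name for a checkpoint's model_type."""
    model_type = config_dict["model_type"]
    try:
        return KNOWN_MODEL_TYPES[model_type]
    except KeyError:
        # Raising inside the except block is what produces the
        # "During handling of the above exception..." chain in the log.
        raise ValueError(
            f"The checkpoint you are trying to load has model type {model_type} "
            "but Transformers does not recognize this architecture."
        )


if __name__ == "__main__":
    print(resolve_config_class({"model_type": "gemma"}))
    try:
        resolve_config_class({"model_type": "gemma2"})
    except ValueError as err:
        # __context__ holds the original KeyError from the failed lookup.
        print(type(err.__context__).__name__, "->", err)
```

If this reading is right, the checkpoint itself is fine and the failure only reflects which model types the installed library has registered.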
The closest model vllm already supports.
No response
What's your difficulty of supporting the model you want?
The only difficulty is the gemma2 "not recognized" error. I think the vLLM Docker image needs to be rebuilt with an updated Transformers package and pushed to Docker Hub.
Could you please do that? Anyway, thanks for creating an awesome framework.
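To check whether the Transformers version inside the image is really the problem, a minimal sketch like the following could be run in the container. It assumes gemma2 support first appeared in Transformers 4.42.0, which is my understanding of the release notes and is not confirmed by the logs above. The helper names release_tuple and supports_gemma2 are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple

# Assumption: gemma2 was first registered in transformers 4.42.0.
MIN_TRANSFORMERS = "4.42.0"


def release_tuple(ver: str) -> Tuple[int, ...]:
    """Turn '4.41.2' or '4.42.0.dev0' into its leading integer parts."""
    parts = []
    for piece in ver.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric piece such as 'dev0'
        parts.append(int(digits))
    # Pad so that '4.42' compares equal to '4.42.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def supports_gemma2(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """True if the installed release is at least the assumed minimum."""
    return release_tuple(installed) >= release_tuple(minimum)


def installed_transformers_version() -> Optional[str]:
    """Return the installed transformers version, or None if absent."""
    try:
        return version("transformers")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_transformers_version()
    if found is None:
        print("transformers is not installed in this environment")
    else:
        print(f"transformers {found}: new enough for gemma2 = {supports_gemma2(found)}")
```

The comparison ignores pre-release suffixes, so a dev build of the minimum version is treated as new enough. A result of False would support the idea that the image needs a newer Transformers package.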