Open eldarkurtic opened 4 months ago
Most likely QLoRA is supported, whereas standard bnb quantization is not?
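For reference, vLLM's documentation describes a bitsandbytes path that is separate from QLoRA adapters. A minimal sketch, assuming the documented quantization="bitsandbytes" / load_format="bitsandbytes" options apply to this version (the model name is only an example):

```python
# Minimal sketch of vLLM's documented bitsandbytes path (not QLoRA).
# The same quantization/load_format pair is described both for in-flight
# quantization of an unquantized checkpoint and for loading a checkpoint
# that was already exported in bnb 4-bit (e.g. the unsloth *-bnb-4bit models).
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model, not from this issue
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)

out = llm.generate("Hello, my name is")
print(out[0].outputs[0].text)
```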
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
I have the same problem here: I cannot load an HF model quantized with bnb 4-bit.
INFO 11-09 07:13:23 api_server.py:528] vLLM API server version 0.6.3.post1
INFO 11-09 07:13:23 api_server.py:529] args: Namespace(subparser='serve', model_tag='unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=4096, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, dispatch_function=<function serve at 0x7274a195fa30>)
INFO 11-09 07:13:23 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/b4840bb4-04fa-4c2d-8049-69ceb268fc37 for IPC Path.
INFO 11-09 07:13:23 api_server.py:179] Started engine process with PID 36969
WARNING 11-09 07:13:26 config.py:321] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 11-09 07:13:26 arg_utils.py:1019] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
WARNING 11-09 07:13:28 config.py:321] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 11-09 07:13:28 arg_utils.py:1019] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-09 07:13:28 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit', speculative_config=None, tokenizer='unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
INFO 11-09 07:13:29 enc_dec_model_runner.py:141] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
INFO 11-09 07:13:29 selector.py:115] Using XFormers backend.
/home/ezel/miniconda3/envs/310/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/home/ezel/miniconda3/envs/310/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 11-09 07:13:29 model_runner.py:1056] Starting to load model unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit...
INFO 11-09 07:13:30 selector.py:115] Using XFormers backend.
INFO 11-09 07:13:30 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Process SpawnProcess-1:
Traceback (most recent call last):
File "/home/ezel/miniconda3/envs/310/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/home/ezel/miniconda3/envs/310/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/ezel/miniconda3/envs/310/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 390, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
File "/home/ezel/miniconda3/envs/310/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 139, in from_engine_args
return cls(
File "/home/ezel/miniconda3/envs/310/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
self.engine = LLMEngine(*args, **kwargs)
File "/home/ezel/miniconda3/envs/310/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 334, in __init__
self.model_executor = executor_class(
File "/home/ezel/miniconda3/envs/310/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
self._init_executor()
File "/home/ezel/miniconda3/envs/310/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 40, in _init_executor
self.driver_worker.load_model()
File "/home/ezel/miniconda3/envs/310/lib/python3.10/site-packages/vllm/worker/worker.py", line 183, in load_model
self.model_runner.load_model()
File "/home/ezel/miniconda3/envs/310/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1058, in load_model
self.model = get_model(model_config=self.model_config,
File "/home/ezel/miniconda3/envs/310/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
return loader.load_model(model_config=model_config,
File "/home/ezel/miniconda3/envs/310/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 402, in load_model
model.load_weights(self._get_all_weights(model_config, model))
File "/home/ezel/miniconda3/envs/310/lib/python3.10/site-packages/vllm/model_executor/models/mllama.py", line 1306, in load_weights
param = params_dict.pop(name)
KeyError: 'language_model.model.layers.0.mlp.down_proj.weight'
Traceback (most recent call last):
File "/home/ezel/miniconda3/envs/310/bin/vllm", line 8, in <module>
sys.exit(main())
File "/home/ezel/miniconda3/envs/310/lib/python3.10/site-packages/vllm/scripts.py", line 195, in main
args.dispatch_function(args)
File "/home/ezel/miniconda3/envs/310/lib/python3.10/site-packages/vllm/scripts.py", line 41, in serve
uvloop.run(run_server(args))
File "/home/ezel/miniconda3/envs/310/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/home/ezel/miniconda3/envs/310/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
File "/home/ezel/miniconda3/envs/310/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 552, in run_server
async with build_async_engine_client(args) as engine_client:
File "/home/ezel/miniconda3/envs/310/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/home/ezel/miniconda3/envs/310/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/home/ezel/miniconda3/envs/310/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/home/ezel/miniconda3/envs/310/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start
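The outer RuntimeError only reports that the engine process failed; the real failure is the KeyError above, which suggests the Llama 3.2 Vision (mllama) weight loader in this vLLM version does not map the pre-quantized bitsandbytes parameter names. To surface that error directly instead of through the multiprocessing frontend, here is a sketch that mirrors the serve arguments (the explicit load_format="bitsandbytes" is an addition taken from the bitsandbytes docs, not from my original command):

```python
# Reproduce the load failure outside the API server so the KeyError is raised
# in the main process rather than masked by "Engine process failed to start".
from vllm import LLM

llm = LLM(
    model="unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit",
    quantization="bitsandbytes",   # auto-detected in the log above; made explicit here
    load_format="bitsandbytes",    # documented for bnb checkpoints; whether mllama supports it is an assumption
    trust_remote_code=True,
    max_model_len=4096,
)
```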
Your current environment
🐛 Describe the bug
I am trying to evaluate a BNB model (https://huggingface.co/hugging-quants/Meta-Llama-3.1-405B-Instruct-BNB-NF4) through lm-evaluation-harness with vllm. This is the command I am running, and I am seeing the following error, which I think is related to vllm.
I am not sure why vllm looks for adapter_name_or_path when the model is just BNB-quantized to NF4.
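For reference, this is the kind of evaluation call I would expect to work, written as a sketch only: the task name and tensor_parallel_size are illustrative, and whether lm-evaluation-harness forwards quantization/load_format through model_args to vLLM unchanged is an assumption.

```python
# Hypothetical lm-evaluation-harness invocation of the pre-quantized BNB-NF4
# checkpoint through the vLLM backend. Flags follow vLLM's bitsandbytes docs.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=hugging-quants/Meta-Llama-3.1-405B-Instruct-BNB-NF4,"
        "quantization=bitsandbytes,load_format=bitsandbytes,"
        "tensor_parallel_size=8,max_model_len=4096"  # parallelism is hardware-dependent
    ),
    tasks=["gsm8k"],  # illustrative task choice
)
print(results["results"])
```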