Closed XCYXHL closed 1 month ago
If I remember correctly, BNB only supports enforce_eager = True
I am getting the same error after pip installing the latest vllm 0.5.4 in order to solve 5753: openai.InternalServerError: modal-http: internal server error: status Failure: ValueError('BitAndBytes with enforce_eager = False is not supported yet.')
Runner failed with exception: ValueError('BitAndBytes with enforce_eager = False is not supported yet.')
WARNING 08-08 08:27:54 config.py:254] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
Traceback (most recent call last):
File "/pkg/modal/_container_io_manager.py", line 462, in handle_user_exception
yield
File "/pkg/modal/_container_entrypoint.py", line 786, in main
finalized_functions = service.get_finalized_functions(container_args.function_def, container_io_manager)
File "/pkg/modal/_container_entrypoint.py", line 152, in get_finalized_functions
web_callable = construct_webhook_callable(
File "/pkg/modal/_container_entrypoint.py", line 68, in construct_webhook_callable
return asgi_app_wrapper(user_defined_callable(), container_io_manager)
File "/root/vllm_inference.py", line 149, in serve
engine = AsyncLLMEngine.from_engine_args(
File "/usr/local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 462, in from_engine_args
engine_config = engine_args.create_engine_config()
File "/usr/local/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 862, in create_engine_config
return EngineConfig(
File "<string>", line 15, in __init__
File "/usr/local/lib/python3.10/site-packages/vllm/config.py", line 1641, in __post_init__
self.model_config.verify_with_parallel_config(self.parallel_config)
File "/usr/local/lib/python3.10/site-packages/vllm/config.py", line 293, in verify_with_parallel_config
raise ValueError(
ValueError: BitAndBytes with enforce_eager = False is not supported yet.
For now, set enforce_eager=True as a workaround.
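A minimal sketch of that workaround for the server launch used in this thread. `build_server_cmd` is a hypothetical helper introduced only for illustration; `--enforce-eager` is the command-line form of `enforce_eager=True`, and the model path and served name are the ones from the report.

```python
import shlex
import sys


def build_server_cmd(model_path, served_name, enforce_eager=True):
    """Build the argv for the vLLM OpenAI-compatible server (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_path,
        "--served-model-name", served_name,
    ]
    if enforce_eager:
        # Eager mode skips CUDA graph capture, which is the mode the
        # bitsandbytes check in 0.5.4 insists on.
        cmd.append("--enforce-eager")
    return cmd


cmd = build_server_cmd(
    "/home/workspace/Mistral-Large-Instruct-2407-bnb-4bit",
    "Mistral-Large-Instruct-2407-bnb-4bit",
)
# Print the command so it can be inspected or pasted into a shell.
print(shlex.join(cmd))
```

When building the engine from Python instead, the equivalent is passing `enforce_eager=True` in the engine arguments.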
@jeejeelee maybe we should just warn if it is False and always set enforce_eager=True
Ok, I'm handling it now.
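The suggestion above (warn and force eager mode instead of raising) could look roughly like the sketch below. `ModelConfigSketch` and `verify_quantization` are hypothetical names used only for illustration; this is not vLLM's actual config class or the eventual fix.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelConfigSketch:
    """Hypothetical stand-in for a model config; not vLLM's actual class."""
    quantization: Optional[str] = None
    enforce_eager: bool = False

    def verify_quantization(self) -> None:
        # Instead of raising ValueError, warn and switch to eager mode.
        if self.quantization == "bitsandbytes" and not self.enforce_eager:
            warnings.warn(
                "bitsandbytes does not support enforce_eager=False yet; "
                "falling back to enforce_eager=True."
            )
            self.enforce_eager = True


cfg = ModelConfigSketch(quantization="bitsandbytes")
cfg.verify_quantization()
print(cfg.enforce_eager)
```

With this shape, users who never set the flag get a working engine plus a visible warning, while non-bitsandbytes configs are left untouched.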
AttributeError: Model BitsAndBytesModelLoader does not support BitsAndBytes quantization yet.
Have you solved this problem? @lonngxiang
Your current environment
vLLM API server version 0.5.4, Python 3.11.5. When I use vLLM to deploy the model Mistral-Large-Instruct-2407-bnb-4bit, something goes wrong.
The model is from https://www.modelscope.cn/models/LLM-Research/Mistral-Large-Instruct-2407-bnb-4bit/files and files such as the tokenizer are from https://huggingface.co/mistralai/Mistral-Large-Instruct-2407/blob/main/config.json
🐛 Describe the bug
/home/workspace/Mistral-Large-Instruct-2407-bnb-4bit$ CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server --model /home/workspace/Mistral-Large-Instruct-2407-bnb-4bit --served-model-name Mistral-Large-Instruct-2407-bnb-4bit --gpu-memory-utilization .6 --host 192.168.0.109 --port 8006
INFO 08-08 14:21:38 api_server.py:339] vLLM API server version 0.5.4
INFO 08-08 14:21:38 api_server.py:340] args: Namespace(host='192.168.0.109', port=8006, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=[''], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='/home/bao/workspace/Mistral-Large-Instruct-2407-bnb-4bit', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.6, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Mistral-Large-Instruct-2407-bnb-4bit'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 08-08 14:21:38 config.py:1454] Casting torch.bfloat16 to torch.float16.
WARNING 08-08 14:21:38 config.py:254] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 08-08 14:21:38 config.py:254] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
Process Process-1:
Traceback (most recent call last):
File "/raid/anaconda3/envs/vllm-041/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/raid/anaconda3/envs/vllm-041/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/dt_miaorh/.local/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, port)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/dt_miaorh/.local/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 25, in __init__
self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/dt_miaorh/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 462, in from_engine_args
engine_config = engine_args.create_engine_config()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/dt_miaorh/.local/lib/python3.11/site-packages/vllm/engine/arg_utils.py", line 862, in create_engine_config
return EngineConfig(
^^^^^^^^^^^^^
File "<string>", line 15, in __init__
File "/home/dt_miaorh/.local/lib/python3.11/site-packages/vllm/config.py", line 1641, in __post_init__
self.model_config.verify_with_parallel_config(self.parallel_config)
File "/home/dt_miaorh/.local/lib/python3.11/site-packages/vllm/config.py", line 293, in verify_with_parallel_config
raise ValueError(
ValueError: BitAndBytes with enforce_eager = False is not supported yet.