vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: ValueError: BitAndBytes with enforce_eager = False is not supported yet. #7294

Closed XCYXHL closed 1 month ago

XCYXHL commented 1 month ago

Your current environment

vLLM API server version 0.5.4, Python 3.11.5. When I use vLLM to deploy the model Mistral-Large-Instruct-2407-bnb-4bit, the following error occurs.

The model is from https://www.modelscope.cn/models/LLM-Research/Mistral-Large-Instruct-2407-bnb-4bit/files and the accompanying files, such as the tokenizer, are from https://huggingface.co/mistralai/Mistral-Large-Instruct-2407/blob/main/config.json.

🐛 Describe the bug

/home/workspace/Mistral-Large-Instruct-2407-bnb-4bit$ CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server --model /home/workspace/Mistral-Large-Instruct-2407-bnb-4bit --served-model-name Mistral-Large-Instruct-2407-bnb-4bit --gpu-memory-utilization .6 --host 192.168.0.109 --port 8006
INFO 08-08 14:21:38 api_server.py:339] vLLM API server version 0.5.4
INFO 08-08 14:21:38 api_server.py:340] args: Namespace(host='192.168.0.109', port=8006, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=[''], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='/home/bao/workspace/Mistral-Large-Instruct-2407-bnb-4bit', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.6, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Mistral-Large-Instruct-2407-bnb-4bit'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 08-08 14:21:38 config.py:1454] Casting torch.bfloat16 to torch.float16.
WARNING 08-08 14:21:38 config.py:254] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 08-08 14:21:38 config.py:254] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
Process Process-1:
Traceback (most recent call last):
  File "/raid/anaconda3/envs/vllm-041/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/raid/anaconda3/envs/vllm-041/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/dt_miaorh/.local/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, port)
  File "/home/dt_miaorh/.local/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 25, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
  File "/home/dt_miaorh/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 462, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/home/dt_miaorh/.local/lib/python3.11/site-packages/vllm/engine/arg_utils.py", line 862, in create_engine_config
    return EngineConfig(
  File "<string>", line 15, in __init__
  File "/home/dt_miaorh/.local/lib/python3.11/site-packages/vllm/config.py", line 1641, in __post_init__
    self.model_config.verify_with_parallel_config(self.parallel_config)
  File "/home/dt_miaorh/.local/lib/python3.11/site-packages/vllm/config.py", line 293, in verify_with_parallel_config
    raise ValueError(
ValueError: BitAndBytes with enforce_eager = False is not supported yet.

jeejeelee commented 1 month ago

If I remember correctly, BNB only supports enforce_eager = True
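
For reference, a minimal offline sketch of that workaround (assuming vLLM 0.5.4 and the checkpoint path from this report; the kwargs mirror the CLI flags --quantization bitsandbytes, --load-format bitsandbytes and --enforce-eager). Whether it actually loads still depends on the model architecture having bitsandbytes support in that vLLM version; see the AttributeError reported further down.

    from vllm import LLM, SamplingParams

    # Sketch only: pre-quantized bnb-4bit checkpoints need the bitsandbytes
    # load format and quantization, plus eager mode, since CUDA graph capture
    # is not supported with BNB in this version.
    llm = LLM(
        model="/home/workspace/Mistral-Large-Instruct-2407-bnb-4bit",  # example path
        quantization="bitsandbytes",
        load_format="bitsandbytes",
        enforce_eager=True,
        gpu_memory_utilization=0.6,
    )
    out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
    print(out[0].outputs[0].text)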

c123ian commented 1 month ago

I am getting the same error after pip-installing the latest vLLM 0.5.4 in order to solve #5753: openai.InternalServerError: modal-http: internal server error: status Failure: ValueError('BitAndBytes with enforce_eager = False is not supported yet.')

Runner failed with exception: ValueError('BitAndBytes with enforce_eager = False is not supported yet.')
WARNING 08-08 08:27:54 config.py:254] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
Traceback (most recent call last):
  File "/pkg/modal/_container_io_manager.py", line 462, in handle_user_exception
    yield
  File "/pkg/modal/_container_entrypoint.py", line 786, in main
    finalized_functions = service.get_finalized_functions(container_args.function_def, container_io_manager)
  File "/pkg/modal/_container_entrypoint.py", line 152, in get_finalized_functions
    web_callable = construct_webhook_callable(
  File "/pkg/modal/_container_entrypoint.py", line 68, in construct_webhook_callable
    return asgi_app_wrapper(user_defined_callable(), container_io_manager)
  File "/root/vllm_inference.py", line 149, in serve
    engine = AsyncLLMEngine.from_engine_args(
  File "/usr/local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 462, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/usr/local/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 862, in create_engine_config
    return EngineConfig(
  File "<string>", line 15, in __init__
  File "/usr/local/lib/python3.10/site-packages/vllm/config.py", line 1641, in __post_init__
    self.model_config.verify_with_parallel_config(self.parallel_config)
  File "/usr/local/lib/python3.10/site-packages/vllm/config.py", line 293, in verify_with_parallel_config
    raise ValueError(
ValueError: BitAndBytes with enforce_eager = False is not supported yet.

c123ian commented 1 month ago

Currently setting enforce_eager=True as a workaround.
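
In an AsyncLLMEngine setup like the one in the traceback above (e.g. a Modal serve function), the same flag goes on the engine args. A sketch, assuming the model path and memory settings from this thread:

    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.engine.async_llm_engine import AsyncLLMEngine

    # Sketch of the workaround for the async / OpenAI-server path:
    # eager mode is forced because CUDA graphs are not supported with BNB yet.
    engine_args = AsyncEngineArgs(
        model="/home/workspace/Mistral-Large-Instruct-2407-bnb-4bit",  # example path
        quantization="bitsandbytes",
        load_format="bitsandbytes",
        enforce_eager=True,
        gpu_memory_utilization=0.6,
    )
    engine = AsyncLLMEngine.from_engine_args(engine_args)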

mgoin commented 1 month ago

@jeejeelee maybe we should just warn if it is False and always set enforce_eager=True

jeejeelee commented 1 month ago

> @jeejeelee maybe we should just warn if it is False and always set enforce_eager=True

Ok, I'm handling it now.
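
A hypothetical sketch of what such a change could look like (not the merged patch): where ModelConfig.verify_with_parallel_config currently raises, it would warn and flip the flag instead.

    # Hypothetical sketch, not the actual fix: downgrade the hard error to a
    # warning and force eager mode when bitsandbytes quantization is active.
    if self.quantization == "bitsandbytes" and not self.enforce_eager:
        logger.warning(
            "CUDA graphs are not supported with bitsandbytes yet; "
            "falling back to eager mode (enforce_eager=True).")
        self.enforce_eager = True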

lonngxiang commented 1 month ago

AttributeError: Model BitsAndBytesModelLoader does not support BitsAndBytes quantization yet.

SmartFive commented 3 weeks ago

> AttributeError: Model BitsAndBytesModelLoader does not support BitsAndBytes quantization yet.

Have you solved this problem, @lonngxiang?