vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: ModuleNotFoundError: No module named 'bitsandbytes' #5503

Open · emillykkejensen opened this issue 3 weeks ago

emillykkejensen commented 3 weeks ago

Your current environment

Using Docker!

🐛 Describe the bug

Running the v0.5.0 Docker image with bitsandbytes quantization gives me the following error:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/bitsandbytes.py", line 83, in __init__
[rank0]:     import bitsandbytes
[rank0]: ModuleNotFoundError: No module named 'bitsandbytes'

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 196, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 395, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 349, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 470, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 223, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 24, in _init_executor
[rank0]:     self.driver_worker.load_model()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 121, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 147, in load_model
[rank0]:     self.model = get_model(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 775, in load_model
[rank0]:     model = _initialize_model(model_config, self.load_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 97, in _initialize_model
[rank0]:     return model_class(config=model_config.hf_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 340, in __init__
[rank0]:     self.model = LlamaModel(config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 262, in __init__
[rank0]:     self.layers = nn.ModuleList([
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 263, in <listcomp>
[rank0]:     LlamaDecoderLayer(config=config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 188, in __init__
[rank0]:     self.self_attn = LlamaAttention(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 122, in __init__
[rank0]:     self.qkv_proj = QKVParallelLinear(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 540, in __init__
[rank0]:     super().__init__(input_size=input_size,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 233, in __init__
[rank0]:     super().__init__(input_size, output_size, skip_bias_add, params_dtype,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 147, in __init__
[rank0]:     self.quant_method = quant_config.get_quant_method(self)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/bitsandbytes.py", line 67, in get_quant_method
[rank0]:     return BitsAndBytesLinearMethod(self)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/bitsandbytes.py", line 88, in __init__
[rank0]:     raise ImportError("Please install bitsandbytes>=0.42.0 via "
[rank0]: ImportError: Please install bitsandbytes>=0.42.0 via `pip install bitsandbytes>=0.42.0` to use bitsandbytes quantizer.
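
For reference, a command along these lines triggers the error (a sketch: the model ID is a placeholder, and the two bitsandbytes flags are the ones documented for v0.5.0):

docker run --gpus all -p 8000:8000 vllm/vllm-openai:v0.5.0 --model <model-id> --quantization bitsandbytes --load-format bitsandbytes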
jeejeelee commented 3 weeks ago

This is intended behavior: if you want to use bitsandbytes with vLLM, you must first install bitsandbytes yourself.

emillykkejensen commented 3 weeks ago

Okay, thanks for the clarification. Is there a preferred way of adding feature dependencies to the vLLM image at runtime?

jeejeelee commented 3 weeks ago

If I understand correctly, perhaps you can try:

docker exec -ti container_id /bin/bash 

After entering the container, run:

pip install "bitsandbytes>=0.42.0"
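
Note that anything installed this way is lost when the container is recreated. If you prefer a one-shot install without opening an interactive shell, the same can be done directly (same container_id placeholder as above):

docker exec container_id pip install "bitsandbytes>=0.42.0"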

BTW, if you build the image yourself from a Dockerfile, you can add this dependency there instead.
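
For example, a minimal derived image could look like this (a sketch, assuming the official vllm/vllm-openai:v0.5.0 image as the base):

FROM vllm/vllm-openai:v0.5.0
RUN pip install "bitsandbytes>=0.42.0"

Building and running that image keeps the dependency baked in, so no per-container install is needed.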

simon-mo commented 3 weeks ago

Please send a PR to include it in the Dockerfile so it can work out of the box.