vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Unable to run phi-3-small in latest release #6334

Closed: ssmi153 closed this issue 1 month ago

ssmi153 commented 1 month ago

Your current environment

Running vllm openai docker container on a single A5000 GPU on Runpod.

Initialisation settings: --host 0.0.0.0 --model microsoft/Phi-3-small-8k-instruct --tensor-parallel-size 1 --max-model-len 8192 --trust-remote-code
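
For reference, on the official image those settings amount to a launch roughly like the following (a sketch only; the image tag, GPU flag and port mapping are assumptions, adjust to your setup):

# sketch of the launch described above; image tag and port mapping are assumptions
docker run --gpus all -p 8000:8000 vllm/vllm-openai:v0.5.1 \
    --host 0.0.0.0 \
    --model microsoft/Phi-3-small-8k-instruct \
    --tensor-parallel-size 1 \
    --max-model-len 8192 \
    --trust-remote-code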

šŸ› Describe the bug

Error on launch when running release 0.5.1 and trying to run Phi-3-small-8k-instruct. This is new in this release and was not an issue in v0.5.0.post1. Other models seem to work fine (tested with Mistral and Llama).

2024-07-11T12:28:23.200321639Z [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/phi3_small.py", line 361, in __init__
2024-07-11T12:28:23.200325972Z [rank0]:     self.model = Phi3SmallModel(config, cache_config, quant_config)
2024-07-11T12:28:23.200340022Z [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/phi3_small.py", line 310, in __init__
2024-07-11T12:28:23.200344449Z [rank0]:     self.layers = nn.ModuleList([
2024-07-11T12:28:23.200348779Z [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/phi3_small.py", line 311, in <listcomp>
2024-07-11T12:28:23.200353076Z [rank0]:     Phi3SmallDecoderLayer(config, layer_idx, cache_config,
2024-07-11T12:28:23.200357223Z [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/phi3_small.py", line 261, in __init__
2024-07-11T12:28:23.200361509Z [rank0]:     self.self_attn = Phi3SmallSelfAttention(config,
2024-07-11T12:28:23.200366651Z [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/phi3_small.py", line 213, in __init__
2024-07-11T12:28:23.200370883Z [rank0]:     self.attn = Attention(
2024-07-11T12:28:23.200375261Z [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 77, in __init__
2024-07-11T12:28:23.200379563Z [rank0]:     attn_backend = get_attn_backend(num_heads, head_size, num_kv_heads,
2024-07-11T12:28:23.200383728Z [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 41, in get_attn_backend
2024-07-11T12:28:23.200388053Z [rank0]:     from vllm.attention.backends.blocksparse_attn import (
2024-07-11T12:28:23.200392951Z [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/blocksparse_attn.py", line 8, in <module>
2024-07-11T12:28:23.200397378Z [rank0]:     from vllm.attention.ops.blocksparse_attention.interface import (
2024-07-11T12:28:23.200401813Z [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/ops/blocksparse_attention/interface.py", line 8, in <module>
2024-07-11T12:28:23.200409568Z [rank0]:     from .utils import (dense_to_crow_col, get_head_sliding_step,
2024-07-11T12:28:23.200413858Z [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/ops/blocksparse_attention/utils.py", line 13, in <module>
2024-07-11T12:28:23.200418272Z [rank0]:     raise ImportError("Please install scipy via "
2024-07-11T12:28:23.200422491Z [rank0]: ImportError: Please install scipy via `pip install scipy` to use BlockSparseAttention in models such as Phi-3.
2024-07-11T12:28:39.896866889Z INFO 07-11 12:28:39 api_server.py:206] vLLM API server version 0.5.1
2024-07-11T12:28:39.896918569Z INFO 07-11 12:28:39 api_server.py:207] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='microsoft/Phi-3-small-8k-instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8192, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2024-07-11T12:28:40.554598447Z INFO 07-11 12:28:40 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='microsoft/Phi-3-small-8k-instruct', speculative_config=None, tokenizer='microsoft/Phi-3-small-8k-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=microsoft/Phi-3-small-8k-instruct, use_v2_block_manager=False, enable_prefix_caching=False)
2024-07-11T12:28:41.503188888Z WARNING 07-11 12:28:41 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
2024-07-11T12:28:41.986576214Z INFO 07-11 12:28:41 selector.py:40] Using BlocksparseFlashAttention backend.
2024-07-11T12:28:42.102180877Z [rank0]: Traceback (most recent call last):
2024-07-11T12:28:42.102227516Z [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/ops/blocksparse_attention/utils.py", line 11, in <module>
2024-07-11T12:28:42.102233654Z [rank0]:     from scipy import sparse
2024-07-11T12:28:42.102238066Z [rank0]: ModuleNotFoundError: No module named 'scipy'
mgoin commented 1 month ago

@ssmi153 Phi-3-small requires an optional dependency because of its blocksparse attention. You need to add scipy to your image for it to work. This is intended behavior.
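
If you are on the official vllm/vllm-openai image, a minimal sketch of one way to do that is a small derived Dockerfile (the base tag below is an assumption, use whichever release you run):

# Dockerfile sketch: extend the official image with scipy; base tag is an assumption
FROM vllm/vllm-openai:v0.5.1
RUN pip install scipy

Build it with something like docker build -t vllm-openai-scipy . (the tag name is arbitrary) and launch it in place of the stock image.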

ssmi153 commented 1 month ago

Thanks @mgoin. I'm using the official vLLM OpenAI Docker container for this. Rather than removing the scipy dependency, the other option would be to just add scipy to this Docker container.

atineoSE commented 1 month ago

@ssmi153 here is what I've done to install the scipy dependency in the Docker image:

git clone git@github.com:vllm-project/vllm.git
cd vllm
patch Dockerfile << EOF
202c202
<     pip install accelerate hf_transfer 'modelscope!=1.15.0'
---
>     pip install accelerate hf_transfer 'modelscope!=1.15.0' scipy
EOF
sudo docker build -t vllm_scipy .

This patches the Dockerfile to add the scipy dependency in the final build stage, then builds a new image with that dependency included (the build takes a while).

You can then run the new Docker image vllm_scipy and the model will load successfully.
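
For example, the launch from the report then becomes something like the following (same flags as above; GPU flag and port mapping are assumptions):

# run the patched image in place of the stock one
sudo docker run --gpus all -p 8000:8000 vllm_scipy \
    --host 0.0.0.0 \
    --model microsoft/Phi-3-small-8k-instruct \
    --tensor-parallel-size 1 \
    --max-model-len 8192 \
    --trust-remote-code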

ssmi153 commented 1 month ago

Thanks for the workaround @atineoSE, and thanks to @mgoin for implementing a fix.