vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: vllm hangs after model download / load #7303

Open ArtificialEU opened 1 month ago

ArtificialEU commented 1 month ago

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

On the Tesla T4, vLLM "hangs" after loading the model (VRAM usage spikes as expected and then stays constant), but nothing comes after that.

Setup
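The container spec itself isn't shown; as a rough shell equivalent, here is an invocation reconstructed from the args logged below (image name taken from the `docker ps` output further down in the thread, so treat every value as an assumption rather than the actual manifest):

```bash
# Hypothetical equivalent of the Tesla T4 deployment, inferred from the logged args;
# not the reporter's actual container spec.
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:latest \
  --host 0.0.0.0 --port 8000 \
  --model microsoft/Phi-3-mini-4k-instruct \
  --dtype float16
```

Container logs on the Tesla T4: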

INFO 08-08 14:40:35 api_server.py:219] vLLM API server version 0.5.3.post1
INFO 08-08 14:40:35 api_server.py:220] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='microsoft/Phi-3-mini-4k-instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='float16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 08-08 14:40:35 config.py:1425] Casting torch.bfloat16 to torch.float16.
INFO 08-08 14:40:35 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='microsoft/Phi-3-mini-4k-instruct', speculative_config=None, tokenizer='microsoft/Phi-3-mini-4k-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=microsoft/Phi-3-mini-4k-instruct, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-08 14:40:36 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-08 14:40:36 selector.py:54] Using XFormers backend.
INFO 08-08 14:40:37 model_runner.py:680] Starting to load model microsoft/Phi-3-mini-4k-instruct...
INFO 08-08 14:40:37 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-08 14:40:37 selector.py:54] Using XFormers backend.
INFO 08-08 14:40:37 weight_utils.py:223] Using model weights format ['*.safetensors']

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:19<00:19, 19.36s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:57<00:00, 30.26s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:57<00:00, 28.63s/it]
INFO 08-08 14:41:35 model_runner.py:692] Loading model weights took 7.1183 GB
INFO 08-08 14:41:37 gpu_executor.py:102] # GPU blocks: 953, # CPU blocks: 682
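(Comparing with the healthy 4090 trace below, the T4 log stops exactly where CUDA graph capture would begin; that trace also mentions the `--enforce-eager` flag, so one untested sanity check is to start in eager mode and see whether startup gets past this point.)

```bash
# Sketch: skip CUDA graph capture entirely; same hypothetical invocation as above,
# plus --enforce-eager. Untested here, just a way to narrow down where it hangs.
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model microsoft/Phi-3-mini-4k-instruct \
  --dtype float16 \
  --enforce-eager
```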

On my 4090, here is the expected trace:

root@ai:/home/ai/ai/works/backend/vllm# docker logs -f 66c6c6a975b5
INFO 08-08 14:47:21 api_server.py:219] vLLM API server version 0.5.3.post1
INFO 08-08 14:47:21 api_server.py:220] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='microsoft/Phi-3-mini-4k-instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=16384, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 317, in <module>
    run_server(args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 231, in run_server
    if llm_engine is not None else AsyncLLMEngine.from_engine_args(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 457, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/arg_utils.py", line 699, in create_engine_config
    model_config = ModelConfig(
  File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 167, in __init__
    self.max_model_len = _get_and_verify_max_len(
  File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 1532, in _get_and_verify_max_len
    raise ValueError(
ValueError: User-specified max_model_len (16384) is greater than the derived max_model_len (max_position_embeddings=4096 or model_max_length=None in model's config.json). This may lead to incorrect model outputs or CUDA errors. Make sure the value is correct and within the model context size.
root@ai:/home/ai/ai/works/backend/vllm# docker compose up -d
[+] Running 1/1
 ✔ Container vllm-vllm-1  Started                                                                                            8.8s 
root@ai:/home/ai/ai/works/backend/vllm# docker ps
CONTAINER ID   IMAGE                     COMMAND                  CREATED          STATUS                            PORTS                                       NAMES
5f4ef9932a3e   vllm/vllm-openai:latest   "python3 -m vllm.ent…"   10 seconds ago   Up 5 seconds (health: starting)   0.0.0.0:8000->8000/tcp, :::8000->8000/tcp   vllm-vllm-1
root@ai:/home/ai/ai/works/backend/vllm# docker logs -f 5f4ef9932a3e
INFO 08-08 14:47:50 api_server.py:219] vLLM API server version 0.5.3.post1
INFO 08-08 14:47:50 api_server.py:220] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='microsoft/Phi-3-mini-4k-instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 08-08 14:47:51 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='microsoft/Phi-3-mini-4k-instruct', speculative_config=None, tokenizer='microsoft/Phi-3-mini-4k-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=microsoft/Phi-3-mini-4k-instruct, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-08 14:47:54 selector.py:170] Cannot use FlashAttention-2 backend due to sliding window.
INFO 08-08 14:47:54 selector.py:54] Using XFormers backend.
INFO 08-08 14:47:57 model_runner.py:680] Starting to load model microsoft/Phi-3-mini-4k-instruct...
INFO 08-08 14:47:58 selector.py:170] Cannot use FlashAttention-2 backend due to sliding window.
INFO 08-08 14:47:58 selector.py:54] Using XFormers backend.
INFO 08-08 14:47:58 weight_utils.py:223] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  2.91it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.85it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.96it/s]

INFO 08-08 14:49:10 model_runner.py:692] Loading model weights took 7.1183 GB
INFO 08-08 14:49:10 gpu_executor.py:102] # GPU blocks: 2323, # CPU blocks: 682
INFO 08-08 14:49:13 model_runner.py:980] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-08 14:49:13 model_runner.py:984] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-08 14:49:23 model_runner.py:1181] Graph capturing finished in 11 secs.
WARNING 08-08 14:49:23 serving_embedding.py:170] embedding_mode is False. Embedding API will not work.
INFO 08-08 14:49:23 api_server.py:292] Available routes are:
INFO 08-08 14:49:23 api_server.py:297] Route: /openapi.json, Methods: GET, HEAD
INFO 08-08 14:49:23 api_server.py:297] Route: /docs, Methods: GET, HEAD
INFO 08-08 14:49:23 api_server.py:297] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 08-08 14:49:23 api_server.py:297] Route: /redoc, Methods: GET, HEAD
INFO 08-08 14:49:23 api_server.py:297] Route: /health, Methods: GET
INFO 08-08 14:49:23 api_server.py:297] Route: /tokenize, Methods: POST
INFO 08-08 14:49:23 api_server.py:297] Route: /detokenize, Methods: POST
INFO 08-08 14:49:23 api_server.py:297] Route: /v1/models, Methods: GET
INFO 08-08 14:49:23 api_server.py:297] Route: /version, Methods: GET
INFO 08-08 14:49:23 api_server.py:297] Route: /v1/chat/completions, Methods: POST
INFO 08-08 14:49:23 api_server.py:297] Route: /v1/completions, Methods: POST
INFO 08-08 14:49:23 api_server.py:297] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

I don't understand why, since the VRAM usage is nowhere close and it's a simple model. Any help is appreciated.

youkaichao commented 1 month ago

you can follow https://docs.vllm.ai/en/latest/getting_started/debugging.html to get more information on what's going on.
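Among other things, that page suggests turning up logging and tracing before reproducing the hang; roughly (the exact set of variables depends on the vLLM version):

```bash
# Verbose logging/tracing while reproducing the hang, per the debugging guide
# (availability of these variables varies across vLLM versions).
export VLLM_LOGGING_LEVEL=DEBUG   # more detailed vLLM logs
export CUDA_LAUNCH_BLOCKING=1     # surface the exact failing CUDA call
export NCCL_DEBUG=TRACE           # NCCL details (mainly relevant for multi-GPU)
export VLLM_TRACE_FUNCTION=1      # log every function call to see where it stops
python3 -m vllm.entrypoints.openai.api_server --model microsoft/Phi-3-mini-4k-instruct
```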

wlwqq commented 1 month ago

@youkaichao

I also get this error on a Tesla T4 running the model gemma-2-2b-it:

INFO 08-12 14:54:48 selector.py:79] Using Flashinfer backend.
WARNING 08-12 14:54:48 selector.py:80] Flashinfer will be stuck on llama-2-7b, please avoid using Flashinfer as the backend when running on llama-2-7b.
INFO 08-12 14:54:49 selector.py:79] Using Flashinfer backend.
WARNING 08-12 14:54:49 selector.py:80] Flashinfer will be stuck on llama-2-7b, please avoid using Flashinfer as the backend when running on llama-2-7b.

It hangs after the above logs.
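Both of the stuck lines come from the FlashInfer backend selection, so one thing worth trying (a sketch only; whether the alternative backend supports Gemma-2's logit soft-capping in this version needs checking) is forcing a different attention backend:

```bash
# Sketch: override the auto-selected attention backend via an environment variable.
# Whether XFORMERS handles Gemma-2's logit soft-capping on this version is unverified.
export VLLM_ATTENTION_BACKEND=XFORMERS
python3 -m vllm.entrypoints.openai.api_server --model google/gemma-2-2b-it
```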

burakaytan commented 1 month ago

Did you solve the problem? I'm experiencing the same issue.

ccs96307 commented 4 days ago

I encountered the same problem when loading gemma-2; my GPU is a V100.

bodasadallah commented 1 day ago

I had the same issue, and for me the problem was memory. If you are using a compute cluster, make sure you are also allocating enough physical memory (RAM), not only GPU memory.
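For example, on a Slurm-managed cluster (Slurm is an assumption here; the same idea applies to other schedulers or to Docker memory limits), the job should request host RAM explicitly alongside the GPU:

```bash
#!/bin/bash
# Sketch of a Slurm job that reserves host RAM as well as a GPU.
# Slurm and the specific numbers are assumptions; adjust for your scheduler and model.
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G   # physical memory; too little here can stall model loading
python3 -m vllm.entrypoints.openai.api_server --model microsoft/Phi-3-mini-4k-instruct
```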

ccs96307 commented 14 hours ago

Thanks, you are correct. I reviewed my deployment environment and found that I had not allocated enough memory.