vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: vllm openai api server never ends in most cases #6228

Open hassanzadeh opened 1 month ago

hassanzadeh commented 1 month ago

Your current environment

Hey guys, I tried the OpenAI API server to load a 70B Llama-3 checkpoint. Out of the 3-4 attempts I made, the model loaded successfully only once, after about 1 hour; the other times nothing happened even after 3 hours of waiting. I'm loading the model on 8xA100/80G Azure nodes. Am I following the right practice? In the failed cases, CUDA memory usage won't exceed 18 GB (it should be around 70-80 GB otherwise).

How would you like to use vllm

(nlp) azureuser@dev-8xa100-b:/mnt/batch/tasks/shared/LS_root/mounts/clusters/dev-8xa100-b/code/Users/hamid.hassanzadeh/cava-data/llama_expr$ NAME='Llama-3-70B-Instruct' && python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --served-model-name $NAME --model /home/azureuser/cloudfiles/code/Users/user/llama_setup/model_catalog_llama/Meta-Llama-3-70B-Instruct/mlflow_model_folder/data/model/ --tensor-parallel-size 8 --trust-remote-code
INFO 07-08 20:49:31 api_server.py:206] vLLM API server version 0.5.1
INFO 07-08 20:49:31 api_server.py:207] args: Namespace(allow_credentials=False, allowed_headers=['*'], allowed_methods=['*'], allowed_origins=['*'], api_key=None, block_size=16, chat_template=None, code_revision=None, device='auto', disable_custom_all_reduce=False, disable_log_requests=False, disable_log_stats=False, disable_sliding_window=False, distributed_executor_backend=None, download_dir=None, dtype='auto', enable_chunked_prefill=False, enable_lora=False, enable_prefix_caching=False, enforce_eager=False, engine_use_ray=False, fully_sharded_loras=False, gpu_memory_utilization=0.9, guided_decoding_backend='outlines', host='0.0.0.0', kv_cache_dtype='auto', load_format='auto', long_lora_scaling_factors=None, lora_dtype='auto', lora_extra_vocab_size=256, lora_modules=None, max_context_len_to_capture=None, max_cpu_loras=None, max_log_len=None, max_logprobs=20, max_lora_rank=16, max_loras=1, max_model_len=None, max_num_batched_tokens=None, max_num_seqs=256, max_parallel_loading_workers=None, max_seq_len_to_capture=8192, middleware=[], model='/home/azureuser/cloudfiles/code/Users/userllama_setup/model_catalog_llama/Meta-Llama-3-70B-Instruct/mlflow_model_folder/data/model/', model_loader_extra_config=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, num_gpu_blocks_override=None, num_lookahead_slots=0, num_speculative_tokens=None, otlp_traces_endpoint=None, pipeline_parallel_size=1, port=8000, preemption_mode=None, qlora_adapter_name_or_path=None, quantization=None, quantization_param_path=None, ray_workers_use_nsight=False, response_role='assistant', revision=None, root_path=None, rope_scaling=None, rope_theta=None, scheduler_delay_factor=0.0, seed=0, served_model_name=['Llama-3-70B-Instruct'], skip_tokenizer_init=False, spec_decoding_acceptance_method='rejection_sampler', speculative_disable_by_batch_size=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_model=None, ssl_ca_certs=None, ssl_cert_reqs=0, ssl_certfile=None, ssl_keyfile=None, swap_space=4, tensor_parallel_size=8, tokenizer=None, tokenizer_mode='auto', tokenizer_pool_extra_config=None, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_revision=None, trust_remote_code=True, typical_acceptance_sampler_posterior_alpha=None, typical_acceptance_sampler_posterior_threshold=None, use_v2_block_manager=False, uvicorn_log_level='info', worker_use_ray=False)
INFO 07-08 20:49:31 config.py:698] Defaulting to use mp for distributed inference
INFO 07-08 20:49:31 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='/home/azureuser/cloudfiles/code/Users/user/llama_setup/model_catalog_llama/Meta-Llama-3-70B-Instruct/mlflow_model_folder/data/model/', speculative_config=None, tokenizer='/home/azureuser/cloudfiles/code/Users/user/llama_setup/model_catalog_llama/Meta-Llama-3-70B-Instruct/mlflow_model_folder/data/model/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=Llama-3-70B-Instruct, use_v2_block_manager=False, enable_prefix_caching=False)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(VllmWorkerProcess pid=8810) INFO 07-08 20:49:35 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=8813) INFO 07-08 20:49:35 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=8808) INFO 07-08 20:49:35 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=8809) INFO 07-08 20:49:35 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=8811) INFO 07-08 20:49:35 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=8812) INFO 07-08 20:49:35 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=8807) INFO 07-08 20:49:35 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=8808) INFO 07-08 20:49:36 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8808) INFO 07-08 20:49:36 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-08 20:49:36 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8810) INFO 07-08 20:49:36 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8810) INFO 07-08 20:49:36 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-08 20:49:36 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=8813) INFO 07-08 20:49:36 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8813) INFO 07-08 20:49:36 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=8807) INFO 07-08 20:49:36 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8807) INFO 07-08 20:49:36 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=8809) INFO 07-08 20:49:36 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8809) INFO 07-08 20:49:36 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=8811) INFO 07-08 20:49:36 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8812) INFO 07-08 20:49:36 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=8811) INFO 07-08 20:49:36 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=8812) INFO 07-08 20:49:36 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=8812) INFO 07-08 20:49:40 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/azureuser/.config/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorkerProcess pid=8810) INFO 07-08 20:49:40 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/azureuser/.config/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorkerProcess pid=8811) INFO 07-08 20:49:40 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/azureuser/.config/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorkerProcess pid=8813) INFO 07-08 20:49:40 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/azureuser/.config/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
INFO 07-08 20:49:40 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/azureuser/.config/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorkerProcess pid=8807) INFO 07-08 20:49:40 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/azureuser/.config/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorkerProcess pid=8809) INFO 07-08 20:49:40 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/azureuser/.config/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorkerProcess pid=8808) INFO 07-08 20:49:40 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/azureuser/.config/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
youkaichao commented 1 month ago

How much CPU memory do you have? What is your disk read speed? If you press Ctrl+C, where does it stop?

I suspect it is stuck in weight loading.
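A quick way to check whether storage is the bottleneck is to time a raw sequential read over the checkpoint shards. A minimal sketch (the path below is hypothetical; adjust it to the mounted model directory):

import glob
import os
import time

MODEL_DIR = "/path/to/Meta-Llama-3-70B-Instruct/model"  # hypothetical path

# Read every checkpoint shard once and report throughput; if this is slow,
# weight loading will be slow no matter what vLLM does.
files = sorted(
    glob.glob(os.path.join(MODEL_DIR, "*.bin"))
    + glob.glob(os.path.join(MODEL_DIR, "*.safetensors"))
)
total = 0
start = time.time()
for path in files:
    with open(path, "rb") as fh:
        while chunk := fh.read(64 * 1024 * 1024):  # 64 MiB reads
            total += len(chunk)
elapsed = time.time() - start
print(f"read {total / 1e9:.1f} GB in {elapsed:.0f} s "
      f"({total / 1e9 / max(elapsed, 1e-9):.2f} GB/s)")

A 70B bf16 checkpoint is roughly 140 GB, so at 100 MB/s this step alone takes over 20 minutes.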

hassanzadeh commented 1 month ago

Thanks. CPU mem: 1.8 TB. I tried to stop it, but it did not stop for at least a minute, so I killed the compute instance by force.

youkaichao commented 1 month ago

https://docs.vllm.ai/en/latest/getting_started/debugging.html might help to debug hang.
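If it helps, here is a minimal sketch of turning on more verbose output before launching, based on my reading of that debugging guide plus standard NCCL/CUDA variables (double-check the exact names against the doc):

import os

# Assumed env vars; set them before importing/launching vLLM, or export them
# in the shell before `python -m vllm.entrypoints.openai.api_server ...`.
os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"   # more verbose vLLM logs
os.environ["VLLM_TRACE_FUNCTION"] = "1"      # trace vLLM function calls (per the debugging guide)
os.environ["NCCL_DEBUG"] = "TRACE"           # verbose NCCL logs
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"     # synchronous CUDA launches, clearer stack traces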

hassanzadeh commented 1 month ago

Hi @youkaichao, I tried to shard the model using the script in the vLLM repo, but unfortunately it gets stuck too. What do you think?

youkaichao commented 1 month ago

Before sharding the model, I think you need to follow https://docs.vllm.ai/en/latest/getting_started/debugging.html to figure out where it gets stuck and gather more information.

hassanzadeh commented 1 month ago

Before sharding the model, I think you need to follow https://docs.vllm.ai/en/latest/getting_started/debugging.html to figure out where it gets stuck and gather more information.

Thanks, I'm going to do that now. As an additional note, we do not have access to the outside world from the compute nodes, e.g. if the script tries to reach Hugging Face somewhere, it won't be able to do so. Could that be one reason the model gets stuck?
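For what it's worth, the standard Hugging Face offline switches can be set to rule that out; a minimal sketch (these are generic huggingface_hub/transformers variables, not something confirmed in this thread):

import os

os.environ["HF_HUB_OFFLINE"] = "1"        # huggingface_hub: never touch the network
os.environ["TRANSFORMERS_OFFLINE"] = "1"  # transformers: use local files only
# With these set, an attempted Hub lookup should fail fast instead of hanging.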

hassanzadeh commented 1 month ago

Before sharding the model, I think you need to follow https://docs.vllm.ai/en/latest/getting_started/debugging.html to figure out where it gets stuck and gather more information.

Also, when I kill the process, the traceback is as follows:

Traceback (most recent call last):
  File "examples/save_sharded_state.py", line 75, in <module>
    main(args)
  File "examples/save_sharded_state.py", line 55, in main
    llm = LLM(**dataclasses.asdict(engine_args))
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/entrypoints/llm.py", line 149, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 414, in from_engine_args
    engine = cls(
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 243, in __init__
    self.model_executor = executor_class(
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
    super().__init__(*args, **kwargs)
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/executor/executor_base.py", line 42, in __init__
    self._init_executor()
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/executor/multiproc_gpu_executor.py", line 79, in _init_executor
    self._run_workers("load_model",
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/executor/multiproc_gpu_executor.py", line 130, in _run_workers
    driver_worker_output = driver_worker_method(*args, **kwargs)
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/worker/worker.py", line 133, in load_model
    self.model_runner.load_model()
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 243, in load_model
    self.model = get_model(
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
    return loader.load_model(model_config=model_config,
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/model_executor/model_loader/loader.py", line 270, in load_model
    model.load_weights(
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 443, in load_weights
    for name, loaded_weight in weights:
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/model_executor/model_loader/weight_utils.py", line 369, in pt_weights_iterator
    state = torch.load(bin_file, map_location="cpu")
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/torch/serialization.py", line 1025, in load
    return _load(opened_zipfile,
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/torch/serialization.py", line 1446, in _load
    result = unpickler.load()
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/torch/serialization.py", line 1416, in persistent_load
    typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/torch/serialization.py", line 1381, in load_tensor
    storage = zip_file.get_storage_from_record(name, numel, torch.UntypedStorage)._typed_storage()._untyped_storage

Sounds like it is stuck reading from storage, or something related to that.
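As an aside, if you can edit the entry script, the stdlib faulthandler lets you dump stacks on demand without killing the process; a minimal sketch (not something from this thread):

import faulthandler
import signal

# `kill -USR1 <pid>` will then print the stack of every thread to stderr.
faulthandler.register(signal.SIGUSR1)
# Or periodically dump stacks while the process appears hung.
faulthandler.dump_traceback_later(600, repeat=True)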

hassanzadeh commented 1 month ago

Alright, looks like the issue was indeed storage. Just one question: sharding with quantization=None means no quantization, is that right? I want the exact weights sharded without any change.

youkaichao commented 1 month ago

Which sharding script do you use?

hassanzadeh commented 1 month ago

The one in the examples directory, save_sharded_state.py.
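For reference, a minimal sketch of how I understand that script drives the engine (paths are hypothetical, and the EngineArgs mirror the LLM(**dataclasses.asdict(engine_args)) call in the traceback above, so double-check against the script itself):

import dataclasses
from vllm import LLM, EngineArgs

engine_args = EngineArgs(
    model="/path/to/Meta-Llama-3-70B-Instruct",  # hypothetical path
    tensor_parallel_size=8,
    quantization=None,  # leaving this as None should mean no quantization
)
llm = LLM(**dataclasses.asdict(engine_args))
# save_sharded_state.py then writes one shard per tensor-parallel rank into an
# output directory, which can later be loaded with --load-format sharded_state.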