vibhas-singh opened this issue 3 months ago
Please go through this troubleshooting guide and see if it can resolve your issue.
@DarkLight1337 I tried turning on the debugging using the flags - though I am not observing anything weird.
Sharing the logs here from the console (have redacted the IP):
INFO 07-30 16:24:13 llm_engine.py:176] Initializing an LLM engine (v0.5.3) with config: model='merged_model_1', speculative_config=None, tokenizer='code-llama-7b-text-to-sql', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=merged_model_1, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 07-30 16:24:14 logger.py:146] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 07-30 16:24:14 logger.py:150] Trace frame log is saved to /tmp/vllm/vllm-instance-e4e21a3b888745af8bd6ea332ced22e3/VLLM_TRACE_FUNCTION_for_process_15863_thread_140205088794432_at_2024-07-30_16:24:14.428276.log
DEBUG 07-30 16:24:17 parallel_state.py:803] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://XXX.XX.XX.XX:57753 backend=nccl
INFO 07-30 16:24:17 model_runner.py:680] Starting to load model merged_model_1...
Loading safetensors checkpoint shards: 0% Completed | 0/7 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 14% Completed | 1/7 [08:12<49:16, 492.67s/it]
The sanity check is also successful.
@youkaichao can you help with this?
The log shows it is still loading the model, so it might be a disk I/O problem.
Apologies for the naive question, but is there a way to identify disk I/O problems? Would it be possible to add a sanity check for disk I/O, like the one for incorrect hardware/drivers in the troubleshooting guide?
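In the absence of an official vLLM sanity script, a rough self-service check is to time a large sequential read in plain Python. This is only a sketch, not a vLLM tool; the temporary file stands in for a real checkpoint shard, and note that the OS page cache can inflate the number if the file was read recently:

```python
import os
import tempfile
import time

def disk_read_throughput(path: str, block_size: int = 1 << 20) -> float:
    """Read `path` sequentially in 1 MiB blocks and return MB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return (total / (1024 * 1024)) / elapsed

# Demo on a 64 MiB scratch file; in practice, point `path` at an
# actual safetensors shard on the volume you want to measure.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(64 * 1024 * 1024))
    path = f.name
print(f"{disk_read_throughput(path):.1f} MB/s")
os.remove(path)
```

If this reports single-digit MB/s on a checkpoint shard, slow storage is very likely the bottleneck.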
Did you ever find the answer? I'm hitting the same slow-loading problem. I'm on an HDD; I noticed the weights were already loaded onto the GPU, yet the program just kept waiting.
I moved the models to an SSD and the problem was solved: loading time dropped to about 2 minutes, compared to 60 minutes on the HDD.
@Ziyi6 Not really.
I am using Amazon SageMaker notebook instances, and from what I can infer from the docs, they already use general-purpose SSD storage. Refer: https://repost.aws/questions/QUsip6vrGlRQOhlyEjZNeEMg/ebs-volume-type-for-sagemaker-notebook-instances
I have yet to find a way to check the actual I/O speed I am getting; that is why I asked the maintainers whether it would be possible to add a debugging script for slow I/O as well.
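On a SageMaker notebook (or any Linux box), the effective sequential read speed can be estimated with `dd`. The scratch file below keeps the example self-contained; in practice, you would point `src` at a real safetensors shard, and the page cache can inflate the number on repeat runs:

```shell
#!/bin/sh
# Create a 64 MiB scratch file; replace `src` with a real checkpoint
# shard to measure the volume the model actually lives on.
src=$(mktemp)
dd if=/dev/zero of="$src" bs=1M count=64 2>/dev/null
sync
# Time a sequential read; dd reports throughput on stderr.
dd if="$src" of=/dev/null bs=1M 2>&1 | tail -n 1
rm -f "$src"
```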
Your current environment
🐛 Describe the bug
I have fine-tuned a Llama 7B model using transformers and QLoRA, then merged the LoRA weights into the base model and saved it to disk. The saved weights look something like this:
Now I am trying to load these weights using the python API of vLLM using the following code:
It is taking almost forever to load all the shards. The progress bar estimates the total load time at 51 minutes; loading a single shard takes almost 9 minutes.
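For context, the per-shard time in the log (~492 s) implies a very low effective read rate. Assuming each of the 7 shards is roughly 2 GiB (a plausible size for a 7B model in bf16, not a figure from the issue itself), the arithmetic works out to only a few MB/s:

```python
# Back-of-the-envelope throughput from the log: 1 shard in ~492.67 s.
# The ~2 GiB shard size is an assumption, not stated in the issue.
shard_bytes = 2 * 1024**3
seconds_per_shard = 492.67
mb_per_s = shard_bytes / (1024**2) / seconds_per_shard
print(f"~{mb_per_s:.1f} MB/s effective read speed")  # ~4.2 MB/s
```

Even a commodity HDD should manage far more than that sequentially, which points at random access patterns, a throttled volume, or some other I/O bottleneck rather than raw disk bandwidth alone.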
A few additional points:
- If I load the same model with `transformers` using the following code, the first load takes around 12 minutes, while all subsequent loads take only 10-20 seconds.
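The slow-first-load, fast-later-loads pattern is usually the OS page cache at work: the first read pulls the weights from disk, and subsequent reads are served from RAM. The effect can be sketched with a repeated file read (hypothetical scratch file; exact timings vary by machine, and a freshly written file may already be partly cached, so a true cold read requires dropping caches, which needs root):

```python
import os
import tempfile
import time

# Scratch file standing in for a model shard (size is arbitrary).
size = 64 * 1024 * 1024
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(size))
    path = f.name

def timed_read(p):
    """Return (elapsed seconds, bytes read) for a full sequential read."""
    start = time.perf_counter()
    with open(p, "rb") as fh:
        data = fh.read()
    return time.perf_counter() - start, len(data)

first, n1 = timed_read(path)   # may touch disk
second, n2 = timed_read(path)  # almost certainly served from page cache
print(f"first read:  {first:.3f}s ({n1} bytes)")
print(f"second read: {second:.3f}s ({n2} bytes)")
os.remove(path)
```

If vLLM's load is slow even on warm repeats while `transformers` is fast, the difference is worth reporting; if both are slow only on the first cold read, storage speed is the likelier culprit.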