Your current environment

How would you like to use vllm

Hey guys, I tried the OpenAI API server to load a 70B Llama-3 checkpoint. Out of the 3-4 attempts I made, the model loaded successfully only once, after about 1 hour; the other times nothing happened even after 3 hours of waiting. I'm loading the model on 8xA100/80G Azure nodes. Am I following the right practice? In the failed cases, CUDA memory usage won't exceed 18G (it should be around 70-80G otherwise).
How much CPU memory do you have? What is your disk read speed? If you press Ctrl+C, where does it stop?
I suspect it is stuck in weight loading.
Thanks. CPU mem: 1.8TB. I tried to stop it; I think it did not stop for at least a minute, so I killed the compute instance by force.
https://docs.vllm.ai/en/latest/getting_started/debugging.html might help to debug the hang.
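The page above boils down to turning on more verbose logging and tracing before creating the engine, so the logs show which step the hang happens on. A minimal sketch of those settings (env var names as listed on that page at the time; the model path is a placeholder):

```python
# Sketch: enable vLLM debug logging/tracing before the engine is created.
# The model path below is a placeholder, not from this issue.
import os

os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"   # more verbose engine logging
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"     # surface CUDA errors at the failing call
os.environ["NCCL_DEBUG"] = "TRACE"           # verbose NCCL logs for multi-GPU hangs
os.environ["VLLM_TRACE_FUNCTION"] = "1"      # trace function calls to locate the hang

from vllm import LLM

llm = LLM(model="/path/to/llama-3-70b", tensor_parallel_size=8)
```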
Hi @youkaichao, I tried to shard the model using the script in the vLLM repo, but unfortunately it gets stuck too. What do you think?
Before sharding the model, I think you need to follow https://docs.vllm.ai/en/latest/getting_started/debugging.html to figure out where it gets stuck and gather more information.
Thanks, I'm going to do that now. As an additional note, we do not have access to the outside world from the compute nodes, e.g. if the script tries to access Hugging Face somewhere, it won't be able to do so. Could that be one reason the model gets stuck?
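If the compute nodes are fully offline, forcing the Hugging Face stack into offline mode and pointing vLLM at a local copy of the weights rules out any network stall. A sketch (these are standard Hugging Face environment variables, not vLLM-specific):

```python
# Sketch: force fully-offline operation; set these before vLLM / transformers
# are imported. Assumes the weights are already on local disk.
import os

os.environ["HF_HUB_OFFLINE"] = "1"        # huggingface_hub: never touch the network
os.environ["TRANSFORMERS_OFFLINE"] = "1"  # transformers: resolve files locally only

# Then point --model (or LLM(model=...)) at a local directory that already
# contains config.json, the tokenizer files, and the weight shards,
# so nothing has to be downloaded.
```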
Also, when I kill the process, the traceback is as follows:
```
Traceback (most recent call last):
  File "examples/save_sharded_state.py", line 75, in <module>
    main(args)
  File "examples/save_sharded_state.py", line 55, in main
    llm = LLM(**dataclasses.asdict(engine_args))
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/entrypoints/llm.py", line 149, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 414, in from_engine_args
    engine = cls(
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 243, in __init__
    self.model_executor = executor_class(
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
    super().__init__(*args, **kwargs)
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/executor/executor_base.py", line 42, in __init__
    self._init_executor()
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/executor/multiproc_gpu_executor.py", line 79, in _init_executor
    self._run_workers("load_model",
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/executor/multiproc_gpu_executor.py", line 130, in _run_workers
    driver_worker_output = driver_worker_method(*args, **kwargs)
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/worker/worker.py", line 133, in load_model
    self.model_runner.load_model()
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 243, in load_model
    self.model = get_model(
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
    return loader.load_model(model_config=model_config,
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/model_executor/model_loader/loader.py", line 270, in load_model
    model.load_weights(
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 443, in load_weights
    for name, loaded_weight in weights:
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/vllm/model_executor/model_loader/weight_utils.py", line 369, in pt_weights_iterator
    state = torch.load(bin_file, map_location="cpu")
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/torch/serialization.py", line 1025, in load
    return _load(opened_zipfile,
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/torch/serialization.py", line 1446, in _load
    result = unpickler.load()
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/torch/serialization.py", line 1416, in persistent_load
    typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "/anaconda/envs/nlp/lib/python3.8/site-packages/torch/serialization.py", line 1381, in load_tensor
    storage = zip_file.get_storage_from_record(name, numel, torch.UntypedStorage)._typed_storage()._untyped_storage
```
Sounds like it is stuck waiting on a read from storage, or something related to that.
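One quick way to test that theory is to time a raw sequential read of a single checkpoint shard and compare it against the disk's expected throughput. A rough sketch (the path is a placeholder, and a repeat run can look faster because of the OS page cache):

```python
# Rough sequential-read throughput check for one weight shard.
import glob
import time

shard = sorted(glob.glob("/data/checkpoints/llama-3-70b/*.bin"))[0]  # placeholder path
chunk_size = 64 * 1024 * 1024  # read in 64 MiB chunks

start = time.perf_counter()
total = 0
with open(shard, "rb") as f:
    while True:
        buf = f.read(chunk_size)
        if not buf:
            break
        total += len(buf)
elapsed = time.perf_counter() - start
print(f"read {total / 1e9:.1f} GB in {elapsed:.1f} s "
      f"({total / 1e9 / elapsed:.2f} GB/s)")
```

A 70B fp16 checkpoint is roughly 140 GB, so at, say, 100 MB/s it would take well over 20 minutes just to read the files once.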
Alright, looks like the issue was indeed storage. Just one question: sharding with quantization=None means no quantization, is that right? I don't want the weights changed in any way; I just want the exact weights sharded.
Which sharding script do you use?
The one in the examples directory, save_sharded_state.py.
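For reference, quantization=None means no quantization is applied, so the script saves the weights unchanged, just re-partitioned into per-rank files. Loading the sharded output back would look roughly like this (a sketch; the path is a placeholder):

```python
# Sketch: load the output of examples/save_sharded_state.py back into vLLM.
from vllm import LLM

llm = LLM(
    model="/data/checkpoints/llama-3-70b-sharded",  # directory written by the script (placeholder)
    load_format="sharded_state",  # read the pre-sharded per-rank files directly
    tensor_parallel_size=8,       # must match the TP size used when sharding
    quantization=None,            # None = keep the original, unquantized weights
)
```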
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!