Open TypeFloat opened 3 months ago
Having the same issue. Cannot even do tensor_parallel_size=2
Same. Cannot run on 2 GPUs.
P.S. Actually, it can run on 2 GPUs. But when I run nsys profile along with it, the bug occurs. I suspect it has something to do with RAM.
Any update on this? I tried with NVIDIA Nsight Systems version 2023.4.1.97-234133557503v0, but even with that, it is not working.
Same... I guess you could try a smaller model.
Same issue. My nsys version is NVIDIA Nsight Systems 2024.5.1.113-245134619542v0, with vllm 0.5.4@e6e42e.
same issue. ERROR 08-15 19:14:30 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 65117 died, exit code: -11
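A note on reading that log line: if the logged value follows Python's multiprocessing convention (an assumption here, since the log does not say where the number comes from), a negative exit code -N means the worker was terminated by signal N. The helper below is a hypothetical sketch, not part of vLLM, that turns such a code into a signal name.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe an exit code that follows the multiprocessing convention."""
    if code < 0:
        # A negative exit code -N means the process was terminated by signal N.
        try:
            return f"killed by {signal.Signals(-code).name}"
        except ValueError:
            # The number does not map to a signal known on this platform.
            return f"killed by unknown signal {-code}"
    return f"exited with status {code}"


print(describe_exit_code(-11))
```

Under that assumption, -11 maps to SIGSEGV, i.e. the worker process hit a segmentation fault.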
@vllm devs,
Just checking in on this issue since a few of us are experiencing it. If there’s anything we can do to help move it forward, please let us know. Thanks for all your hard work!
Apologies, I had made some progress on this; I will get back to it today or tomorrow.
Emergency! Emergency! Emergency!
Folks, I got it working! Set the following two parameters at startup, for example:
llm = LLM(model=model_path, tokenizer=model_path, max_num_batched_tokens=32768, max_model_len=32768, tensor_parallel_size=8, trust_remote_code=True, load_format="auto", enforce_eager=True, ray_workers_use_nsight=True, distributed_executor_backend="ray")
Key parameters: ray_workers_use_nsight=True, distributed_executor_backend="ray"
The nsys version I used: https://developer.nvidia.com/downloads/assets/tools/secure/nsight-systems/2024_4/NsightSystems-linux-cli-public-2024.4.1.61-3431596.deb
vllm==0.5.4
I found that this bug exists from v0.4.3 onward; v0.4.2 is OK.
Thanks for the tip! But afterwards I hit a segmentation fault when running benchmark_serving. Have you run into that?
Your current environment
🐛 Describe the bug
When I use Nsight Systems to profile a program that runs on multiple GPUs, an error occurs.
Take examples/offline_inference.py: when I use the QWen/QWen-72B LLM, where the configuration is
and the running command is
nsys profile python offline_inference.py
, the error occurs. I'm sure the script itself has no bug because, when running python offline_inference.py
, there is no error. Furthermore, I suspected the bug might be specific to the multi-GPU environment, so I changed the configuration of LLM
and ran the script with
nsys profile python offline_inference.py
. There was no error.
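
The comparison described above (the same script run plainly and again under nsys profile) can be scripted so both exit codes are captured side by side. This is a minimal sketch: the function name is hypothetical, and it assumes nsys is on PATH and that offline_inference.py is in the working directory when used as in the commented example.

```python
import subprocess
import sys


def run_with_and_without_prefix(cmd, prefix):
    """Run cmd once as-is and once with prefix prepended; return both exit codes."""
    results = {}
    for label, argv in (("plain", list(cmd)), ("prefixed", list(prefix) + list(cmd))):
        # On POSIX, a negative return code -N means the child was killed by signal N.
        results[label] = subprocess.run(argv).returncode
    return results


# Assumed usage mirroring the report (requires nsys and the example script):
# codes = run_with_and_without_prefix(
#     [sys.executable, "offline_inference.py"], ["nsys", "profile"]
# )
# print(codes)
```

Note that the prefixed exit code is whatever the wrapping command reports, which may differ from the exit code of the Python process it launched.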