Open lmx760581375 opened 6 months ago
I found that it only occurs when tensor_parallel_size > 1; the synchronization across the parallel workers is faulty, which corresponds to comm.all_reduce(input, op).
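For reference, a minimal sketch of the setup that triggers this (the model path is a placeholder and the arguments mirror the traceback below; adjust to your environment):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/starcoder2-15b",  # placeholder: local StarCoder2 checkpoint
    tensor_parallel_size=2,           # the error only shows up with TP > 1
    dtype="float16",
    gpu_memory_utilization=0.9,
)
outputs = llm.generate(["def hello_world():"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```

With tensor_parallel_size=1 the all_reduce path is not exercised, which matches the observation above.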
Did you solve it? Same environment and same error. I set tensor_parallel_size=2.
Yeah, I found that this is related to the NCCL version: you have to use an NCCL build compiled for your CUDA version. My inference is that the NCCL torch expects cannot find the actual functions under /usr/local/nccl, which results in the CUDA graph computation error. You can download the matching NCCL version from NVIDIA's official website and reinstall it. I reinstalled an NCCL compiled for CUDA 11.8 and it worked.
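If it helps, here is a small sketch (assuming Linux and the usual libnccl.so.2 soname, which may differ on your system) to compare the NCCL version torch was built against with the one the dynamic loader actually resolves on the host:

```python
import ctypes
import torch

# NCCL version torch itself was built against (a tuple in recent torch builds).
print("torch NCCL:", torch.cuda.nccl.version())

# ncclGetVersion(int*) is part of the public NCCL API; "libnccl.so.2" is the
# usual soname on Linux and may differ on your install.
lib = ctypes.CDLL("libnccl.so.2")
ver = ctypes.c_int()
lib.ncclGetVersion(ctypes.byref(ver))
v = ver.value
# NCCL >= 2.9 encodes the version as major*10000 + minor*100 + patch.
print("host NCCL:", f"{v // 10000}.{(v % 10000) // 100}.{v % 100}")
```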
Thanks for your reply. My host NCCL is `2.15+cu118`, torch's NCCL version is `2.18.6` (from torch.cuda.nccl.version()), vLLM is `0.4.0`, and it reports `vLLM is using nccl==2.70.8`. Did you reinstall an NCCL build compiled for CUDA 11.8 and point to it via an environment variable? And thanks for sharing your experience.
When you start vLLM you will see two NCCL versions appear in the log: one reported by torch, and another from the pynccl wrapper, which in fact ends up calling your host's NCCL. If the log prints a version that is actually used but is not your host's version, there may be a problem with your environment variables, and there may be multiple NCCL installations on your host.
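A quick sketch to check for multiple NCCL installations on the host (the search paths are common defaults and may need adjusting for your system):

```python
import ctypes.util
import glob
import os

# Common locations plus anything on LD_LIBRARY_PATH; extend as needed.
search_dirs = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(":") if d]
search_dirs += ["/usr/lib", "/usr/lib64", "/usr/lib/x86_64-linux-gnu",
                "/usr/local/lib", "/usr/local/nccl/lib"]
search_dirs += glob.glob("/usr/local/cuda*/lib64")

found = sorted({p for d in search_dirs
                for p in glob.glob(os.path.join(d, "libnccl.so*"))})
print("libnccl copies found:")
for path in found:
    print("  ", path)

# What the loader would pick by default (None if it cannot find one).
print("loader resolves:", ctypes.util.find_library("nccl"))
```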
Thanks! There are 4 CUDA versions on my host, which caused the wrong NCCL version to be picked up for torch.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Your current environment
🐛 Describe the bug
When I run starcoder2, this error comes out:
2024-04-28 20:49:41,941 INFO worker.py:1752 -- Started a local Ray instance. INFO 04-28 20:49:43 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='/apdcephfs_cq10/share_1567347/share_info/llm_models/starcoder2-15b', tokenizer='/apdcephfs_cq10/share_1567347/share_info/llm_models/starcoder2-15b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0) /usr/local/python/lib/python3.8/site-packages/vllm/executor/ray_gpu_executor.py:87: UserWarning: Failed to get the IP address, using 0.0.0.0 by default.The value can be set by the environment variable HOST_IP. driver_ip = get_ip() (RayWorkerVllm pid=62031) /usr/local/python/lib/python3.8/site-packages/vllm/engine/ray_utils.py:48: UserWarning: Failed to get the IP address, using 0.0.0.0 by default.The value can be set by the environment variable HOST_IP. (RayWorkerVllm pid=62031) return get_ip() INFO 04-28 20:49:50 selector.py:40] Cannot use FlashAttention backend for Volta and Turing GPUs. INFO 04-28 20:49:50 selector.py:25] Using XFormers backend. (RayWorkerVllm pid=62111) INFO 04-28 20:49:52 selector.py:40] Cannot use FlashAttention backend for Volta and Turing GPUs. (RayWorkerVllm pid=62111) INFO 04-28 20:49:52 selector.py:25] Using XFormers backend. INFO 04-28 20:49:52 pynccl_utils.py:45] vLLM is using nccl==2.10.3 (RayWorkerVllm pid=62111) INFO 04-28 20:49:52 pynccl_utils.py:45] vLLM is using nccl==2.10.3 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO Bootstrap : Using eth1:9.91.2.209<0> ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1. ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO NET/Socket : Using [0]eth1:9.91.2.209<0> ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO Using network Socket ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO Bootstrap : Using eth1:9.91.2.209<0> ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO cudaDriverVersion 11080 NCCL version 2.18.6+cuda11.8 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1. 
ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO NET/Socket : Using [0]eth1:9.91.2.209<0> ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO Using network Socket ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO comm 0x53f905b0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1a000 commId 0x605ed9e94f174b - Init START ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO Channel 00/02 : 0 1 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO Channel 01/02 : 0 1 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO P2P Chunksize set to 131072 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO Connected all rings ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO Connected all trees ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62307 [0] NCCL INFO comm 0x53f905b0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1a000 commId 0x605ed9e94f174b - Init COMPLETE NCCL version 2.10.3+cuda11.0 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO Channel 00/02 : 0 1 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO Channel 01/02 : 0 1 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[1b000] via direct shared memory ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO Channel 01 : 0[1a000] -> 1[1b000] via direct shared memory ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO Connected all rings ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO Connected all trees ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO comm 0x5483be20 rank 0 nranks 2 cudaDev 0 busId 1a000 - Init COMPLETE (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO Bootstrap : Using eth1:9.91.2.209<0> (RayWorkerVllm pid=62111) 
ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1. (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO NET/Socket : Using [0]eth1:9.91.2.209<0> (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO Using network Socket (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO Channel 00 : 1[1b000] -> 0[1a000] via direct shared memory (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO Channel 01 : 1[1b000] -> 0[1a000] via direct shared memory (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO Connected all rings (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO Connected all trees (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO Launch mode Parallel (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO comm 0xbd04190 rank 1 nranks 2 cudaDev 1 busId 1b000 - Init COMPLETE INFO 04-28 20:50:02 model_runner.py:104] Loading model weights took 14.8672 GB ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO Using network Socket (RayWorkerVllm pid=62111) INFO 04-28 20:50:12 model_runner.py:104] Loading model weights took 14.8672 GB ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO comm 0x98440b0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1a000 commId 0x59afa392e79b7504 - Init START ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO Channel 00/02 : 0 1 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO Channel 01/02 : 0 1 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO P2P Chunksize set to 131072 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO Connected all rings ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO Connected all trees ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO threadThresholds 8/8/64 | 
16/8/64 | 512 | 512 ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer ts-1580615599d3449d98cf56a265c10977-worker-6:56890:62379 [0] NCCL INFO comm 0x98440b0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1a000 commId 0x59afa392e79b7504 - Init COMPLETE INFO 04-28 20:50:20 ray_gpu_executor.py:240] # GPU blocks: 15176, # CPU blocks: 6553 INFO 04-28 20:50:23 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. INFO 04-28 20:50:23 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing
`gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. (RayWorkerVllm pid=62111) INFO 04-28 20:50:23 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. (RayWorkerVllm pid=62111) INFO 04-28 20:50:23 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs`
as needed to decrease memory usage.ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] enqueue.cc:267 NCCL WARN Cuda failure 'dependency created on uncaptured work in another stream' ts-1580615599d3449d98cf56a265c10977-worker-6:56890:56890 [0] NCCL INFO enqueue.cc:1045 -> 1 Traceback (most recent call last): File "/usr/local/python/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 921, in capture hidden_states = self.model( File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(*args, kwargs) File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, *kwargs) File "/usr/local/python/lib/python3.8/site-packages/vllm/model_executor/models/starcoder2.py", line 260, in forward hidden_states = self.model(input_ids, positions, kv_caches, File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, kwargs) File "/usr/local/python/lib/python3.8/site-packages/vllm/model_executor/models/starcoder2.py", line 219, in forward hidden_states = self.embed_tokens(input_ids) File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(args, kwargs) File "/usr/local/python/lib/python3.8/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 107, in forward output = tensor_model_parallel_all_reduce(output_parallel) File "/usr/local/python/lib/python3.8/site-packages/vllm/model_executor/parallel_utils/communication_op.py", line 35, in tensor_model_parallel_all_reduce pynccl_utils.allreduce(input) File "/usr/local/python/lib/python3.8/site-packages/vllm/model_executor/parallel_utils/pynccl_utils.py", line 55, in all_reduce comm.allreduce(input, op) File "/usr/local/python/lib/python3.8/site-packages/vllm/model_executor/parallel_utils/pynccl.py", line 258, in all_reduce assert result == 0 AssertionError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "lua_test_file_gen_vllm.py", line 221, in <module>
main()
File "lua_test_file_gen_vllm.py", line 105, in main
llm = LLM(model=args.model, tensor_parallel_size=args.num_gpus, dtype="float16", gpu_memory_utilization=0.9)
File "/usr/local/python/lib/python3.8/site-packages/vllm/entrypoints/llm.py", line 112, in init
self.llm_engine = LLMEngine.from_engine_args(
File "/usr/local/python/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 196, in from_engine_args
engine = cls(
File "/usr/local/python/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 110, in init
self.model_executor = executor_class(model_config, cache_config,
File "/usr/local/python/lib/python3.8/site-packages/vllm/executor/ray_gpu_executor.py", line 65, in init
self._init_cache()
File "/usr/local/python/lib/python3.8/site-packages/vllm/executor/ray_gpu_executor.py", line 253, in _init_cache
self._run_workers("warm_up_model")
File "/usr/local/python/lib/python3.8/site-packages/vllm/executor/ray_gpu_executor.py", line 324, in _run_workers
driver_worker_output = getattr(self.driver_worker,
File "/usr/local/python/lib/python3.8/site-packages/vllm/worker/worker.py", line 167, in warm_up_model
self.model_runner.capture_model(self.gpu_cache)
File "/usr/local/python/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/python/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 854, in capture_model
graph_runner.capture(
File "/usr/local/python/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 921, in capture
hidden_states = self.model(
File "/usr/local/python/lib/python3.8/site-packages/torch/cuda/graphs.py", line 197, in exit
self.cuda_graph.capture_end()
File "/usr/local/python/lib/python3.8/site-packages/torch/cuda/graphs.py", line 88, in capture_end
super().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA`
to enable device-side assertions.(RayWorkerVllm pid=62111) (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] enqueue.cc:267 NCCL WARN Cuda failure 'dependency created on uncaptured work in another stream' (RayWorkerVllm pid=62111) ts-1580615599d3449d98cf56a265c10977-worker-6:62111:62111 [1] NCCL INFO enqueue.cc:1045 -> 1 (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] Error executing method warm_up_model. This might cause deadlock in distributed execution. (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] Traceback (most recent call last): (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 921, in capture (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] hidden_states = self.model( (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] return self._call_impl(*args, kwargs) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] return forward_call(*args, *kwargs) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/vllm/model_executor/models/starcoder2.py", line 260, in forward (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] hidden_states = self.model(input_ids, positions, kv_caches, (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] return self._call_impl(args, kwargs) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] return forward_call(*args, kwargs) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/vllm/model_executor/models/starcoder2.py", line 219, in forward (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] hidden_states = self.embed_tokens(input_ids) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] return self._call_impl(*args, *kwargs) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] return forward_call(args, kwargs) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 107, in forward (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] output = tensor_model_parallel_all_reduce(output_parallel) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File 
"/usr/local/python/lib/python3.8/site-packages/vllm/model_executor/parallel_utils/communication_op.py", line 35, in tensor_model_parallel_all_reduce (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] pynccl_utils.allreduce(input) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/vllm/model_executor/parallel_utils/pynccl_utils.py", line 55, in all_reduce (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] comm.allreduce(input, op) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/vllm/model_executor/parallel_utils/pynccl.py", line 258, in all_reduce (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] assert result == 0 (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] AssertionError (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] During handling of the above exception, another exception occurred: (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] Traceback (most recent call last): (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/vllm/engine/ray_utils.py", line 37, in execute_method (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] return executor(*args, *kwargs) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/vllm/worker/worker.py", line 167, in warm_up_model (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] self.model_runner.capture_model(self.gpu_cache) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] return func(args, **kwargs) (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 854, in capture_model (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] graph_runner.capture( (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 921, in capture (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] hidden_states = self.model( (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/torch/cuda/graphs.py", line 197, in exit (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] self.cuda_graph.capture_end() (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] File "/usr/local/python/lib/python3.8/site-packages/torch/cuda/graphs.py", line 88, in capture_end (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] super().capture_end() (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] RuntimeError: CUDA error: operation failed due to a previous error during capture (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] For debugging consider passing CUDA_LAUNCH_BLOCKING=1. 
(RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] Compile with `TORCH_USE_CUDA_DSA`
to enable device-side assertions. (RayWorkerVllm pid=62111) ERROR 04-28 20:50:23 ray_utils.py:44] (RayWorkerVllm pid=62111) /usr/local/python/lib/python3.8/site-packages/vllm/engine/ray_utils.py:48: UserWarning: Failed to get the IP address, using 0.0.0.0 by default.The value can be set by the environment variable HOST_IP. (RayWorkerVllm pid=62111) return get_ip()
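As the log message itself suggests, the failure happens during CUDA graph capture, so a temporary workaround while the NCCL mismatch is being fixed is to run in eager mode (a sketch; the model path is a placeholder):

```python
from vllm import LLM

llm = LLM(
    model="/path/to/starcoder2-15b",  # placeholder model path
    tensor_parallel_size=2,
    dtype="float16",
    enforce_eager=True,               # skip CUDA graph capture entirely
)
```

Eager mode is slower per token, but it avoids the capture path where the pynccl all_reduce assertion fires.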