laithsakka opened 1 month ago
The undefined symbol cuTensorMapEncodeTiled is solved by the workaround you suggested, `export LD_PRELOAD=/usr/lib64/libcuda.so`. For reference, `nm -C _C.abi3.so | grep "cuTensorMap"` confirms the symbol is undefined in the extension: `U cuTensorMapEncodeTiled`.
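For anyone hitting the same thing, a quick way to double-check that the preloaded driver library really exports the symbol the extension leaves undefined (my own sketch, using the same path as the workaround):

```python
import ctypes

# Attribute lookup on a CDLL does a dlsym(); hasattr() is False if the symbol is missing.
libcuda = ctypes.CDLL("/usr/lib64/libcuda.so")
print(hasattr(libcuda, "cuTensorMapEncodeTiled"))  # True on CUDA 12+ drivers
```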
Next problem: importing vllm now fails at

  File "/home/lsakka/vllm/vllm/logging/__init__.py", line 1, in <module>
    import logging

The `import logging` goes to `vllm/logging` instead of the Python built-in logging module, which I think is caused by an incorrect Python path.
I resolved the issue: I was running Python from the vllm/vllm folder, so Python was picking up the local package instead of the built-in one. For now I can import vllm with no issue. Running `python benchmark_latency.py --model CohereForAI/c4ai-command-r-v01 -tp=4 --batch-size=1` next, fingers crossed!
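For reference, a quick way to check which `logging` package Python is actually picking up (my sketch):

```python
import logging
import sys

# When started from vllm/vllm, the current directory shadows the stdlib and
# this prints .../vllm/vllm/logging/__init__.py instead of the stdlib path.
print(logging.__file__)
print(sys.path[:3])  # the script dir / CWD comes first, which causes the shadowing
```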
I get the following messages, and vLLM gets stuck when running the benchmark:
Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3656580) ERROR 07-14 19:17:50 pynccl_wrapper.py:196] Failed to load NCCL library from libnccl.so.2 .It is expected if you are not running on NVIDIA/AMD GPUs.Otherwise, the nccl library might not exist, be corrupted or it does not support the current platform Linux-5.12.0-0_fbk16_zion_7661_geb00762ce6d2-x86_64-with-glibc2.34.If you already have the library, please set the environment variable VLLM_NCCL_SO_PATH to point to the correct nccl library path.
th1991) [lsakka@devgpu002.ash8 ~/vllm/benchmarks (main)]$ python benchmark_latency.py --model CohereForAI/c4ai-command-r-v01 -tp=4 --batch-size=1
WARNING 07-14 19:17:44 _custom_ops.py:14] Failed to import from vllm._C with ImportError('/home/lsakka/vllm/vllm/_C.abi3.so: undefined symbol: cuTensorMapEncodeTiled')
Namespace(model='CohereForAI/c4ai-command-r-v01', speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, tokenizer=None, quantization=None, tensor_parallel_size=4, input_len=32, output_len=128, batch_size=1, n=1, use_beam_search=False, num_iters_warmup=10, num_iters=30, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=False, kv_cache_dtype='auto', quantization_param_path=None, profile=False, profile_result_dir=None, device='auto', block_size=16, enable_chunked_prefill=False, enable_prefix_caching=False, use_v2_block_manager=False, ray_workers_use_nsight=False, download_dir=None, output_json=None, gpu_memory_utilization=0.9, load_format='auto', distributed_executor_backend=None, otlp_traces_endpoint=None)
INFO 07-14 19:17:45 config.py:696] Defaulting to use mp for distributed inference
INFO 07-14 19:17:45 llm_engine.py:174] Initializing an LLM engine (v0.5.1) with config: model='CohereForAI/c4ai-command-r-v01', speculative_config=None, tokenizer='CohereForAI/c4ai-command-r-v01', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=CohereForAI/c4ai-command-r-v01, use_v2_block_manager=False, enable_prefix_caching=False)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-14 19:17:46 selector.py:191] Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. `pip install vllm-flash-attn` for better performance.
INFO 07-14 19:17:46 selector.py:53] Using XFormers backend.
(VllmWorkerProcess pid=3656571) INFO 07-14 19:17:46 selector.py:191] Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. `pip install vllm-flash-attn` for better performance.
(VllmWorkerProcess pid=3656580) INFO 07-14 19:17:46 selector.py:191] Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. `pip install vllm-flash-attn` for better performance.
(VllmWorkerProcess pid=3656571) INFO 07-14 19:17:46 selector.py:53] Using XFormers backend.
(VllmWorkerProcess pid=3656580) INFO 07-14 19:17:46 selector.py:53] Using XFormers backend.
(VllmWorkerProcess pid=3656573) INFO 07-14 19:17:46 selector.py:191] Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. `pip install vllm-flash-attn` for better performance.
(VllmWorkerProcess pid=3656573) INFO 07-14 19:17:46 selector.py:53] Using XFormers backend.
/home/lsakka/xformers/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/home/lsakka/xformers/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
/home/lsakka/xformers/xformers/ops/swiglu_op.py:127: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@torch.cuda.amp.custom_fwd
/home/lsakka/xformers/xformers/ops/swiglu_op.py:148: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
@torch.cuda.amp.custom_bwd
(VllmWorkerProcess pid=3656580) /home/lsakka/xformers/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=3656580) @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=3656580) /home/lsakka/xformers/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=3656580) @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=3656580) /home/lsakka/xformers/xformers/ops/swiglu_op.py:127: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
(VllmWorkerProcess pid=3656580) @torch.cuda.amp.custom_fwd
(VllmWorkerProcess pid=3656580) /home/lsakka/xformers/xformers/ops/swiglu_op.py:148: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
(VllmWorkerProcess pid=3656580) @torch.cuda.amp.custom_bwd
(VllmWorkerProcess pid=3656580) INFO 07-14 19:17:49 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3656571) /home/lsakka/xformers/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=3656571) @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=3656571) /home/lsakka/xformers/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=3656571) @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=3656573) /home/lsakka/xformers/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=3656573) @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=3656573) /home/lsakka/xformers/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=3656573) @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=3656571) /home/lsakka/xformers/xformers/ops/swiglu_op.py:127: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
(VllmWorkerProcess pid=3656571) @torch.cuda.amp.custom_fwd
(VllmWorkerProcess pid=3656571) /home/lsakka/xformers/xformers/ops/swiglu_op.py:148: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
(VllmWorkerProcess pid=3656571) @torch.cuda.amp.custom_bwd
(VllmWorkerProcess pid=3656571) INFO 07-14 19:17:49 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3656573) /home/lsakka/xformers/xformers/ops/swiglu_op.py:127: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
(VllmWorkerProcess pid=3656573) @torch.cuda.amp.custom_fwd
(VllmWorkerProcess pid=3656573) /home/lsakka/xformers/xformers/ops/swiglu_op.py:148: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
(VllmWorkerProcess pid=3656573) @torch.cuda.amp.custom_bwd
(VllmWorkerProcess pid=3656573) INFO 07-14 19:17:49 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3656571) INFO 07-14 19:17:50 utils.py:741] Found nccl from library libnccl.so.2
INFO 07-14 19:17:50 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3656571) ERROR 07-14 19:17:50 pynccl_wrapper.py:196] Failed to load NCCL library from libnccl.so.2 .It is expected if you are not running on NVIDIA/AMD GPUs.Otherwise, the nccl library might not exist, be corrupted or it does not support the current platform Linux-5.12.0-0_fbk16_zion_7661_geb00762ce6d2-x86_64-with-glibc2.34.If you already have the library, please set the environment variable VLLM_NCCL_SO_PATH to point to the correct nccl library path.
(VllmWorkerProcess pid=3656573) INFO 07-14 19:17:50 utils.py:741] Found nccl from library libnccl.so.2
ERROR 07-14 19:17:50 pynccl_wrapper.py:196] Failed to load NCCL library from libnccl.so.2 .It is expected if you are not running on NVIDIA/AMD GPUs.Otherwise, the nccl library might not exist, be corrupted or it does not support the current platform Linux-5.12.0-0_fbk16_zion_7661_geb00762ce6d2-x86_64-with-glibc2.34.If you already have the library, please set the environment variable VLLM_NCCL_SO_PATH to point to the correct nccl library path.
(VllmWorkerProcess pid=3656573) ERROR 07-14 19:17:50 pynccl_wrapper.py:196] Failed to load NCCL library from libnccl.so.2 .It is expected if you are not running on NVIDIA/AMD GPUs.Otherwise, the nccl library might not exist, be corrupted or it does not support the current platform Linux-5.12.0-0_fbk16_zion_7661_geb00762ce6d2-x86_64-with-glibc2.34.If you already have the library, please set the environment variable VLLM_NCCL_SO_PATH to point to the correct nccl library path.
(VllmWorkerProcess pid=3656580) INFO 07-14 19:17:50 utils.py:741] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3656580) ERROR 07-14 19:17:50 pynccl_wrapper.py:196] Failed to load NCCL library from libnccl.so.2 .It is expected if you are not running on NVIDIA/AMD GPUs.Otherwise, the nccl library might not exist, be corrupted or it does not support the current platform Linux-5.12.0-0_fbk16_zion_7661_geb00762ce6d2-x86_64-with-glibc2.34.If you already have the library, please set the environment variable VLLM_NCCL_SO_PATH to point to the correct nccl library path.
How does your nightly PyTorch use NCCL? Static link or dynamic link? Which NCCL does it use?
I fixed that by setting VLLM_NCCL_SO_PATH to the NCCL library from my PyTorch build:
VLLM_NCCL_SO_PATH=/home/lsakka/pytorch/build/nccl/lib/libnccl.so.2 python benchmark_latency.py --model CohereForAI/c4ai-command-r-v01 -tp=4 --batch-size=1
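A minimal way to confirm that path actually loads as a working NCCL build (my own sketch; this is roughly what the pynccl wrapper does via ctypes):

```python
import ctypes

# Same path as in the command above.
nccl = ctypes.CDLL("/home/lsakka/pytorch/build/nccl/lib/libnccl.so.2")
version = ctypes.c_int()
nccl.ncclGetVersion(ctypes.byref(version))  # ncclResult_t ncclGetVersion(int *version)
print(version.value)  # e.g. 22105 for NCCL 2.21.5
```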
The benchmark still gets stuck, though. I have seen internal posts about NCCL hanging and having to revert; not sure if that's related. The current version I use is NCCL 2.21.5+cuda12.1.
Can you try to follow the debugging guide at https://docs.vllm.ai/en/latest/getting_started/debugging.html? There is a sanity check script to help you locate the problem.
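The sanity check there boils down to a bare torch.distributed all_reduce across the GPUs. A rough sketch of the idea (not the exact script from the docs; `check_nccl.py` is a made-up file name):

```python
# Run with: torchrun --nproc-per-node=4 check_nccl.py
import torch
import torch.distributed as dist

# torchrun sets RANK/WORLD_SIZE/MASTER_ADDR, so the env:// defaults work here.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Each rank contributes 1; a healthy NCCL setup sums them across all ranks.
data = torch.ones(1, device="cuda")
dist.all_reduce(data)
assert data.item() == dist.get_world_size(), "all_reduce returned the wrong sum"
print(f"rank {dist.get_rank()}: NCCL all_reduce OK")

dist.destroy_process_group()
```

If this hangs too, the problem is in the environment (NCCL/network) rather than in vLLM itself.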
I used VLLM_TRACE_FUNCTION and it gets stuck at:

2024-07-16 13:56:37.432732 Call to wait_until_ready in /home/lsakka/vllm/vllm/distributed/device_communicators/shm_broadcast.py:284 from create_from_process_group in /home/lsakka/vllm/vllm/distributed/device_communicators/shm_broadcast.py:489

Note the following warning, which I am unsure is related:

(VllmWorkerProcess pid=1194815) WARNING 07-16 14:31:56 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=1194818) WARNING 07-16 14:31:56 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=1194814) WARNING 07-16 14:31:56 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 07-16 14:31:56 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
The warning is irrelevant.

Can you stably reproduce this hang? If so, you can remove VLLM_TRACE_FUNCTION and set a breakpoint at /home/lsakka/vllm/vllm/distributed/device_communicators/shm_broadcast.py:284 to see which line hangs. That part of the code sets up a publish-subscribe message queue for communication.
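If you don't want to attach a debugger, another option (my own suggestion, not something from the guide) is to have the stuck process dump its stacks after a timeout, e.g. near the top of benchmark_latency.py (this only covers the main process; the worker processes would need the same call):

```python
# Sketch: dump all thread stacks if the process is still running after 60s,
# so the traceback shows exactly where wait_until_ready is blocked.
import faulthandler
faulthandler.dump_traceback_later(60, exit=False)
```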
+1, hung there at wait_until_ready() as well. vLLM 0.5.2, run in Docker on the host network.
It turns out that using the host network on an IPv6-only machine is problematic. A bridged network works just fine on my side.
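In case someone wants to confirm they are in the same situation, one quick check (my sketch) is to see what the container's hostname resolves to; `gethostbyname()` is IPv4-only and fails on a host whose name only has IPv6 addresses:

```python
import socket

# On an IPv6-only host this raises socket.gaierror, a hint that host
# networking may be problematic for the startup handshake.
print(socket.gethostname())
print(socket.gethostbyname(socket.gethostname()))
```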
Your current environment
Why is it important: This is a prerequisite for the work on enabling torch.compile in vLLM; we need to be able to build vLLM with nightly PyTorch so that we can iterate on changes and try features that are not released yet.
current error: Failed to import from vllm._C with ImportError('/home/lsakka/vllm/vllm/_C.abi3.so: undefined symbol: cuTensorMapEncodeTiled')
Any idea what this could be? It was mentioned that vLLM was struggling to upgrade even one version step.
diff file
How you are installing vllm
What did I do:
current error: Failed to import from vllm._C with ImportError('/home/lsakka/vllm/vllm/_C.abi3.so: undefined symbol: cuTensorMapEncodeTiled')