Open zzlgreat opened 6 months ago
@zzlgreat same error for me too.
I get the same first line of the error above: "Error executing method determine_num_available_blocks. This might cause deadlock in distributed execution". If I set the following environment variable, vLLM loads with no errors: NCCL_SOCKET_IFNAME=eth0
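The workaround above can be sketched as a small launcher helper. This is a minimal sketch under stated assumptions, not a confirmed fix for the original report: `build_launch_env` and `build_server_command` are hypothetical helper names, `eth0` is simply the interface named in the comment and will differ per machine, and also setting `GLOO_SOCKET_IFNAME` is an optional extra suggested only by the Gloo warning in the posted log.

```python
import os
from typing import Dict, List, Mapping, Optional


def build_launch_env(
    ifname: str,
    base_env: Optional[Mapping[str, str]] = None,
    also_set_gloo: bool = False,
) -> Dict[str, str]:
    """Return a copy of the environment with the socket interface pinned.

    `ifname` is machine specific ("eth0" in the comment above); check the
    output of `ip addr` for the right name on your host.
    """
    # Copy so the caller's environment is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_SOCKET_IFNAME"] = ifname
    if also_set_gloo:
        # Assumption: the Gloo warning in the log suggests this variable.
        env["GLOO_SOCKET_IFNAME"] = ifname
    return env


def build_server_command(model_path: str, tensor_parallel_size: int) -> List[str]:
    """Build the same server command line used in the original report."""
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_path,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--enforce-eager",
    ]
```

To launch, pass both results to `subprocess.run(build_server_command(model_path, 4), env=build_launch_env("eth0"), check=True)`. Building the environment as a copy keeps the change scoped to the server process instead of the whole shell session.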
Same error for me on amd64. Maybe a particular CUDA toolkit is needed? Or maybe amd64 is not supported...
Your current environment
🐛 Describe the bug
(vllm) root@4090:/DATA4T/text-generation-webui/vllm# python -m vllm.entrypoints.openai.api_server --model /DATA4T/text-generation-webui/models/c4ai-command-r-plus-GPTQ --tensor-parallel-size 4 --enforce-eager
INFO 04-15 07:27:04 pynccl.py:58] Loading nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 04-15 07:27:05 api_server.py:149] vLLM API server version 0.4.0.post1
INFO 04-15 07:27:05 api_server.py:150] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/DATA4T/text-generation-webui/models/c4ai-command-r-plus-GPTQ', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=True, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, tensorizer_uri=None, verify_hash=False, encryption_keyfile=None, num_readers=1, s3_access_key_id=None, s3_secret_access_key=None, s3_endpoint=None, vllm_tensorized=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 04-15 07:27:05 config.py:225] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-04-15 07:27:07,765 INFO worker.py:1752 -- Started a local Ray instance.
INFO 04-15 07:27:08 llm_engine.py:82] Initializing an LLM engine (v0.4.0.post1) with config: model='/DATA4T/text-generation-webui/models/c4ai-command-r-plus-GPTQ', speculative_config=None, tokenizer='/DATA4T/text-generation-webui/models/c4ai-command-r-plus-GPTQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(pid=1812) INFO 04-15 07:27:10 pynccl.py:58] Loading nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
(pid=2303) INFO 04-15 07:27:16 pynccl.py:58] Loading nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1 [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
INFO 04-15 07:27:16 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
INFO 04-15 07:27:16 selector.py:33] Using XFormers backend.
(RayWorkerVllm pid=1969) INFO 04-15 07:27:16 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
(RayWorkerVllm pid=1969) INFO 04-15 07:27:16 selector.py:33] Using XFormers backend.
(RayWorkerVllm pid=1969) [rank1]:[W ProcessGroupGloo.cpp:721] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[rank0]:[W ProcessGroupGloo.cpp:721] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
INFO 04-15 07:27:17 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=1969) INFO 04-15 07:27:17 pynccl_utils.py:45] vLLM is using nccl==2.18.1
INFO 04-15 07:27:19 custom_all_reduce.py:152] NVLink detection failed with message "Not Supported". This is normal if your machine has no NVLink equipped
WARNING 04-15 07:27:19 custom_all_reduce.py:58] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerVllm pid=1969) INFO 04-15 07:27:19 custom_all_reduce.py:152] NVLink detection failed with message "Not Supported". This is normal if your machine has no NVLink equipped
(RayWorkerVllm pid=1969) WARNING 04-15 07:27:19 custom_all_reduce.py:58] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 04-15 07:27:23 model_runner.py:169] Loading model weights took 14.3474 GB
(RayWorkerVllm pid=1969) INFO 04-15 07:27:25 model_runner.py:169] Loading model weights took 14.3474 GB
(RayWorkerVllm pid=2303) INFO 04-15 07:27:16 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance. [repeated 2x across cluster]
(RayWorkerVllm pid=2303) INFO 04-15 07:27:16 selector.py:33] Using XFormers backend. [repeated 2x across cluster]
(RayWorkerVllm pid=2303) INFO 04-15 07:27:17 pynccl_utils.py:45] vLLM is using nccl==2.18.1 [repeated 2x across cluster]
(RayWorkerVllm pid=2303) INFO 04-15 07:27:19 custom_all_reduce.py:152] NVLink detection failed with message "Not Supported". This is normal if your machine has no NVLink equipped [repeated 2x across cluster]
(RayWorkerVllm pid=2303) WARNING 04-15 07:27:19 custom_all_reduce.py:58] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly. [repeated 2x across cluster]
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] Error executing method determine_num_available_blocks. This might cause deadlock in distributed execution.
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] Traceback (most recent call last):
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/engine/ray_utils.py", line 43, in execute_method
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return executor(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return func(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/worker/worker.py", line 134, in determine_num_available_blocks
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] self.model_runner.profile_run()
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return func(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/worker/model_runner.py", line 918, in profile_run
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] self.execute_model(seqs, kv_caches)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return func(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/worker/model_runner.py", line 839, in execute_model
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] hidden_states = model_executable(**execute_model_kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return self._call_impl(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return forward_call(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return func(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/model_executor/models/commandr.py", line 320, in forward
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] hidden_states = self.model(input_ids, positions, kv_caches,
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return self._call_impl(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return forward_call(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/model_executor/models/commandr.py", line 286, in forward
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] hidden_states, residual = layer(
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return self._call_impl(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return forward_call(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/model_executor/models/commandr.py", line 243, in forward
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] hidden_states_attention = self.self_attn(
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return self._call_impl(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return forward_call(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/model_executor/models/commandr.py", line 208, in forward
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] qkv, _ = self.qkv_proj(hidden_states)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return self._call_impl(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return forward_call(*args, **kwargs)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/model_executor/layers/linear.py", line 218, in forward
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] output_parallel = self.linear_method.apply_weights(self, input_, bias)
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/model_executor/layers/quantization/gptq.py", line 214, in apply_weights
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] output = ops.gptq_gemm(reshaped_x, layer.qweight, layer.qzeros,
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] File "/DATA4T/text-generation-webui/vllm/vllm/_custom_ops.py", line 133, in gptq_gemm
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] return vllm_ops.gptq_gemm(a, b_q_weight, b_gptq_qzeros, b_gptq_scales,
(RayWorkerVllm pid=1969) ERROR 04-15 07:27:27 ray_utils.py:50] RuntimeError: Unknown layout
Traceback (most recent call last):
File "/root/anaconda3/envs/vllm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/anaconda3/envs/vllm/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/DATA4T/text-generation-webui/vllm/vllm/entrypoints/openai/api_server.py", line 157, in <module>
engine = AsyncLLMEngine.from_engine_args(
File "/DATA4T/text-generation-webui/vllm/vllm/engine/async_llm_engine.py", line 347, in from_engine_args
engine = cls(
File "/DATA4T/text-generation-webui/vllm/vllm/engine/async_llm_engine.py", line 311, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/DATA4T/text-generation-webui/vllm/vllm/engine/async_llm_engine.py", line 421, in _init_engine
return engine_class(*args, **kwargs)
File "/DATA4T/text-generation-webui/vllm/vllm/engine/llm_engine.py", line 133, in __init__
self._initialize_kv_caches()
File "/DATA4T/text-generation-webui/vllm/vllm/engine/llm_engine.py", line 193, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
File "/DATA4T/text-generation-webui/vllm/vllm/executor/ray_gpu_executor.py", line 215, in determine_num_available_blocks
num_blocks = self._run_workers("determine_num_available_blocks", )
File "/DATA4T/text-generation-webui/vllm/vllm/executor/ray_gpu_executor.py", line 313, in _run_workers
driver_worker_output = getattr(self.driver_worker,
File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/DATA4T/text-generation-webui/vllm/vllm/worker/worker.py", line 134, in determine_num_available_blocks
self.model_runner.profile_run()
File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/DATA4T/text-generation-webui/vllm/vllm/worker/model_runner.py", line 918, in profile_run
self.execute_model(seqs, kv_caches)
File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/DATA4T/text-generation-webui/vllm/vllm/worker/model_runner.py", line 839, in execute_model
hidden_states = model_executable(**execute_model_kwargs)
File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/DATA4T/text-generation-webui/vllm/vllm/model_executor/models/commandr.py", line 320, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/DATA4T/text-generation-webui/vllm/vllm/model_executor/models/commandr.py", line 286, in forward
hidden_states, residual = layer(
File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/DATA4T/text-generation-webui/vllm/vllm/model_executor/models/commandr.py", line 243, in forward
hidden_states_attention = self.self_attn(
File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/DATA4T/text-generation-webui/vllm/vllm/model_executor/models/commandr.py", line 208, in forward
qkv, _ = self.qkv_proj(hidden_states)
File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/DATA4T/text-generation-webui/vllm/vllm/model_executor/layers/linear.py", line 218, in forward
output_parallel = self.linear_method.apply_weights(self, input_, bias)
File "/DATA4T/text-generation-webui/vllm/vllm/model_executor/layers/quantization/gptq.py", line 214, in apply_weights
output = ops.gptq_gemm(reshaped_x, layer.qweight, layer.qzeros,
File "/DATA4T/text-generation-webui/vllm/vllm/_custom_ops.py", line 133, in gptq_gemm
return vllm_ops.gptq_gemm(a, b_q_weight, b_gptq_qzeros, b_gptq_scales,
RuntimeError: Unknown layout