vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Starting vLLM with Docker, with host_IP configured, still gives [W socket.cpp:663] [c10d] The client socket has failed to connect to [::ffff:172.16.8.232]:39623 (errno: 110 - Connection timed out) #3771

Open · huyang19881115 opened 5 months ago

huyang19881115 commented 5 months ago

Your current environment

Collecting environment information...
PyTorch version: 1.12.1+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.16 (default, Mar 2 2023, 03:21:46) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-101-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 12.3.107
CUDA_MODULE_LOADING set to:
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4090
Nvidia driver version: 550.54.14
cuDNN version: Probably one of the following:
/usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudnn.so.8.8.0
/usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.8.0
/usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.8.0
/usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.8.0
/usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.8.0
/usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.8.0
/usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.8.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 25
Model: 33
Model name: AMD Ryzen 9 5950X 16-Core Processor
Stepping: 0
Frequency boost: enabled
CPU MHz: 2200.000
CPU max MHz: 3400.0000
CPU min MHz: 2200.0000
BogoMIPS: 6800.07
Virtualization: AMD-V
L1d cache: 512 KiB
L1i cache: 512 KiB
L2 cache: 8 MiB
L3 cache: 64 MiB
NUMA node0 CPU(s): 0-31
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] pytorch-fid==0.3.0
[pip3] pytorch-lightning==1.5.9
[pip3] torch==1.12.1+cu113
[pip3] torch-fidelity==0.3.0
[pip3] torchmetrics==0.6.0
[pip3] torchvision==0.13.1+cu113
[pip3] triton==2.2.0
[conda] No relevant packages
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0    X       0-31            0               N/A

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

version: '3.9'
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: qwen1.5
    ulimits:
      stack: 67108864
      memlock: -1
    environment:

After startup it just hangs:

huyang19881115 commented 5 months ago

qwen1.5 | INFO 04-01 08:52:19 api_server.py:148] vLLM API server version 0.4.0
qwen1.5 | INFO 04-01 08:52:19 api_server.py:149] args: Namespace(host='0.0.0.0', port=8009, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name='qwen', lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/Qwen1.5-14B-Chat-GPTQ-Int4', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='float16', kv_cache_dtype='auto', max_model_len=10240, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, seed=0, swap_space=4, gpu_memory_utilization=0.95, forced_num_gpu_blocks=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization='gptq', enforce_eager=True, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
qwen1.5 | WARNING 04-01 08:52:19 config.py:208] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
qwen1.5 | INFO 04-01 08:52:19 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model='/Qwen1.5-14B-Chat-GPTQ-Int4', tokenizer='/Qwen1.5-14B-Chat-GPTQ-Int4', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=10240, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
qwen1.5 | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
qwen1.5 | INFO 04-01 08:52:19 selector.py:16] Using FlashAttention backend.
qwen1.5 | [W socket.cpp:663] [c10d] The client socket has failed to connect to [bonc-System-Product-Name]:46723 (errno: 22 - Invalid argument).
qwen1.5 | [W socket.cpp:663] [c10d] The IPv4 network addresses of (fe80::c366:3233:9937:a509, 46723) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
qwen1.5 | [W socket.cpp:663] [c10d] The IPv4 network addresses of (fe80::c366:3233:9937:a509, 46723) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
qwen1.5 | [W socket.cpp:663] [c10d] The IPv4 network addresses of (fe80::c366:3233:9937:a509, 46723) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
qwen1.5 | [W socket.cpp:663] [c10d] The IPv4 network addresses of (fe80::c366:3233:9937:a509, 46723) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
qwen1.5 | [W socket.cpp:663] [c10d] The IPv4 network addresses of (fe80::c366:3233:9937:a509, 46723) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
qwen1.5 | [W socket.cpp:663] [c10d] The IPv4 network addresses of (fe80::c366:3233:9937:a509, 46723) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
qwen1.5 | [W socket.cpp:663] [c10d] The IPv4 network addresses of (fe80::c366:3233:9937:a509, 46723) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
qwen1.5 | [W socket.cpp:663] [c10d] The IPv4 network addresses of (fe80::c366:3233:9937:a509, 46723) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
qwen1.5 | [W socket.cpp:663] [c10d] The IPv4 network addresses of (fe80::c366:3233:9937:a509, 46723) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
qwen1.5 | [W socket.cpp:663] [c10d] The IPv4 network addresses of (fe80::c366:3233:9937:a509, 46723) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
qwen1.5 | [W socket.cpp:663] [c10d] The IPv4 network addresses of (fe80::c366:3233:9937:a509, 46723) cannot be retrieved (gai error: -9 - Address family for hostname not supported).
^CGracefully stopping... (press Ctrl+C again to force)
[+] Stopping 0/1
 ⠴ Container qwen1.5  Stopping    7.6s
[+] Stopping 1/1
 ✔ Container qwen1.5  Stopped     7.8s
canceled
(base) bonc@bonc-System-Product-Name:/data/llmservice/qwen_docker/vllm$ vim docker-compose.yml
(base) bonc@bonc-System-Product-Name:/data/llmservice/qwen_docker/vllm$ docker compose up
[+] Running 1/0
 ✔ Container qwen1.5  Recreated   0.0s
Attaching to qwen1.5
qwen1.5 | INFO 04-01 08:53:07 api_server.py:148] vLLM API server version 0.4.0
qwen1.5 | INFO 04-01 08:53:07 api_server.py:149] args: Namespace(host='0.0.0.0', port=8009, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name='qwen', lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/Qwen1.5-14B-Chat-GPTQ-Int4', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='float16', kv_cache_dtype='auto', max_model_len=10240, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, seed=0, swap_space=4, gpu_memory_utilization=0.95, forced_num_gpu_blocks=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization='gptq', enforce_eager=True, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
qwen1.5 | WARNING 04-01 08:53:07 config.py:208] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
qwen1.5 | INFO 04-01 08:53:07 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model='/Qwen1.5-14B-Chat-GPTQ-Int4', tokenizer='/Qwen1.5-14B-Chat-GPTQ-Int4', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=10240, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
qwen1.5 | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
qwen1.5 | INFO 04-01 08:53:07 selector.py:16] Using FlashAttention backend.
qwen1.5 | [W socket.cpp:663] [c10d] The client socket has failed to connect to [::ffff:172.16.8.232]:45211 (errno: 110 - Connection timed out).

huyang19881115 commented 5 months ago

It would be great if an official docker-compose.yml for the vLLM image could be published. The currently documented startup command isn't much of a reference, and I can't get it to start under Docker here.
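For reference, a minimal docker-compose.yml sketch (not an official file): it assumes the vllm/vllm-openai image's default entrypoint is the OpenAI-compatible API server, mirrors the CLI flags visible in the logs above, and uses a placeholder host path for the model mount. `network_mode: host` is one way to sidestep the c10d rendezvous problems seen above, since the container then shares the host's interfaces and hostname.

```yaml
version: '3.9'
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: qwen1.5
    # Sharing the host network stack avoids the container-side hostname/IP
    # resolution issues behind the c10d warnings; no ports: mapping needed.
    network_mode: host
    ipc: host
    ulimits:
      stack: 67108864
      memlock: -1
    volumes:
      # Placeholder path: mount the directory containing the downloaded model.
      - /data/models/Qwen1.5-14B-Chat-GPTQ-Int4:/Qwen1.5-14B-Chat-GPTQ-Int4
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    # The image's entrypoint already launches the API server; these
    # arguments mirror the Namespace printed in the logs above.
    command: >
      --model /Qwen1.5-14B-Chat-GPTQ-Int4
      --served-model-name qwen
      --host 0.0.0.0
      --port 8009
      --quantization gptq
      --dtype float16
      --max-model-len 10240
      --gpu-memory-utilization 0.95
      --trust-remote-code
      --enforce-eager
```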

linpan commented 3 months ago

The problem is that the server needs IPv6 enabled.
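That matches the logs: the hostname bonc-System-Product-Name resolves only to a link-local IPv6 address (fe80::...), and with IPv6 disabled the c10d rendezvous can neither use it nor fall back cleanly. A hedged compose-level sketch of the two usual workarounds follows; the interface name eth0 is a placeholder, and whether the host-IP variable is spelled HOST_IP or VLLM_HOST_IP depends on the vLLM version.

```yaml
services:
  vllm:
    # net.ipv6.* sysctls are namespaced, so Docker can enable IPv6 inside
    # the container even when it is disabled in the host's default config.
    sysctls:
      - net.ipv6.conf.all.disable_ipv6=0
    environment:
      # Alternative workaround: pin the rendezvous sockets to a concrete
      # IPv4 interface. "eth0" is a placeholder; check `ip addr` inside
      # the container for the real name.
      - GLOO_SOCKET_IFNAME=eth0
      - NCCL_SOCKET_IFNAME=eth0
      # Spelled HOST_IP in older vLLM releases, VLLM_HOST_IP in newer ones.
      - VLLM_HOST_IP=172.16.8.232
```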