Open eii-lyl opened 1 day ago
Can you update your dependencies? I see this warning:
/usr/lib/python3/dist-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.4
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
Sure. I have spun up a new cloud server with 8 H100 GPUs, installed vllm==0.6.4.post1, and downgraded numpy to 1.24.0.
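For reference, the setup was roughly:

```bash
# Approximate commands for the setup described above
pip install vllm==0.6.4.post1
pip install numpy==1.24.0
```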
ubuntu@192-222-53-70:~$ vllm serve mistralai/Pixtral-Large-Instruct-2411 --config-format mistral --load-format mistral --tokenizer_mode mistral --limit_mm_per_prompt 'image=10' --tensor-parallel-size 8
2024-11-21 04:27:30.371946: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-21 04:27:30.382914: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-21 04:27:30.396353: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-21 04:27:30.400130: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-21 04:27:30.409408: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX512F AVX512_VNNI AVX512_BF16 AVX512_FP16 AVX_VNNI, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO 11-21 04:27:32 api_server.py:585] vLLM API server version 0.6.4.post1
INFO 11-21 04:27:32 api_server.py:586] args: Namespace(subparser='serve', model_tag='mistralai/Pixtral-Large-Instruct-2411', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='mistralai/Pixtral-Large-Instruct-2411', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='mistral', chat_template_text_format='string', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='mistral', config_format='mistral', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt={'image': 10}, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7b1338a7e050>)
INFO 11-21 04:27:32 api_server.py:175] Multiprocessing frontend to use ipc:///tmp/b4cb297b-963f-4173-9908-78648261e38e for IPC Path.
INFO 11-21 04:27:32 api_server.py:194] Started engine process with PID 9864
INFO 11-21 04:27:33 config.py:1861] Downcasting torch.float32 to torch.float16.
2024-11-21 04:27:35.311679: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-21 04:27:35.324952: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-21 04:27:35.328706: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO 11-21 04:27:38 config.py:1861] Downcasting torch.float32 to torch.float16.
INFO 11-21 04:27:42 config.py:1020] Defaulting to use mp for distributed inference
WARNING 11-21 04:27:42 arg_utils.py:1023] The model has a long context length (128000). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
WARNING 11-21 04:27:42 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-21 04:27:48 config.py:1020] Defaulting to use mp for distributed inference
WARNING 11-21 04:27:48 arg_utils.py:1023] The model has a long context length (128000). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
WARNING 11-21 04:27:48 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-21 04:27:48 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='mistralai/Pixtral-Large-Instruct-2411', speculative_config=None, tokenizer='mistralai/Pixtral-Large-Instruct-2411', skip_tokenizer_init=False, tokenizer_mode=mistral, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=128000, download_dir=None, load_format=LoadFormat.MISTRAL, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=mistralai/Pixtral-Large-Instruct-2411, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)
WARNING 11-21 04:27:49 multiproc_gpu_executor.py:56] Reducing Torch parallelism from 104 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 11-21 04:27:49 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 11-21 04:27:49 selector.py:135] Using Flash Attention backend.
2024-11-21 04:27:53.992586: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-21 04:27:53.992587: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-21 04:27:53.992586: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-21 04:27:53.992585: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-21 04:27:53.992585: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-21 04:27:53.992584: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-21 04:27:54.005294: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-21 04:27:54.005294: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-21 04:27:54.005298: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-21 04:27:54.005295: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-21 04:27:54.005295: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-21 04:27:54.005311: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-21 04:27:54.008847: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-21 04:27:54.008857: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-21 04:27:54.008860: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-21 04:27:54.008859: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-21 04:27:54.008860: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-21 04:27:54.008859: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-21 04:27:54.013482: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-21 04:27:54.026505: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-21 04:27:54.030518: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(VllmWorkerProcess pid=10395) INFO 11-21 04:27:56 selector.py:135] Using Flash Attention backend.
(VllmWorkerProcess pid=10395) INFO 11-21 04:27:56 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=10394) INFO 11-21 04:27:56 selector.py:135] Using Flash Attention backend.
(VllmWorkerProcess pid=10394) INFO 11-21 04:27:56 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=10397) INFO 11-21 04:27:56 selector.py:135] Using Flash Attention backend.
(VllmWorkerProcess pid=10397) INFO 11-21 04:27:56 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=10391) INFO 11-21 04:27:56 selector.py:135] Using Flash Attention backend.
(VllmWorkerProcess pid=10391) INFO 11-21 04:27:56 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=10396) INFO 11-21 04:27:56 selector.py:135] Using Flash Attention backend.
(VllmWorkerProcess pid=10396) INFO 11-21 04:27:56 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=10392) INFO 11-21 04:27:56 selector.py:135] Using Flash Attention backend.
(VllmWorkerProcess pid=10392) INFO 11-21 04:27:56 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=10393) INFO 11-21 04:27:56 selector.py:135] Using Flash Attention backend.
(VllmWorkerProcess pid=10393) INFO 11-21 04:27:56 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 11-21 04:28:01 utils.py:961] Found nccl from library libnccl.so.2
INFO 11-21 04:28:01 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=10391) INFO 11-21 04:28:01 utils.py:961] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=10394) INFO 11-21 04:28:01 utils.py:961] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=10391) INFO 11-21 04:28:01 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=10392) INFO 11-21 04:28:01 utils.py:961] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=10394) INFO 11-21 04:28:01 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=10393) INFO 11-21 04:28:01 utils.py:961] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=10396) INFO 11-21 04:28:01 utils.py:961] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=10395) INFO 11-21 04:28:01 utils.py:961] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=10397) INFO 11-21 04:28:01 utils.py:961] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=10392) INFO 11-21 04:28:01 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=10396) INFO 11-21 04:28:01 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=10395) INFO 11-21 04:28:01 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=10397) INFO 11-21 04:28:01 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=10393) INFO 11-21 04:28:01 pynccl.py:69] vLLM is using nccl==2.21.5
Task exception was never retrieved
future: <Task finished name='Task-2' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /home/ubuntu/.local/lib/python3.10/site-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
File "/home/ubuntu/.local/lib/python3.10/site-packages/zmq/_future.py", line 400, in poll
raise _zmq.ZMQError(_zmq.ENOTSUP)
zmq.error.ZMQError: Operation not supported
Task exception was never retrieved
future: <Task finished name='Task-3' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /home/ubuntu/.local/lib/python3.10/site-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
File "/home/ubuntu/.local/lib/python3.10/site-packages/zmq/_future.py", line 400, in poll
raise _zmq.ZMQError(_zmq.ENOTSUP)
zmq.error.ZMQError: Operation not supported
Traceback (most recent call last):
File "/home/ubuntu/.local/bin/vllm", line 8, in <module>
sys.exit(main())
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/scripts.py", line 195, in main
args.dispatch_function(args)
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/scripts.py", line 41, in serve
uvloop.run(run_server(args))
File "/home/ubuntu/.local/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/home/ubuntu/.local/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 609, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 113, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 210, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 24 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Can you try launching with `--disable-frontend-multiprocessing`? It might solve the issue, and if not, at least provide a more detailed stack trace.
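For reference, that would be the same launch command with the flag appended, roughly:

```bash
vllm serve mistralai/Pixtral-Large-Instruct-2411 \
  --config-format mistral --load-format mistral --tokenizer_mode mistral \
  --limit_mm_per_prompt 'image=10' --tensor-parallel-size 8 \
  --disable-frontend-multiprocessing
```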
I have tried that. The error becomes 'segmentation fault', nothing more.
Please follow the troubleshooting guide and see if it helps you find the line that is causing this segfault.
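For reference, the kind of settings the troubleshooting guide suggests for narrowing down a crash look roughly like this (check the current docs for the exact variable names in your version):

```bash
# Turn on verbose logging/tracing to locate the last call before the segfault
export VLLM_LOGGING_LEVEL=DEBUG   # more detailed vLLM logs
export CUDA_LAUNCH_BLOCKING=1     # surface CUDA errors at the failing call
export NCCL_DEBUG=TRACE           # verbose NCCL logs for multi-GPU setups
export VLLM_TRACE_FUNCTION=1      # record function calls to see where execution stops
```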
OK. Thank you.
Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-48-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version: 550.127.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 208
On-line CPU(s) list: 0-207
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8480+
CPU family: 6
Model: 143
Thread(s) per core: 2
Core(s) per socket: 52
Socket(s): 2
Stepping: 8
BogoMIPS: 4000.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd arat vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b fsrm md_clear serialize tsxldtrk avx512_fp16 arch_capabilities
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 6.5 MiB (208 instances)
L1i cache: 6.5 MiB (208 instances)
L2 cache: 416 MiB (104 instances)
L3 cache: 32 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-103
NUMA node1 CPU(s): 104-207
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Unknown: No mitigations
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; TSX disabled

Versions of relevant libraries:
[pip3] flake8==4.0.1
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] optree==0.13.0
[pip3] pyzmq==26.2.0
[pip3] torch==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.46.3
[pip3] triton==3.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.4
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    NV18  NV18  NV18  NV18  NV18  NV18  NV18  SYS   0-103         0              N/A
GPU1  NV18   X    NV18  NV18  NV18  NV18  NV18  NV18  SYS   0-103         0              N/A
GPU2  NV18  NV18   X    NV18  NV18  NV18  NV18  NV18  SYS   0-103         0              N/A
GPU3  NV18  NV18  NV18   X    NV18  NV18  NV18  NV18  SYS   0-103         0              N/A
GPU4  NV18  NV18  NV18  NV18   X    NV18  NV18  NV18  SYS   104-207       1              N/A
GPU5  NV18  NV18  NV18  NV18  NV18   X    NV18  NV18  SYS   104-207       1              N/A
GPU6  NV18  NV18  NV18  NV18  NV18  NV18   X    NV18  SYS   104-207       1              N/A
GPU7  NV18  NV18  NV18  NV18  NV18  NV18  NV18   X    SYS   104-207       1              N/A
NIC0  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS    X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:
  NIC0: mlx5_0

NCCL_IB_DISABLE=1
LD_LIBRARY_PATH=/home/ubuntu/.local/lib/python3.10/site-packages/cv2/../../lib64:
CUDA_MODULE_LOADING=LAZY
```

Model Input Dumps
No response
🐛 Describe the bug
Command: `vllm serve mistralai/Pixtral-Large-Instruct-2411 --config-format mistral --load-format mistral --tokenizer_mode mistral --limit_mm_per_prompt 'image=10' --tensor-parallel-size 8`
Full Log:
ubuntu@192-222-52-186:~$ vllm serve mistralai/Pixtral-Large-Instruct-2411 --config-format mistral --max-model-len 8192 --load-format mistral --tokenizer_mode mistral --limit_mm_per_prompt 'image=2' --tensor-parallel-size 8 2024-11-21 02:51:57.103583: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`
. 2024-11-21 02:51:57.114316: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered 2024-11-21 02:51:57.127862: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered 2024-11-21 02:51:57.131683: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered 2024-11-21 02:51:57.140952: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX512F AVX512_VNNI AVX512_BF16 AVX512_FP16 AVX_VNNI, in other operations, rebuild TensorFlow with the appropriate compiler flags. /usr/lib/python3/dist-packages/scipy/init.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.4 warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}" INFO 11-21 02:51:59 api_server.py:585] vLLM API server version 0.6.4 INFO 11-21 02:51:59 api_server.py:586] args: Namespace(subparser='serve', model_tag='mistralai/Pixtral-Large-Instruct-2411', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='mistralai/Pixtral-Large-Instruct-2411', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='mistral', chat_template_text_format='string', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='mistral', config_format='mistral', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8192, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt={'image': 2}, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, 
scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7027ec82b760>) INFO 11-21 02:51:59 api_server.py:175] Multiprocessing frontend to use ipc:///tmp/97d8ecb5-3898-4723-a05f-557905c58248 for IPC Path. INFO 11-21 02:51:59 api_server.py:194] Started engine process with PID 26719 INFO 11-21 02:52:00 config.py:1861] Downcasting torch.float32 to torch.float16. 2024-11-21 02:52:02.003750: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered 2024-11-21 02:52:02.016856: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered 2024-11-21 02:52:02.020606: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered /usr/lib/python3/dist-packages/scipy/init.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.4 warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}" INFO 11-21 02:52:04 config.py:1861] Downcasting torch.float32 to torch.float16. INFO 11-21 02:52:10 config.py:1020] Defaulting to use mp for distributed inference WARNING 11-21 02:52:10 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information. INFO 11-21 02:52:17 config.py:1020] Defaulting to use mp for distributed inference WARNING 11-21 02:52:17 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information. 
INFO 11-21 02:52:17 llm_engine.py:249] Initializing an LLM engine (v0.6.4) with config: model='mistralai/Pixtral-Large-Instruct-2411', speculative_config=None, tokenizer='mistralai/Pixtral-Large-Instruct-2411', skip_tokenizer_init=False, tokenizer_mode=mistral, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.MISTRAL, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=mistralai/Pixtral-Large-Instruct-2411, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None) WARNING 11-21 02:52:17 multiproc_gpu_executor.py:56] Reducing Torch parallelism from 104 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. INFO 11-21 02:52:17 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager INFO 11-21 02:52:17 selector.py:135] Using Flash Attention backend. 2024-11-21 02:52:22.805915: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered 2024-11-21 02:52:22.811565: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered 2024-11-21 02:52:22.818349: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered 2024-11-21 02:52:22.821875: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered 2024-11-21 02:52:22.824652: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered 2024-11-21 02:52:22.828287: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered 2024-11-21 02:52:22.896583: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered 2024-11-21 02:52:22.896915: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered 2024-11-21 02:52:22.896972: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered 2024-11-21 02:52:22.896978: E 
external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered 2024-11-21 02:52:22.897011: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered 2024-11-21 02:52:22.908970: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered 2024-11-21 02:52:22.909146: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered 2024-11-21 02:52:22.909256: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered 2024-11-21 02:52:22.909295: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered 2024-11-21 02:52:22.909345: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered 2024-11-21 02:52:22.912432: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered 2024-11-21 02:52:22.912563: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered 2024-11-21 02:52:22.912683: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered 2024-11-21 02:52:22.912747: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered 2024-11-21 02:52:22.912792: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered /usr/lib/python3/dist-packages/scipy/init.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.4 warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}" /usr/lib/python3/dist-packages/scipy/init.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.4 warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}" /usr/lib/python3/dist-packages/scipy/init.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.4 warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}" /usr/lib/python3/dist-packages/scipy/init.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.4 warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}" /usr/lib/python3/dist-packages/scipy/init.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 
is required for this version of SciPy (detected version 1.26.4 warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}" /usr/lib/python3/dist-packages/scipy/init.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.4 warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}" /usr/lib/python3/dist-packages/scipy/init.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.4 warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}" (VllmWorkerProcess pid=27340) INFO 11-21 02:52:25 selector.py:135] Using Flash Attention backend. (VllmWorkerProcess pid=27340) INFO 11-21 02:52:25 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=27344) INFO 11-21 02:52:25 selector.py:135] Using Flash Attention backend. (VllmWorkerProcess pid=27344) INFO 11-21 02:52:25 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=27345) INFO 11-21 02:52:25 selector.py:135] Using Flash Attention backend. (VllmWorkerProcess pid=27345) INFO 11-21 02:52:25 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=27339) INFO 11-21 02:52:25 selector.py:135] Using Flash Attention backend. (VllmWorkerProcess pid=27339) INFO 11-21 02:52:25 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=27343) INFO 11-21 02:52:25 selector.py:135] Using Flash Attention backend. (VllmWorkerProcess pid=27343) INFO 11-21 02:52:25 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=27342) INFO 11-21 02:52:25 selector.py:135] Using Flash Attention backend. (VllmWorkerProcess pid=27342) INFO 11-21 02:52:25 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=27341) INFO 11-21 02:52:25 selector.py:135] Using Flash Attention backend. 
(VllmWorkerProcess pid=27341) INFO 11-21 02:52:25 multiproc_worker_utils.py:215] Worker ready; awaiting tasks INFO 11-21 02:52:30 utils.py:960] Found nccl from library libnccl.so.2 INFO 11-21 02:52:30 pynccl.py:69] vLLM is using nccl==2.21.5 (VllmWorkerProcess pid=27339) INFO 11-21 02:52:30 utils.py:960] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=27339) INFO 11-21 02:52:30 pynccl.py:69] vLLM is using nccl==2.21.5 (VllmWorkerProcess pid=27340) INFO 11-21 02:52:30 utils.py:960] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=27341) INFO 11-21 02:52:30 utils.py:960] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=27342) INFO 11-21 02:52:30 utils.py:960] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=27345) INFO 11-21 02:52:30 utils.py:960] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=27340) INFO 11-21 02:52:30 pynccl.py:69] vLLM is using nccl==2.21.5 (VllmWorkerProcess pid=27344) INFO 11-21 02:52:30 utils.py:960] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=27343) INFO 11-21 02:52:30 utils.py:960] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=27345) INFO 11-21 02:52:30 pynccl.py:69] vLLM is using nccl==2.21.5 (VllmWorkerProcess pid=27342) INFO 11-21 02:52:30 pynccl.py:69] vLLM is using nccl==2.21.5 (VllmWorkerProcess pid=27341) INFO 11-21 02:52:30 pynccl.py:69] vLLM is using nccl==2.21.5 (VllmWorkerProcess pid=27344) INFO 11-21 02:52:30 pynccl.py:69] vLLM is using nccl==2.21.5 (VllmWorkerProcess pid=27343) INFO 11-21 02:52:30 pynccl.py:69] vLLM is using nccl==2.21.5 Task exception was never retrieved future: <Task finished name='Task-2' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /home/ubuntu/.local/lib/python3.10/site-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/home/ubuntu/.local/lib/python3.10/site-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported Task exception was never retrieved future: <Task finished name='Task-3' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /home/ubuntu/.local/lib/python3.10/site-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/home/ubuntu/.local/lib/python3.10/site-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported Traceback (most recent call last): File "/home/ubuntu/.local/bin/vllm", line 8, in <module> sys.exit(main()) File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/scripts.py", line 195, in main args.dispatch_function(args) File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/scripts.py", line 41, in serve uvloop.run(run_server(args)) File "/home/ubuntu/.local/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run return loop.run_until_complete(wrapper()) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete File "/home/ubuntu/.local/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper return await main File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 609, in run_server async with build_async_engine_client(args) as engine_client: File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ return await anext(self.gen) File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 113, in build_async_engine_client async with build_async_engine_client_from_engine_args( File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ return await anext(self.gen) File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 210, in build_async_engine_client_from_engine_args raise RuntimeError( RuntimeError: Engine process failed to start. See stack trace for the root cause. /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 24 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d '

Before submitting a new issue...