vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Docker w/ CPU fails when defining VLLM_CPU_OMP_THREADS_BIND #10556

Closed ccruttjr closed 2 hours ago

ccruttjr commented 4 days ago

Your current environment

Details

```text PyTorch version: 2.5.1+cu124 Is debug build: False CUDA used to build PyTorch: 12.4 ROCM used to build PyTorch: N/A OS: Oracle Linux Server 8.10 (x86_64) GCC version: (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3.0.1) Clang version: Could not collect CMake version: version 3.26.5 Libc version: glibc-2.28 Python version: 3.12.7 | packaged by Anaconda, Inc. | (main, Oct 4 2024, 13:27:36) [GCC 11.2.0] (64-bit runtime) Python platform: Linux-5.15.0-301.163.5.2.el8uek.x86_64-x86_64-with-glibc2.28 Is CUDA available: True CUDA runtime version: 12.6.77 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA L40S GPU 1: NVIDIA L40S Nvidia driver version: 560.35.03 cuDNN version: Could not collect HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 128 On-line CPU(s) list: 0-127 Thread(s) per core: 2 Core(s) per socket: 32 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel BIOS Vendor ID: Intel CPU family: 6 Model: 143 Model name: Intel(R) Xeon(R) Gold 6448Y BIOS Model name: Intel(R) Xeon(R) Gold 6448Y Stepping: 8 CPU MHz: 2100.000 BogoMIPS: 4200.00 Virtualization: VT-x L1d cache: 48K L1i cache: 32K L2 cache: 2048K L3 cache: 61440K NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126 NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95,97,99,101,103,105,107,109,111,113,115,117,119,121,123,125,127 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities Versions of relevant libraries: [pip3] numpy==1.26.4 [pip3] nvidia-cublas-cu12==12.4.5.8 [pip3] nvidia-cuda-cupti-cu12==12.4.127 [pip3] nvidia-cuda-nvrtc-cu12==12.4.127 [pip3] nvidia-cuda-runtime-cu12==12.4.127 [pip3] nvidia-cudnn-cu12==9.1.0.70 [pip3] nvidia-cufft-cu12==11.2.1.3 [pip3] nvidia-curand-cu12==10.3.5.147 [pip3] nvidia-cusolver-cu12==11.6.1.9 [pip3] nvidia-cusparse-cu12==12.3.1.170 [pip3] nvidia-ml-py==12.560.30 [pip3] nvidia-nccl-cu12==2.21.5 [pip3] nvidia-nvjitlink-cu12==12.4.127 [pip3] nvidia-nvtx-cu12==12.4.127 [pip3] pyzmq==26.2.0 [pip3] torch==2.5.1 [pip3] torchvision==0.20.1 [pip3] 
transformers==4.46.2 [pip3] triton==3.1.0 [conda] No relevant packages ROCM Version: Could not collect Neuron SDK Version: N/A vLLM Version: 0.6.4.post1 vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled GPU Topology: GPU0 GPU1 NIC0 NIC1 NIC2 NIC3 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X SYS NODE NODE SYS SYS 0,2,4,6,8,10 0 N/A GPU1 SYS X SYS SYS NODE NODE 1,3,5,7,9,11 1 N/A NIC0 NODE SYS X PIX SYS SYS NIC1 NODE SYS PIX X SYS SYS NIC2 SYS NODE SYS SYS X PIX NIC3 SYS NODE SYS SYS PIX X Legend: X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks NIC Legend: NIC0: mlx5_0 NIC1: mlx5_1 NIC2: mlx5_2 NIC3: mlx5_3 LD_LIBRARY_PATH=/root/.vllmPythonVenv/lib/python3.12/site-packages/cv2/../../lib64:/opt/rh/gcc-toolset-11/root/usr/lib64:/opt/rh/gcc-toolset-11/root/usr/lib:/opt/rh/gcc-toolset-11/root/usr/lib64/dyninst:/opt/rh/gcc-toolset-11/root/usr/lib/dyninst:/usr/local/cuda/lib64:/opt/rh/gcc-toolset-11/root/usr/lib64:/opt/rh/gcc-toolset-11/root/usr/lib:/opt/rh/gcc-toolset-11/root/usr/lib64/dyninst:/opt/rh/gcc-toolset-11/root/usr/lib/dyninst:/usr/local/cuda/lib64:/opt/rh/gcc-toolset-11/root/usr/lib64:/opt/rh/gcc-toolset-11/root/usr/lib:/opt/rh/gcc-toolset-11/root/usr/lib64/dyninst:/opt/rh/gcc-toolset-11/root/usr/lib/dyninst CUDA_MODULE_LOADING=LAZY ```

How would you like to use vllm

Here is what I did, following the CPU installation guide and the "deploying with Docker" guide:

```bash
sudo su
cd
git clone https://github.com/vllm-project/vllm.git
cd vllm
docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=4g .
docker run -it --rm --network=host --ipc=host \
    -v /local/apps/tools/aiModels:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<token>" \
    --env "VLLM_CPU_KVCACHE_SPACE=512" \
    vllm-cpu-env --model meta-llama/Llama-3.1-70B-Instruct
```
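To sanity-check that the baseline container is actually serving, I hit the OpenAI-compatible endpoint it exposes (a quick sketch; port 8000 is the vLLM default and is reachable directly because of --network=host, adjust if you override it):

```bash
# List the served model, then request a tiny completion.
curl -s http://localhost:8000/v1/models

curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.1-70B-Instruct", "prompt": "Hello", "max_tokens": 8}'
```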

This works fine; it is when I attempt to set VLLM_CPU_OMP_THREADS_BIND and -tp 2 that I run into issues. Since I have 128 logical CPUs across 2 sockets (64 on each), I tried:

```bash
docker run -it --rm --network=host --ipc=host \
    -v /local/apps/tools/aiModels:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<token>" \
    --env "VLLM_CPU_KVCACHE_SPACE=512" \
    --env "VLLM_CPU_OMP_THREADS_BIND=0-63|64-127" \
    vllm-cpu-env --model meta-llama/Llama-3.1-70B-Instruct -tp 2
```

and got a NUMA error:

Details ```log INFO 11-21 23:44:46 api_server.py:592] vLLM API server version 0.6.4.post2.dev87+ge7a8341c INFO 11-21 23:44:46 api_server.py:593] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Llama-3.1-70B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False) INFO 11-21 23:44:46 __init__.py:44] No plugins found. INFO 11-21 23:44:46 api_server.py:176] Multiprocessing frontend to use ipc:///tmp/f115fb57-86b5-4d52-8e1b-fd21c1a18834 for IPC Path. INFO 11-21 23:44:46 api_server.py:195] Started engine process with PID 74 INFO 11-21 23:44:50 __init__.py:44] No plugins found. 
INFO 11-21 23:44:51 config.py:354] This model supports multiple tasks: {'generate', 'embedding'}. Defaulting to 'generate'. INFO 11-21 23:44:51 config.py:1025] Defaulting to use mp for distributed inference WARNING 11-21 23:44:51 _logger.py:72] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value. WARNING 11-21 23:44:51 _logger.py:72] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information. WARNING 11-21 23:44:51 _logger.py:72] Async output processing is only supported for CUDA, TPU, XPU and HPU.Disabling it for other platforms. WARNING 11-21 23:44:51 _logger.py:72] CUDA graph is not supported on CPU, fallback to the eager mode. INFO 11-21 23:44:54 config.py:354] This model supports multiple tasks: {'generate', 'embedding'}. Defaulting to 'generate'. INFO 11-21 23:44:54 config.py:1025] Defaulting to use mp for distributed inference WARNING 11-21 23:44:54 _logger.py:72] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value. WARNING 11-21 23:44:54 _logger.py:72] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information. WARNING 11-21 23:44:54 _logger.py:72] Async output processing is only supported for CUDA, TPU, XPU and HPU.Disabling it for other platforms. WARNING 11-21 23:44:54 _logger.py:72] CUDA graph is not supported on CPU, fallback to the eager mode. 
INFO 11-21 23:44:54 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post2.dev87+ge7a8341c) with config: model='meta-llama/Llama-3.1-70B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-70B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Llama-3.1-70B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, mm_processor_kwargs=None, pooler_config=None,compilation_config=CompilationConfig(level=0, backend='', custom_ops=[], splitting_ops=['vllm.unified_flash_attention', 'vllm.unified_flash_infer', 'vllm.unified_v1_flash_attention'], use_inductor=True, inductor_specialize_for_cudagraph_no_more_than=None, inductor_compile_sizes={}, inductor_compile_config={}, inductor_passes={}, use_cudagraph=False, cudagraph_num_of_warmups=0, cudagraph_capture_sizes=None, cudagraph_copy_inputs=False, pass_config=PassConfig(dump_graph_stages=[], dump_graph_dir=PosixPath('.'), enable_fusion=True, enable_reshape=True), compile_sizes=, capture_sizes=, enabled_custom_ops=Counter(), disabled_custom_ops=Counter()) INFO 11-21 23:44:54 cpu.py:31] Cannot use _Backend.FLASH_ATTN backend on CPU. INFO 11-21 23:44:54 selector.py:141] Using Torch SDPA backend. INFO 11-21 23:44:54 importing.py:15] Triton not installed or not compatible; certain GPU-related functions will not be available. get_mempolicy: Operation not permitted ERROR 11-21 23:44:54 engine.py:366] numa_migrate_pages failed. 
errno: 1 ERROR 11-21 23:44:54 engine.py:366] Traceback (most recent call last): ERROR 11-21 23:44:54 engine.py:366] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine ERROR 11-21 23:44:54 engine.py:366] engine = MQLLMEngine.from_engine_args(engine_args=engine_args, ERROR 11-21 23:44:54 engine.py:366] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args ERROR 11-21 23:44:54 engine.py:366] return cls(ipc_path=ipc_path, ERROR 11-21 23:44:54 engine.py:366] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 71, in __init__ ERROR 11-21 23:44:54 engine.py:366] self.engine = LLMEngine(*args, **kwargs) ERROR 11-21 23:44:54 engine.py:366] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 347, in __init__ ERROR 11-21 23:44:54 engine.py:366] self.model_executor = executor_class(vllm_config=vllm_config, ) ERROR 11-21 23:44:54 engine.py:366] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 36, in __init__ ERROR 11-21 23:44:54 engine.py:366] self._init_executor() ERROR 11-21 23:44:54 engine.py:366] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 110, in _init_executor ERROR 11-21 23:44:54 engine.py:366] self._run_workers("init_device") ERROR 11-21 23:44:54 engine.py:366] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 171, in _run_workers ERROR 11-21 23:44:54 engine.py:366] driver_worker_output = self.driver_method_invoker( ERROR 11-21 23:44:54 engine.py:366] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 307, in _driver_method_invoker ERROR 11-21 23:44:54 engine.py:366] return getattr(driver, method)(*args, **kwargs) ERROR 11-21 23:44:54 engine.py:366] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_worker.py", line 197, in init_device ERROR 11-21 23:44:54 engine.py:366] ret = torch.ops._C_utils.init_cpu_threads_env(self.local_omp_cpuid) ERROR 11-21 23:44:54 engine.py:366] File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116, in __call__ ERROR 11-21 23:44:54 engine.py:366] return self._op(*args, **(kwargs or {})) ERROR 11-21 23:44:54 engine.py:366] RuntimeError: numa_migrate_pages failed. 
errno: 1 Process SpawnProcess-1: ERROR 11-21 23:44:54 multiproc_worker_utils.py:116] Worker VllmWorkerProcess pid 275 died, exit code: -15 INFO 11-21 23:44:54 multiproc_worker_utils.py:120] Killing local vLLM worker processes Traceback (most recent call last): File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap self.run() File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run self._target(*self._args, **self._kwargs) File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 368, in run_mp_engine raise e File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine engine = MQLLMEngine.from_engine_args(engine_args=engine_args, File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args return cls(ipc_path=ipc_path, File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 71, in __init__ self.engine = LLMEngine(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 347, in __init__ self.model_executor = executor_class(vllm_config=vllm_config, ) File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 36, in __init__ self._init_executor() File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 110, in _init_executor self._run_workers("init_device") File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 171, in _run_workers driver_worker_output = self.driver_method_invoker( File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 307, in _driver_method_invoker return getattr(driver, method)(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_worker.py", line 197, in init_device ret = torch.ops._C_utils.init_cpu_threads_env(self.local_omp_cpuid) File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116, in __call__ return self._op(*args, **(kwargs or {})) RuntimeError: numa_migrate_pages failed. errno: 1 Traceback (most recent call last): File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/lib/python3.10/runpy.py", line 86, in _run_code exec(code, run_globals) File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 650, in uvloop.run(run_server(args)) File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run return loop.run_until_complete(wrapper()) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper return await main File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 616, in run_server async with build_async_engine_client(args) as engine_client: File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ return await anext(self.gen) File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 114, in build_async_engine_client async with build_async_engine_client_from_engine_args( File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ return await anext(self.gen) File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 211, in build_async_engine_client_from_engine_args raise RuntimeError( RuntimeError: Engine process failed to start. 
See stack trace for the root cause. ```

Also, just as an FYI, simply setting --env "OMP_NUM_THREADS=127" works, although then I cannot set -tp 2. Since my NUMA nodes alternate core IDs (per the lscpu output above, node 0 holds the even IDs and node 1 the odd ones), I also tried 0,2,4,6...124,126|1,3,5,7...125,127 instead:

```bash
docker run -it --rm --network=host --ipc=host \
    -v /local/apps/tools/aiModels:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<token>" \
    --env "VLLM_CPU_KVCACHE_SPACE=512" \
    --env "VLLM_CPU_OMP_THREADS_BIND=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126|1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95,97,99,101,103,105,107,109,111,113,115,117,119,121,123,125,127" \
    vllm-cpu-env --model meta-llama/Llama-3.1-70B-Instruct -tp 2
```

which also failed.
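For reference, this is the quick check I ran on the host to see which logical CPU IDs sit on which NUMA node before building that bind string (plain lscpu / numactl; the exact IDs obviously depend on the machine):

```bash
# Print the CPU list per NUMA node; on this box node0 holds the even IDs
# and node1 the odd ones, which is why the bind string above alternates.
lscpu | grep -E '^NUMA node[0-9]+ CPU'
numactl --hardware | grep cpus
```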

ChatGPT also recommended prepending numactl --interleave=all to my docker command, but that didn't work either.

Any ideas on what I am missing?

zhouyuan commented 4 days ago

Hi @ccruttjr, could you please do a quick test with "--privileged=true" in docker run?
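If full --privileged is not acceptable in your environment, a narrower variant may also be worth a try. This is only a sketch I have not verified on your exact setup, but my understanding is that Docker's default seccomp profile is what rejects the NUMA syscalls here (get_mempolicy, mbind, migrate_pages) unless CAP_SYS_NICE is granted:

```bash
# Same command as before, relaxing seccomp and adding CAP_SYS_NICE
# instead of running fully privileged.
docker run -it --rm --network=host --ipc=host \
    --security-opt seccomp=unconfined --cap-add SYS_NICE \
    -v /local/apps/tools/aiModels:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<token>" \
    --env "VLLM_CPU_KVCACHE_SPACE=512" \
    --env "VLLM_CPU_OMP_THREADS_BIND=0-63|64-127" \
    vllm-cpu-env --model meta-llama/Llama-3.1-70B-Instruct -tp 2
```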

thanks, -yuan

ccruttjr commented 4 days ago

> Hi @ccruttjr, could you please do a quick test with "--privileged=true" in docker run?
>
> thanks, -yuan

@zhouyuan

Progress! But it is still failing. I reran the two examples I showed originally with --privileged=true added, as you suggested, and got the new logs below, which show a different failure point. I also tried prepending numactl --interleave=all on separate runs and got the same outcome.

LOGS

``` $ docker run --privileged=true -it --rm --network=host --ipc=host -v /local/apps/tools/aiModels:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=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" --env "VLLM_CPU_KVCACHE_SPACE=512" --env "VLLM_CPU_OMP_THREADS_BIND=0-63|64-127" vllm-cpu-env --model meta-llama/Llama-3.1-70B-Instruct -tp 2 [358/1665]INFO 11-22 01:30:05 api_server.py:592] vLLM API server version 0.6.4.post2.dev87+ge7a8341c INFO 11-22 01:30:05 api_server.py:593] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Llama-3.1-70B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_slidin g_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_d ecoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, 
enable_prompt_tokens_details=False) INFO 11-22 01:30:05 __init__.py:44] No plugins found. INFO 11-22 01:30:05 api_server.py:176] Multiprocessing frontend to use ipc:///tmp/e187a400-5aeb-4c3e-89d4-8cedc59adc58 for IPC Path. INFO 11-22 01:30:05 api_server.py:195] Started engine process with PID 73 INFO 11-22 01:30:09 __init__.py:44] No plugins found. INFO 11-22 01:30:10 config.py:354] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'. INFO 11-22 01:30:10 config.py:1025] Defaulting to use mp for distributed inference WARNING 11-22 01:30:10 _logger.py:72] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value. WARNING 11-22 01:30:10 _logger.py:72] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information. WARNING 11-22 01:30:10 _logger.py:72] Async output processing is only supported for CUDA, TPU, XPU and HPU.Disabling it for other platforms. WARNING 11-22 01:30:10 _logger.py:72] CUDA graph is not supported on CPU, fallback to the eager mode. INFO 11-22 01:30:14 config.py:354] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'. INFO 11-22 01:30:14 config.py:1025] Defaulting to use mp for distributed inference WARNING 11-22 01:30:14 _logger.py:72] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value. WARNING 11-22 01:30:14 _logger.py:72] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information. WARNING 11-22 01:30:14 _logger.py:72] Async output processing is only supported for CUDA, TPU, XPU and HPU.Disabling it for other platforms. WARNING 11-22 01:30:14 _logger.py:72] CUDA graph is not supported on CPU, fallback to the eager mode. 
INFO 11-22 01:30:14 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post2.dev87+ge7a8341c) with config: model='meta-llama/Llama-3.1-70B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-70B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Llama-3.1-70B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, mm_processor_kwargs=None, pooler_config=None,compilation_config=CompilationConfig(level=0, backend='', custom_ops=[], splitting_ops=['vllm.unified_flash_attention', 'vllm.unified_fl ash_infer', 'vllm.unified_v1_flash_attention'], use_inductor=True, inductor_specialize_for_cudagraph_no_more_than=None, inductor_compile_sizes={}, inductor_compile_config={}, inductor_passes={}, use_cudagraph=False, cudagraph_num_of_warmups=0, cudagraph_capture_sizes=None, cudagraph_copy_inputs=False, pass_config=PassConfig(dump_graph_stages=[], dump_graph_dir=PosixPath('.'), enable_fusion=True, enable_reshape=True), compile_sizes=, capture_sizes=, enabled_custom_ops=Counter(), disabled_custom_ops=Counter()) INFO 11-22 01:30:14 cpu.py:31] Cannot use _Backend.FLASH_ATTN backend on CPU. INFO 11-22 01:30:14 selector.py:141] Using Torch SDPA backend. INFO 11-22 01:30:14 importing.py:15] Triton not installed or not compatible; certain GPU-related functions will not be available. (VllmWorkerProcess pid=274) WARNING 11-22 01:30:14 _logger.py:72] `mm_limits` has already been set for model=meta-llama/Llama-3.1-70B-Instruct, and will be overwritten by the new values. 
(VllmWorkerProcess pid=274) INFO 11-22 01:30:14 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP threads binding of Process 274: (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 274, core 64 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 403, core 65 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 404, core 66 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 405, core 67 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 406, core 68 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 407, core 69 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 408, core 70 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 409, core 71 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 410, core 72 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 411, core 73 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 412, core 74 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 413, core 75 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 414, core 76 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 415, core 77 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 416, core 78 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 417, core 79 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 418, core 80 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 419, core 81 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 420, core 82 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 421, core 83 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 422, core 84 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 423, core 85 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 424, core 86 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 425, core 87 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 426, core 88 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 427, core 89 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 428, core 90 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 429, core 91 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 430, core 92 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 431, core 93 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 432, core 94 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 433, core 95 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 434, core 96 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 435, core 97 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 436, core 98 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 437, core 99 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 438, core 100 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 
cpu_worker.py:199] OMP tid: 439, core 101 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 440, core 102 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 441, core 103 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 442, core 104 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 443, core 105 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 444, core 106 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 445, core 107 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 446, core 108 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 447, core 109 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 448, core 110 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 449, core 111 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 450, core 112 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 451, core 113 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 452, core 114 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 453, core 115 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 454, core 116 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 455, core 117 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 456, core 118 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 457, core 119 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 458, core 120 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 459, core 121 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 460, core 122 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 461, core 123 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 462, core 124 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 463, core 125 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 464, core 126 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] OMP tid: 465, core 127 (VllmWorkerProcess pid=274) INFO 11-22 01:30:14 cpu_worker.py:199] INFO 11-22 01:30:15 cpu_worker.py:199] OMP threads binding of Process 73: INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 73, core 0 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 591, core 1 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 592, core 2 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 593, core 3 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 594, core 4 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 595, core 5 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 596, core 6 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 597, core 7 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 598, core 8 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 599, core 9 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 600, core 10 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 601, core 11 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 602, core 12 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 603, core 13 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 604, core 14 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 605, core 15 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 606, core 16 INFO 
11-22 01:30:15 cpu_worker.py:199] OMP tid: 607, core 17 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 608, core 18 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 609, core 19 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 610, core 20 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 611, core 21 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 612, core 22 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 613, core 23 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 614, core 24 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 615, core 25 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 616, core 26 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 617, core 27 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 618, core 28 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 619, core 29 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 620, core 30 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 621, core 31 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 622, core 32 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 623, core 33 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 624, core 34 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 625, core 35 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 626, core 36 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 627, core 37 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 628, core 38 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 629, core 39 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 630, core 40 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 631, core 41 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 632, core 42 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 633, core 43 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 634, core 44 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 635, core 45 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 636, core 46 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 637, core 47 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 638, core 48 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 639, core 49 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 640, core 50 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 641, core 51 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 642, core 52 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 643, core 53 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 644, core 54 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 645, core 55 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 646, core 56 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 647, core 57 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 648, core 58 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 649, core 59 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 650, core 60 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 651, core 61 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 652, core 62 INFO 11-22 01:30:15 cpu_worker.py:199] OMP tid: 653, core 63 INFO 11-22 01:30:15 cpu_worker.py:199] INFO 11-22 01:30:15 shm_broadcast.py:236] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=, local_subscribe_port=36421, remote_subscribe_port=None) INFO 11-22 01:30:16 weight_utils.py:243] Using model weights format ['*.safetensors'] (VllmWorkerProcess pid=274) INFO 11-22 01:30:16 weight_utils.py:243] Using model weights format ['*.safetensors'] Loading safetensors checkpoint shards: 0% Completed | 0/30 [00:00 exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in 
run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 01:32:31,352 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 01:32:31,353 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 01:32:31,353 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 01:32:31,354 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 01:32:31,354 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 01:32:31,354 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 01:32:31,355 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> 
Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 01:32:31,355 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 01:32:31,356 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 01:32:31,356 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 01:32:31,356 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 01:32:31,357 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported Traceback (most recent call last): File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/lib/python3.10/runpy.py", line 86, in _run_code exec(code, run_globals) File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 650, in uvloop.run(run_server(args)) File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run return loop.run_until_complete(wrapper()) File "uvloop/loop.pyx", line 1518, in 
uvloop.loop.Loop.run_until_complete File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper return await main File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 616, in run_server async with build_async_engine_client(args) as engine_client: File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ return await anext(self.gen) File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 114, in build_async_engine_client async with build_async_engine_client_from_engine_args( File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ return await anext(self.gen) File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 211, in build_async_engine_client_from_engine_args raise RuntimeError( RuntimeError: Engine process failed to start. See stack trace for the root cause. ```

zhouyuan commented 4 days ago

It looks like the issue is in ZMQ now; can you please also try adding --shm-size=4g to docker run?

```
future: <Task finished name='Task-2' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
    while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
  File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll
    raise _zmq.ZMQError(_zmq.ENOTSUP)
```
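To see how much shared memory the running container actually gets, you can also check /dev/shm inside it (a sketch; replace <container-id> with whatever docker ps shows for the vllm-cpu-env container):

```bash
# Docker defaults /dev/shm to 64 MB unless --shm-size is passed
# (or the host's /dev/shm is shared via --ipc=host).
docker exec <container-id> df -h /dev/shm
```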

thanks, -yuan

ccruttjr commented 3 days ago

> It looks like the issue is in ZMQ now; can you please also try adding --shm-size=4g to docker run?

@zhouyuan

Tried that and got two different errors depending on how I "bound" the cores.

VLLM_CPU_OMP_THREADS_BIND=0-63|64-127

``` docker run --shm-size=4g --privileged=true -it --rm --network=host --ipc=host -v /local/apps/tools/aiModels:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=" --env "VLLM_CPU_KVCACHE_SPACE=512" --env "VLLM_CPU_OMP_THREADS_BIND=0-63|64-127" vllm-cpu-env --model meta-llama/Llama-3.1-70B-Instruct -tp 2 INFO 11-22 04:48:58 api_server.py:592] vLLM API server version 0.6.4.post2.dev87+ge7a8341c INFO 11-22 04:48:58 api_server.py:593] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Llama-3.1-70B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False) INFO 11-22 
04:48:58 __init__.py:44] No plugins found. INFO 11-22 04:48:58 api_server.py:176] Multiprocessing frontend to use ipc:///tmp/2eee7fed-b613-4251-ade9-f77bde1bdd22 for IPC Path. INFO 11-22 04:48:58 api_server.py:195] Started engine process with PID 73 INFO 11-22 04:49:02 __init__.py:44] No plugins found. INFO 11-22 04:49:04 config.py:354] This model supports multiple tasks: {'generate', 'embedding'}. Defaulting to 'generate'. INFO 11-22 04:49:04 config.py:1025] Defaulting to use mp for distributed inference WARNING 11-22 04:49:04 _logger.py:72] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value. WARNING 11-22 04:49:04 _logger.py:72] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information. WARNING 11-22 04:49:04 _logger.py:72] Async output processing is only supported for CUDA, TPU, XPU and HPU.Disabling it for other platforms. WARNING 11-22 04:49:04 _logger.py:72] CUDA graph is not supported on CPU, fallback to the eager mode. INFO 11-22 04:49:07 config.py:354] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'. INFO 11-22 04:49:07 config.py:1025] Defaulting to use mp for distributed inference WARNING 11-22 04:49:07 _logger.py:72] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value. WARNING 11-22 04:49:07 _logger.py:72] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information. WARNING 11-22 04:49:07 _logger.py:72] Async output processing is only supported for CUDA, TPU, XPU and HPU.Disabling it for other platforms. WARNING 11-22 04:49:07 _logger.py:72] CUDA graph is not supported on CPU, fallback to the eager mode. 
INFO 11-22 04:49:07 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post2.dev87+ge7a8341c) with config: model='meta-llama/Llama-3.1-70B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-70B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Llama-3.1-70B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, mm_processor_kwargs=None, pooler_config=None,compilation_config=CompilationConfig(level=0, backend='', custom_ops=[], splitting_ops=['vllm.unified_flash_attention', 'vllm.unified_flash_infer', 'vllm.unified_v1_flash_attention'], use_inductor=True, inductor_specialize_for_cudagraph_no_more_than=None, inductor_compile_sizes={}, inductor_compile_config={}, inductor_passes={}, use_cudagraph=False, cudagraph_num_of_warmups=0, cudagraph_capture_sizes=None, cudagraph_copy_inputs=False, pass_config=PassConfig(dump_graph_stages=[], dump_graph_dir=PosixPath('.'), enable_fusion=True, enable_reshape=True), compile_sizes=, capture_sizes=, enabled_custom_ops=Counter(), disabled_custom_ops=Counter()) INFO 11-22 04:49:08 cpu.py:31] Cannot use _Backend.FLASH_ATTN backend on CPU. INFO 11-22 04:49:08 selector.py:141] Using Torch SDPA backend. INFO 11-22 04:49:08 importing.py:15] Triton not installed or not compatible; certain GPU-related functions will not be available. (VllmWorkerProcess pid=274) WARNING 11-22 04:49:08 _logger.py:72] `mm_limits` has already been set for model=meta-llama/Llama-3.1-70B-Instruct, and will be overwritten by the new values. 
(VllmWorkerProcess pid=274) INFO 11-22 04:49:08 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP threads binding of Process 274: (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 274, core 64 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 403, core 65 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 404, core 66 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 405, core 67 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 406, core 68 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 407, core 69 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 408, core 70 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 409, core 71 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 410, core 72 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 411, core 73 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 412, core 74 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 413, core 75 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 414, core 76 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 415, core 77 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 416, core 78 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 417, core 79 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 418, core 80 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 419, core 81 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 420, core 82 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 421, core 83 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 422, core 84 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 423, core 85 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 424, core 86 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 425, core 87 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 426, core 88 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 427, core 89 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 428, core 90 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 429, core 91 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 430, core 92 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 431, core 93 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 432, core 94 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 433, core 95 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 434, core 96 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 435, core 97 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 436, core 98 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 437, core 99 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 438, core 100 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 
cpu_worker.py:199] OMP tid: 439, core 101 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 440, core 102 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 441, core 103 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 442, core 104 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 443, core 105 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 444, core 106 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 445, core 107 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 446, core 108 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 447, core 109 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 448, core 110 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 449, core 111 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 450, core 112 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 451, core 113 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 452, core 114 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 453, core 115 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 454, core 116 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 455, core 117 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 456, core 118 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 457, core 119 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 458, core 120 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 459, core 121 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 460, core 122 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 461, core 123 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 462, core 124 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 463, core 125 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 464, core 126 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 465, core 127 (VllmWorkerProcess pid=274) INFO 11-22 04:49:08 cpu_worker.py:199] INFO 11-22 04:49:08 cpu_worker.py:199] OMP threads binding of Process 73: INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 73, core 0 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 591, core 1 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 592, core 2 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 593, core 3 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 594, core 4 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 595, core 5 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 596, core 6 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 597, core 7 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 598, core 8 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 599, core 9 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 600, core 10 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 601, core 11 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 602, core 12 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 603, core 13 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 604, core 14 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 605, core 15 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 606, core 16 INFO 
11-22 04:49:08 cpu_worker.py:199] OMP tid: 607, core 17 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 608, core 18 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 609, core 19 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 610, core 20 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 611, core 21 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 612, core 22 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 613, core 23 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 614, core 24 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 615, core 25 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 616, core 26 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 617, core 27 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 618, core 28 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 619, core 29 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 620, core 30 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 621, core 31 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 622, core 32 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 623, core 33 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 624, core 34 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 625, core 35 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 626, core 36 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 627, core 37 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 628, core 38 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 629, core 39 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 630, core 40 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 631, core 41 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 632, core 42 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 633, core 43 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 634, core 44 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 635, core 45 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 636, core 46 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 637, core 47 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 638, core 48 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 639, core 49 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 640, core 50 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 641, core 51 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 642, core 52 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 643, core 53 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 644, core 54 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 645, core 55 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 646, core 56 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 647, core 57 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 648, core 58 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 649, core 59 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 650, core 60 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 651, core 61 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 652, core 62 INFO 11-22 04:49:08 cpu_worker.py:199] OMP tid: 653, core 63 INFO 11-22 04:49:08 cpu_worker.py:199] INFO 11-22 04:49:09 shm_broadcast.py:236] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=, local_subscribe_port=35367, remote_subscribe_port=None) (VllmWorkerProcess pid=274) INFO 11-22 04:49:09 weight_utils.py:243] Using model weights format ['*.safetensors'] INFO 11-22 04:49:10 weight_utils.py:243] Using model weights format ['*.safetensors'] Loading safetensors checkpoint shards: 0% Completed | 0/30 [00:00 exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in 
run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 04:51:14,882 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 04:51:14,883 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 04:51:14,883 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 04:51:14,883 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 04:51:14,883 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 04:51:14,883 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 04:51:14,883 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> 
Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 04:51:14,883 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 04:51:14,883 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 04:51:14,883 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 04:51:14,883 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported Traceback (most recent call last): File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/lib/python3.10/runpy.py", line 86, in _run_code exec(code, run_globals) File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 650, in uvloop.run(run_server(args)) File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run return loop.run_until_complete(wrapper()) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper return await main File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 616, in run_server async with build_async_engine_client(args) as engine_client: File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ return await anext(self.gen) File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 114, in build_async_engine_client async with 
build_async_engine_client_from_engine_args( File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ return await anext(self.gen) File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 211, in build_async_engine_client_from_engine_args raise RuntimeError( RuntimeError: Engine process failed to start. See stack trace for the root cause. ```

Individual cores

``` $ docker run --privileged=true -it --shm-size=4g --rm --network=host --ipc=host -v /local/apps/tools/aiModels:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=" --env "VLLM_CPU_KVCACHE_SPACE=512" --env "VLLM_CPU_OMP_THREADS_BIND=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126|1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95,97,99,101,103,105,107,109,111,113,115,117,119,121,123,125,127" vllm-cpu-env --model meta-llama/Llama-3.1-70B-Instruct -tp 2 INFO 11-22 04:57:43 api_server.py:592] vLLM API server version 0.6.4.post2.dev87+ge7a8341c INFO 11-22 04:57:43 api_server.py:593] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Llama-3.1-70B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], 
preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False) INFO 11-22 04:57:43 __init__.py:44] No plugins found. INFO 11-22 04:57:43 api_server.py:176] Multiprocessing frontend to use ipc:///tmp/b0fa7922-201b-46dc-9c37-d8654b6751e0 for IPC Path. INFO 11-22 04:57:43 api_server.py:195] Started engine process with PID 74 INFO 11-22 04:57:46 __init__.py:44] No plugins found. INFO 11-22 04:57:48 config.py:354] This model supports multiple tasks: {'generate', 'embedding'}. Defaulting to 'generate'. INFO 11-22 04:57:48 config.py:1025] Defaulting to use mp for distributed inference WARNING 11-22 04:57:48 _logger.py:72] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value. WARNING 11-22 04:57:48 _logger.py:72] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information. WARNING 11-22 04:57:48 _logger.py:72] Async output processing is only supported for CUDA, TPU, XPU and HPU.Disabling it for other platforms. WARNING 11-22 04:57:48 _logger.py:72] CUDA graph is not supported on CPU, fallback to the eager mode. INFO 11-22 04:57:51 config.py:354] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'. INFO 11-22 04:57:51 config.py:1025] Defaulting to use mp for distributed inference WARNING 11-22 04:57:51 _logger.py:72] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value. WARNING 11-22 04:57:51 _logger.py:72] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information. WARNING 11-22 04:57:51 _logger.py:72] Async output processing is only supported for CUDA, TPU, XPU and HPU.Disabling it for other platforms. WARNING 11-22 04:57:51 _logger.py:72] CUDA graph is not supported on CPU, fallback to the eager mode. 
INFO 11-22 04:57:51 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post2.dev87+ge7a8341c) with config: model='meta-llama/Llama-3.1-70B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-70B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Llama-3.1-70B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, mm_processor_kwargs=None, pooler_config=None,compilation_config=CompilationConfig(level=0, backend='', custom_ops=[], splitting_ops=['vllm.unified_flash_attention', 'vllm.unified_flash_infer', 'vllm.unified_v1_flash_attention'], use_inductor=True, inductor_specialize_for_cudagraph_no_more_than=None, inductor_compile_sizes={}, inductor_compile_config={}, inductor_passes={}, use_cudagraph=False, cudagraph_num_of_warmups=0, cudagraph_capture_sizes=None, cudagraph_copy_inputs=False, pass_config=PassConfig(dump_graph_stages=[], dump_graph_dir=PosixPath('.'), enable_fusion=True, enable_reshape=True), compile_sizes=, capture_sizes=, enabled_custom_ops=Counter(), disabled_custom_ops=Counter()) INFO 11-22 04:57:52 cpu.py:31] Cannot use _Backend.FLASH_ATTN backend on CPU. INFO 11-22 04:57:52 selector.py:141] Using Torch SDPA backend. INFO 11-22 04:57:52 importing.py:15] Triton not installed or not compatible; certain GPU-related functions will not be available. (VllmWorkerProcess pid=275) WARNING 11-22 04:57:52 _logger.py:72] `mm_limits` has already been set for model=meta-llama/Llama-3.1-70B-Instruct, and will be overwritten by the new values. 
(VllmWorkerProcess pid=275) INFO 11-22 04:57:52 multiproc_worker_utils.py:215] Worker ready; awaiting tasks (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP threads binding of Process 275: (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 275, core 1 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 404, core 3 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 405, core 5 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 406, core 7 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 407, core 9 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 408, core 11 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 409, core 13 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 410, core 15 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 411, core 17 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 412, core 19 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 413, core 21 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 414, core 23 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 415, core 25 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 416, core 27 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 417, core 29 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 418, core 31 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 419, core 33 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 420, core 35 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 421, core 37 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 422, core 39 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 423, core 41 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 424, core 43 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 425, core 45 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 426, core 47 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 427, core 49 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 428, core 51 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 429, core 53 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 430, core 55 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 431, core 57 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 432, core 59 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 433, core 61 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 434, core 63 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 435, core 65 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 436, core 67 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 437, core 69 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 438, core 71 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 439, core 73 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 
cpu_worker.py:199] OMP tid: 440, core 75 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 441, core 77 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 442, core 79 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 443, core 81 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 444, core 83 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 445, core 85 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 446, core 87 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 447, core 89 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 448, core 91 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 449, core 93 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 450, core 95 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 451, core 97 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 452, core 99 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 453, core 101 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 454, core 103 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 455, core 105 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 456, core 107 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 457, core 109 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 458, core 111 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 459, core 113 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 460, core 115 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 461, core 117 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 462, core 119 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 463, core 121 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 464, core 123 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 465, core 125 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 466, core 127 (VllmWorkerProcess pid=275) INFO 11-22 04:57:52 cpu_worker.py:199] INFO 11-22 04:57:52 cpu_worker.py:199] OMP threads binding of Process 74: INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 74, core 0 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 592, core 2 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 593, core 4 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 594, core 6 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 595, core 8 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 596, core 10 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 597, core 12 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 598, core 14 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 599, core 16 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 600, core 18 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 601, core 20 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 602, core 22 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 603, core 24 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 604, core 26 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 605, core 28 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 606, core 30 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 607, core 32 INFO 11-22 
04:57:52 cpu_worker.py:199] OMP tid: 608, core 34 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 609, core 36 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 610, core 38 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 611, core 40 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 612, core 42 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 613, core 44 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 614, core 46 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 615, core 48 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 616, core 50 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 617, core 52 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 618, core 54 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 619, core 56 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 620, core 58 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 621, core 60 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 622, core 62 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 623, core 64 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 624, core 66 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 625, core 68 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 626, core 70 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 627, core 72 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 628, core 74 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 629, core 76 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 630, core 78 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 631, core 80 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 632, core 82 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 633, core 84 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 634, core 86 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 635, core 88 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 636, core 90 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 637, core 92 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 638, core 94 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 639, core 96 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 640, core 98 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 641, core 100 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 642, core 102 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 643, core 104 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 644, core 106 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 645, core 108 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 646, core 110 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 647, core 112 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 648, core 114 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 649, core 116 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 650, core 118 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 651, core 120 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 652, core 122 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 653, core 124 INFO 11-22 04:57:52 cpu_worker.py:199] OMP tid: 654, core 126 INFO 11-22 04:57:52 cpu_worker.py:199] INFO 11-22 04:57:52 shm_broadcast.py:236] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=, local_subscribe_port=46103, remote_subscribe_port=None) INFO 11-22 04:57:53 weight_utils.py:243] Using model weights format ['*.safetensors'] (VllmWorkerProcess pid=275) INFO 11-22 04:57:53 weight_utils.py:243] Using model weights format ['*.safetensors'] Loading safetensors checkpoint shards: 0% Completed | 0/30 [00:00 exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 
184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 04:59:49,439 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 04:59:49,440 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 04:59:49,441 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 04:59:49,442 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 04:59:49,442 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 04:59:49,442 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 04:59:49,443 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not 
supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 04:59:49,443 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 04:59:49,444 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported 2024-11-22 04:59:49,444 - __init__.py - asyncio - ERROR - Task exception was never retrieved future: exception=ZMQError('Operation not supported')> Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll raise _zmq.ZMQError(_zmq.ENOTSUP) zmq.error.ZMQError: Operation not supported Traceback (most recent call last): File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/lib/python3.10/runpy.py", line 86, in _run_code exec(code, run_globals) File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 650, in uvloop.run(run_server(args)) File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run return loop.run_until_complete(wrapper()) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper return await main File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 616, in run_server async with build_async_engine_client(args) as engine_client: File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ return await anext(self.gen) File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 114, in build_async_engine_client async with build_async_engine_client_from_engine_args( File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ return await anext(self.gen) File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 211, in build_async_engine_client_from_engine_args raise RuntimeError( RuntimeError: Engine process failed to start. See stack trace for the root cause. ```

zhouyuan commented 3 days ago

@ccruttjr I tried the latest code locally but could not reproduce this issue. Could you please also run a test with a lower KV cache size? With --env "VLLM_CPU_KVCACHE_SPACE=512" on the TP=2 case, the KV cache alone will require 1024 GB of memory.
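
A minimal rerun sketch, assuming roughly 40 GiB of KV cache per rank is enough for your workload (the 40 GiB value and the range form of the core binding are only illustrative; keep whatever core list matches your topology and pick a size that fits your free RAM):

```
# Illustrative only: 40 GiB of KV cache per TP rank (~80 GiB total for -tp 2)
# instead of 512 GiB per rank (~1024 GiB total), which exceeds the host's RAM.
docker run --privileged=true -it --shm-size=4g --rm --network=host --ipc=host \
  -v /local/apps/tools/aiModels:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=" \
  --env "VLLM_CPU_KVCACHE_SPACE=40" \
  --env "VLLM_CPU_OMP_THREADS_BIND=0-63|64-127" \
  vllm-cpu-env --model meta-llama/Llama-3.1-70B-Instruct -tp 2
```

With -tp 2 this reserves about 80 GiB for the KV cache on top of the model weights, instead of the 1024 GiB requested above.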

Thanks, -yuan

ccruttjr commented 4 hours ago

Yep, I'm stupid. I decreased the KV cache space and it worked ❤️❤️❤️

zhouyuan commented 2 hours ago

> Yep, I'm stupid. I decreased the KV cache space and it worked ❤️❤️❤️

@ccruttjr no problem, glad to hear it worked 👍
Could you please close this issue since it's fixed? I will try to improve the CPU example to highlight the memory requirements under tensor parallelism.
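
Until then, a rough pre-flight check along these lines can catch the oversubscription before the container is launched; the 40 GiB default and the TP=2 value are placeholders, and the sizing rule in the comment is an assumption rather than an official vLLM formula:

```
#!/usr/bin/env bash
# Rough sizing rule (assumption, not an official vLLM formula):
#   total KV cache reservation ~= VLLM_CPU_KVCACHE_SPACE (GiB) x tensor_parallel_size
KV_GIB=${VLLM_CPU_KVCACHE_SPACE:-40}   # GiB reserved per TP rank
TP=2                                   # tensor_parallel_size passed via -tp
need_kib=$((KV_GIB * TP * 1024 * 1024))
avail_kib=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
if [ "$need_kib" -gt "$avail_kib" ]; then
  echo "Requested KV cache (${KV_GIB} GiB x ${TP} ranks) exceeds MemAvailable" >&2
  exit 1
fi
echo "OK: ${KV_GIB} GiB x ${TP} ranks fits in available memory"
```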

thanks, -yuan