vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: serve Llama-3.2-11B-Vision-Instruct with 2 A10 oom #10034

Closed jjyyds closed 3 weeks ago

jjyyds commented 3 weeks ago

Your current environment

Docker images: vllm/vllm-openai:v0.6.2 and vllm/vllm-openai:v0.6.3

Command:

docker run --runtime nvidia --gpus '"device=0,1"' -d -v /data/model/llama:/data/model/llama -p 8001:8000 vllm/vllm-openai:v0.6.2 --model /data/model/llama --max-model-len 1024 --served_model_name Llama-3.2-11B-Vision-Instruct --tensor-parallel-size 2 --gpu_memory_utilization 0.7

I tried both v0.6.2 and v0.6.3; neither works. Only about half of each GPU's memory is occupied.

nvidia-smi output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10                     On  |   00000000:00:04.0 Off |                    0 |
|  0%   47C    P0             60W /  150W |   10797MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A10                     On  |   00000000:00:08.0 Off |                    0 |
|  0%   47C    P0             61W /  150W |   10797MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A10                     On  |   00000000:00:09.0 Off |                    0 |
|  0%   55C    P0             63W /  150W |   18493MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=========================================================================================|
|    0   N/A  N/A   2750406      C   /usr/bin/python3                           10788MiB |
|    1   N/A  N/A   2750455      C   /usr/bin/python3                           10788MiB |
|    2   N/A  N/A   2738538      C   ...nda/miniconda3/envs/vllm/bin/python     18470MiB |
+-----------------------------------------------------------------------------------------+

Model Input Dumps

No response

🐛 Describe the bug

INFO 11-05 03:21:18 api_server.py:526] vLLM API server version 0.6.1.dev238+ge2c6e0a82 INFO 11-05 03:21:18 api_server.py:527] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=[''], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/data/model/llama', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=1024, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.7, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Llama-3.2-11B-Vision-Instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False) INFO 11-05 03:21:18 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/056a940a-9835-44c7-95c2-01e0c84f4ed4 for IPC Path. 
INFO 11-05 03:21:18 api_server.py:177] Started engine process with PID 29 INFO 11-05 03:21:18 config.py:899] Defaulting to use mp for distributed inference INFO 11-05 03:21:22 config.py:899] Defaulting to use mp for distributed inference INFO 11-05 03:21:22 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='/data/model/llama', speculative_config=None, tokenizer='/data/model/llama', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Llama-3.2-11B-Vision-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None) WARNING 11-05 03:21:23 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 16 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. INFO 11-05 03:21:23 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager INFO 11-05 03:21:23 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers. (VllmWorkerProcess pid=67) INFO 11-05 03:21:23 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers. (VllmWorkerProcess pid=67) INFO 11-05 03:21:23 selector.py:116] Using XFormers backend. INFO 11-05 03:21:23 selector.py:116] Using XFormers backend. /usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch. @torch.library.impl_abstract("xformers_flash::flash_fwd") (VllmWorkerProcess pid=67) /usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch. (VllmWorkerProcess pid=67) @torch.library.impl_abstract("xformers_flash::flash_fwd") /usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch. @torch.library.impl_abstract("xformers_flash::flash_bwd") (VllmWorkerProcess pid=67) /usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch. 
(VllmWorkerProcess pid=67) @torch.library.impl_abstract("xformers_flash::flash_bwd") (VllmWorkerProcess pid=67) INFO 11-05 03:21:23 multiproc_worker_utils.py:218] Worker ready; awaiting tasks INFO 11-05 03:21:24 utils.py:992] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=67) INFO 11-05 03:21:24 utils.py:992] Found nccl from library libnccl.so.2 INFO 11-05 03:21:24 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=67) INFO 11-05 03:21:24 pynccl.py:63] vLLM is using nccl==2.20.5 INFO 11-05 03:21:24 custom_all_reduce_utils.py:204] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json INFO 11-05 03:21:35 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json (VllmWorkerProcess pid=67) INFO 11-05 03:21:35 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json INFO 11-05 03:21:35 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7f3afe2f6570>, local_subscribe_port=33245, remote_subscribe_port=None) INFO 11-05 03:21:35 model_runner.py:1014] Starting to load model /data/model/llama... (VllmWorkerProcess pid=67) INFO 11-05 03:21:35 model_runner.py:1014] Starting to load model /data/model/llama... INFO 11-05 03:21:35 selector.py:116] Using XFormers backend. (VllmWorkerProcess pid=67) INFO 11-05 03:21:35 selector.py:116] Using XFormers backend. Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s] Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:01<00:04, 1.08s/it] Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:02<00:03, 1.22s/it] Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:03<00:02, 1.23s/it] Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:05<00:01, 1.28s/it] Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:05<00:00, 1.05it/s] Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:05<00:00, 1.07s/it]

INFO 11-05 03:21:41 model_runner.py:1025] Loading model weights took 10.0714 GB (VllmWorkerProcess pid=67) INFO 11-05 03:21:41 model_runner.py:1025] Loading model weights took 10.0714 GB INFO 11-05 03:21:41 enc_dec_model_runner.py:297] Starting profile run for multi-modal models. (VllmWorkerProcess pid=67) INFO 11-05 03:21:41 enc_dec_model_runner.py:297] Starting profile run for multi-modal models. (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: CUDA out of memory. Tried to allocate 1.95 GiB. GPU 1 has a total capacity of 22.09 GiB of which 758.44 MiB is free. Process 2751070 has 21.35 GiB memory in use. Of the allocated memory 20.90 GiB is allocated by PyTorch, and 85.19 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables), Traceback (most recent call last): (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_worker_utils.py", line 226, in _run_worker_process (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] output = executor(*args, kwargs) (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^^^^^ (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] return func(*args, *kwargs) (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^ (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] self.model_runner.profile_run() (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] return func(args, kwargs) (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^ (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/enc_dec_model_runner.py", line 348, in profile_run (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] self.execute_model(model_input, kv_caches, intermediate_tensors) (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] return func(*args, kwargs) (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^ (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/enc_dec_model_runner.py", line 201, in execute_model 
(VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] hidden_or_intermediate_states = model_executable( (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^ (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] return self._call_impl(*args, *kwargs) (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] return forward_call(args, kwargs) (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 1084, in forward (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] cross_attention_states = self.vision_model(pixel_values, (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] return self._call_impl(*args, kwargs) (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] return forward_call(*args, *kwargs) (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 508, in forward (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] patch_embeds = self.patch_embedding( (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^ (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] return self._call_impl(args, kwargs) (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] return forward_call(*args, kwargs) (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (VllmWorkerProcess pid=67) ERROR 11-05 
03:21:52 multiproc_worker_utils.py:233] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 229, in forward (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_workerutils.py:233] x, = self._linear(x) (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^ (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] return self._call_impl(*args, *kwargs) (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] return forward_call(args, kwargs) (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 367, in forward (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] output_parallel = self.quantmethod.apply(self, input, bias) (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 135, in apply (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] return F.linear(x, layer.weight, bias) (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.95 GiB. GPU 1 has a total capacity of 22.09 GiB of which 758.44 MiB is free. Process 2751070 has 21.35 GiB memory in use. Of the allocated memory 20.90 GiB is allocated by PyTorch, and 85.19 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) (VllmWorkerProcess pid=67) ERROR 11-05 03:21:52 multiproc_worker_utils.py:233] (VllmWorkerProcess pid=67) INFO 11-05 03:21:53 multiproc_worker_utils.py:244] Worker exiting INFO 11-05 03:21:53 multiproc_worker_utils.py:124] Killing local vLLM worker processes Process SpawnProcess-1: Traceback (most recent call last): File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap self.run() File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run self._target(*self._args, self._kwargs) File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine engine = MQLLMEngine.from_engine_args(engine_args=engine_args, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_args return cls( ^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in init self.engine = LLMEngine(args, ^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 339, in init self._initialize_kv_caches() File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 474, in _initialize_kv_caches self.model_executor.determine_num_available_blocks()) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 39, in determine_num_available_blocks num_blocks = self._run_workers("determine_num_available_blocks", ) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 185, in _run_workers driver_worker_output = driver_worker_method(args, kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context return func(*args, kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks self.model_runner.profile_run() File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context return func(*args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/worker/enc_dec_model_runner.py", line 348, in profile_run self.execute_model(model_input, kv_caches, intermediate_tensors) File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context return func(args, kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/worker/enc_dec_model_runner.py", line 201, in execute_model hidden_or_intermediate_states = model_executable( ^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 1084, in forward cross_attention_states = self.vision_model(pixel_values, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in 
_wrapped_call_impl return self._call_impl(args, kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 508, in forward patch_embeds = self.patch_embedding( ^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(args, kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/modelexecutor/models/mllama.py", line 229, in forward x, = self._linear(x) ^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 367, in forward output_parallel = self.quantmethod.apply(self, input, bias) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 135, in apply return F.linear(x, layer.weight, bias) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.95 GiB. GPU 0 has a total capacity of 22.09 GiB of which 758.44 MiB is free. Process 2751020 has 21.35 GiB memory in use. Of the allocated memory 20.90 GiB is allocated by PyTorch, and 85.19 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [rank0]:[W1105 03:21:54.083552440 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]


DarkLight1337 commented 3 weeks ago

If your GPUs are already occupied, you should increase --gpu-memory-utilization (e.g. to the default 0.9). vLLM will allocate GPU memory up to that amount (e.g. if 50% of your GPU is already used, vLLM can only use an extra 40%), so you should increase this to avoid OOM.
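For reference, this is only a sketch of the reporter's original command with that one flag raised (all other flags are taken verbatim from the report above; it is not a command confirmed by the maintainer):

docker run --runtime nvidia --gpus '"device=0,1"' -d \
  -v /data/model/llama:/data/model/llama -p 8001:8000 \
  vllm/vllm-openai:v0.6.2 \
  --model /data/model/llama --max-model-len 1024 \
  --served_model_name Llama-3.2-11B-Vision-Instruct \
  --tensor-parallel-size 2 \
  --gpu_memory_utilization 0.9   # raised from 0.7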

jjyyds commented 3 weeks ago

If your GPUs are already occupied, you should increase --gpu-memory-utilization (e.g. to the default 0.9). vLLM will allocate GPU memory up to that amount (e.g. if 50% of your GPU is already used, vLLM can only use an extra 40%), so you should increase this to avoid OOM.

I increased gpu_memory_utilization to 0.9; still the same error:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.95 GiB. GPU 0 has a total capacity of 22.09 GiB of which 758.44 MiB is free. Process 2752500 has 21.35 GiB memory in use. Of the allocated memory 20.90 GiB is allocated by PyTorch, and 85.19 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
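As an aside (an editor's sketch, not a fix suggested in this thread): the allocator setting named in the error message can be passed into the container with Docker's -e flag, for example:

# PYTORCH_CUDA_ALLOC_CONF value taken from the hint in the error message above
docker run --runtime nvidia --gpus '"device=0,1"' -d \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -v /data/model/llama:/data/model/llama -p 8001:8000 \
  vllm/vllm-openai:v0.6.2 --model /data/model/llama \
  --served_model_name Llama-3.2-11B-Vision-Instruct \
  --tensor-parallel-size 2 --gpu_memory_utilization 0.9 --max-model-len 1024

Note that expandable_segments only helps with fragmentation, not with genuinely insufficient memory.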

DarkLight1337 commented 3 weeks ago

With 2x 40% A10s, you still only have around 20 GB of memory in total. This may be insufficient for an 11B model.
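For context (back-of-the-envelope arithmetic added here, not part of the original reply): roughly 11e9 parameters at 2 bytes each in bfloat16 is about 22 GB of weights, i.e. about 11 GB per GPU with tensor_parallel_size=2, which matches the "Loading model weights took 10.0714 GB" per rank reported in the log above. So roughly half of each 22 GiB A10 is taken by weights before activations, the multi-modal profile run, and the KV cache are accounted for.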

DarkLight1337 commented 3 weeks ago

You can try reducing max_model_len and/or max_num_seqs to get the model to run (but even if this works, your throughput will be quite poor)

jjyyds commented 3 weeks ago

With 2x 40% A10s, you still only have around 20 GB of memory in total. This may be insufficient for an 11B model.

I used 2x 100% free A10s. nvidia-smi shows the Docker processes running; the OOM happened when each GPU was using only about 10 GB.

jjyyds commented 3 weeks ago

Before running:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10                     On  |   00000000:00:04.0 Off |                    0 |
|  0%   34C    P8             16W /  150W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A10                     On  |   00000000:00:08.0 Off |                    0 |
|  0%   33C    P8             15W /  150W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A10                     On  |   00000000:00:09.0 Off |                    0 |
|  0%   58C    P0             65W /  150W |   18493MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=========================================================================================|
|    2   N/A  N/A   2738538      C   ...nda/miniconda3/envs/vllm/bin/python     18470MiB |
+-----------------------------------------------------------------------------------------+

Running:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10                     On  |   00000000:00:04.0 Off |                    0 |
|  0%   39C    P0             56W /  150W |   10797MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A10                     On  |   00000000:00:08.0 Off |                    0 |
|  0%   39C    P0             60W /  150W |   10797MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A10                     On  |   00000000:00:09.0 Off |                    0 |
|  0%   54C    P0             62W /  150W |   18493MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=========================================================================================|
|    0   N/A  N/A   2753285      C   /usr/bin/python3                           10788MiB |
|    1   N/A  N/A   2753334      C   /usr/bin/python3                           10788MiB |
|    2   N/A  N/A   2738538      C   ...nda/miniconda3/envs/vllm/bin/python     18470MiB |
+-----------------------------------------------------------------------------------------+

DarkLight1337 commented 3 weeks ago

What is the command you are using now?

jjyyds commented 3 weeks ago

What is the command you are using now?

docker run --runtime nvidia --gpus '"device=0,1"' -d -v /data/model/llama:/data/model/llama -p 8001:8000 vllm/vllm-openai:v0.6.2 --model /data/model/llama --max-model-len 32 --served_model_name Llama-3.2-11B-Vision-Instruct --tensor-parallel-size 2 --gpu_memory_utilization 0.7

I tried adjusting gpu_memory_utilization and max-model-len; neither works.

DarkLight1337 commented 3 weeks ago

You should try adjusting both --max-model-len and --max-num-seqs. For example, --max-model-len 4096 --max-num-seqs 1.

jjyyds commented 3 weeks ago

You should try adjusting both --max-model-len and --max-num-seqs. For example, --max-model-len 4096 --max-num-seqs 1.

I tried this command:

docker run --runtime nvidia --gpus '"device=0,1"' -d -v /data/model/llama:/data/model/llama -p 8001:8000 vllm/vllm-openai:v0.6.2 --model /data/model/llama --served_model_name Llama-3.2-11B-Vision-Instruct --tensor-parallel-size 2 --gpu_memory_utilization 0.7 --max-model-len 4096 --max-num-seqs 1

happend new error: INFO 11-05 04:09:28 api_server.py:526] vLLM API server version 0.6.1.dev238+ge2c6e0a82 INFO 11-05 04:09:28 api_server.py:527] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/data/model/llama', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=4096, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.7, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=1, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Llama-3.2-11B-Vision-Instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False) INFO 11-05 04:09:28 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/357372b8-510b-437d-b848-a78e43041056 for IPC Path. 
INFO 11-05 04:09:28 api_server.py:177] Started engine process with PID 29 INFO 11-05 04:09:28 config.py:899] Defaulting to use mp for distributed inference INFO 11-05 04:09:32 config.py:899] Defaulting to use mp for distributed inference INFO 11-05 04:09:32 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='/data/model/llama', speculative_config=None, tokenizer='/data/model/llama', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Llama-3.2-11B-Vision-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None) WARNING 11-05 04:09:33 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 16 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. INFO 11-05 04:09:33 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager (VllmWorkerProcess pid=67) INFO 11-05 04:09:33 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers. INFO 11-05 04:09:33 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers. INFO 11-05 04:09:33 selector.py:116] Using XFormers backend. (VllmWorkerProcess pid=67) INFO 11-05 04:09:33 selector.py:116] Using XFormers backend. /usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch. @torch.library.impl_abstract("xformers_flash::flash_fwd") (VllmWorkerProcess pid=67) /usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch. (VllmWorkerProcess pid=67) @torch.library.impl_abstract("xformers_flash::flash_fwd") /usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch. @torch.library.impl_abstract("xformers_flash::flash_bwd") (VllmWorkerProcess pid=67) /usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch. 
(VllmWorkerProcess pid=67) @torch.library.impl_abstract("xformers_flash::flash_bwd") (VllmWorkerProcess pid=67) INFO 11-05 04:09:33 multiproc_worker_utils.py:218] Worker ready; awaiting tasks INFO 11-05 04:09:34 utils.py:992] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=67) INFO 11-05 04:09:34 utils.py:992] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=67) INFO 11-05 04:09:34 pynccl.py:63] vLLM is using nccl==2.20.5 INFO 11-05 04:09:34 pynccl.py:63] vLLM is using nccl==2.20.5 INFO 11-05 04:09:34 custom_all_reduce_utils.py:204] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json INFO 11-05 04:09:45 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json (VllmWorkerProcess pid=67) INFO 11-05 04:09:45 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json INFO 11-05 04:09:45 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fc561982300>, local_subscribe_port=39681, remote_subscribe_port=None) INFO 11-05 04:09:45 model_runner.py:1014] Starting to load model /data/model/llama... (VllmWorkerProcess pid=67) INFO 11-05 04:09:45 model_runner.py:1014] Starting to load model /data/model/llama... INFO 11-05 04:09:45 selector.py:116] Using XFormers backend. (VllmWorkerProcess pid=67) INFO 11-05 04:09:45 selector.py:116] Using XFormers backend. Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s] Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:00<00:03, 1.27it/s] Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:01<00:02, 1.12it/s] Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:02<00:01, 1.08it/s] Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:03<00:00, 1.05it/s] Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00, 1.44it/s] Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00, 1.27it/s]

```
INFO 11-05 04:09:49 model_runner.py:1025] Loading model weights took 10.0714 GB
(VllmWorkerProcess pid=67) INFO 11-05 04:09:49 model_runner.py:1025] Loading model weights took 10.0714 GB
INFO 11-05 04:09:49 enc_dec_model_runner.py:297] Starting profile run for multi-modal models.
(VllmWorkerProcess pid=67) INFO 11-05 04:09:49 enc_dec_model_runner.py:297] Starting profile run for multi-modal models.
(VllmWorkerProcess pid=67) INFO 11-05 04:09:51 multiproc_worker_utils.py:244] Worker exiting
Process SpawnProcess-1:
INFO 11-05 04:09:51 multiproc_worker_utils.py:124] Killing local vLLM worker processes
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_args
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
self.engine = LLMEngine(*args,
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 339, in __init__
self._initialize_kv_caches()
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 474, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 39, in determine_num_available_blocks
num_blocks = self._run_workers("determine_num_available_blocks", )
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 185, in _run_workers
driver_worker_output = driver_worker_method(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
self.model_runner.profile_run()
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/enc_dec_model_runner.py", line 348, in profile_run
self.execute_model(model_input, kv_caches, intermediate_tensors)
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/enc_dec_model_runner.py", line 201, in execute_model
hidden_or_intermediate_states = model_executable(
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 1084, in forward
cross_attention_states = self.vision_model(pixel_values,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 511, in forward
hidden_state = ps.get_tp_group().all_gather(hidden_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 407, in all_gather
torch.distributed.all_gather_into_tensor(output_tensor,
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 3200, in all_gather_into_tensor
work = group._allgather_base(output_tensor, input_tensor, opts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
Error while creating shared memory segment /dev/shm/nccl-b4Wttn (size 9637888)
[rank0]:[W1105 04:09:52.672965266 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
```

DarkLight1337 commented 3 weeks ago

Do you get the same error if you increase --gpu-memory-utilization to 0.9?

DarkLight1337 commented 3 weeks ago

Also, please format your stack trace inside code blocks (triple backticks); it is difficult to read otherwise.

jjyyds commented 3 weeks ago

> Do you get the same error if you increase --gpu-memory-utilization to 0.9?

Setting --gpu-memory-utilization to 0.9 gives the same error:

```
INFO 11-05 04:16:25 api_server.py:526] vLLM API server version 0.6.1.dev238+ge2c6e0a82
INFO 11-05 04:16:25 api_server.py:527] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/data/model/llama', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=4096, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=1, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Llama-3.2-11B-Vision-Instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 11-05 04:16:25 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/b4825432-ef07-4005-aa76-dc22f7bea699 for IPC Path.
INFO 11-05 04:16:25 api_server.py:177] Started engine process with PID 29
INFO 11-05 04:16:25 config.py:899] Defaulting to use mp for distributed inference
INFO 11-05 04:16:29 config.py:899] Defaulting to use mp for distributed inference
INFO 11-05 04:16:29 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='/data/model/llama', speculative_config=None, tokenizer='/data/model/llama', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Llama-3.2-11B-Vision-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
WARNING 11-05 04:16:30 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 16 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 11-05 04:16:30 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=67) INFO 11-05 04:16:30 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
INFO 11-05 04:16:30 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
(VllmWorkerProcess pid=67) INFO 11-05 04:16:30 selector.py:116] Using XFormers backend.
INFO 11-05 04:16:30 selector.py:116] Using XFormers backend.
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=67) /usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
(VllmWorkerProcess pid=67) @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=67) /usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
(VllmWorkerProcess pid=67) @torch.library.impl_abstract("xformers_flash::flash_bwd")
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=67) INFO 11-05 04:16:30 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
INFO 11-05 04:16:32 utils.py:992] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=67) INFO 11-05 04:16:32 utils.py:992] Found nccl from library libnccl.so.2
INFO 11-05 04:16:32 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=67) INFO 11-05 04:16:32 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 11-05 04:16:32 custom_all_reduce_utils.py:204] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 11-05 04:16:43 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=67) INFO 11-05 04:16:43 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 11-05 04:16:43 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7f2308c46330>, local_subscribe_port=33413, remote_subscribe_port=None)
INFO 11-05 04:16:43 model_runner.py:1014] Starting to load model /data/model/llama...
(VllmWorkerProcess pid=67) INFO 11-05 04:16:43 model_runner.py:1014] Starting to load model /data/model/llama...
INFO 11-05 04:16:43 selector.py:116] Using XFormers backend.
(VllmWorkerProcess pid=67) INFO 11-05 04:16:43 selector.py:116] Using XFormers backend.
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:00<00:03, 1.30it/s]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:01<00:02, 1.15it/s]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:02<00:01, 1.10it/s]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:03<00:00, 1.07it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00, 1.48it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00, 1.30it/s]
INFO 11-05 04:16:47 model_runner.py:1025] Loading model weights took 10.0714 GB
(VllmWorkerProcess pid=67) INFO 11-05 04:16:47 model_runner.py:1025] Loading model weights took 10.0714 GB
INFO 11-05 04:16:47 enc_dec_model_runner.py:297] Starting profile run for multi-modal models.
(VllmWorkerProcess pid=67) INFO 11-05 04:16:47 enc_dec_model_runner.py:297] Starting profile run for multi-modal models.
(VllmWorkerProcess pid=67) INFO 11-05 04:16:48 multiproc_worker_utils.py:244] Worker exiting
Process SpawnProcess-1:
INFO 11-05 04:16:48 multiproc_worker_utils.py:124] Killing local vLLM worker processes
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_args
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
self.engine = LLMEngine(*args,
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 339, in __init__
self._initialize_kv_caches()
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 474, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 39, in determine_num_available_blocks
num_blocks = self._run_workers("determine_num_available_blocks", )
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 185, in _run_workers
driver_worker_output = driver_worker_method(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
self.model_runner.profile_run()
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/enc_dec_model_runner.py", line 348, in profile_run
self.execute_model(model_input, kv_caches, intermediate_tensors)
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/enc_dec_model_runner.py", line 201, in execute_model
hidden_or_intermediate_states = model_executable(
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 1084, in forward
cross_attention_states = self.vision_model(pixel_values,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 511, in forward
hidden_state = ps.get_tp_group().all_gather(hidden_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 407, in all_gather
torch.distributed.all_gather_into_tensor(output_tensor,
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 3200, in all_gather_into_tensor
work = group._allgather_base(output_tensor, input_tensor, opts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
Error while creating shared memory segment /dev/shm/nccl-rimw0p (size 9637888)
[rank0]:[W1105 04:16:49.175113364 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
```

DarkLight1337 commented 3 weeks ago

Can you run python collect_env.py and show the output?

jjyyds commented 3 weeks ago

> Can you run python collect_env.py and show the output?

Output:

```
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.35

Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-78-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.5.82
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A10
GPU 1: NVIDIA A10
GPU 2: NVIDIA A10

Nvidia driver version: 555.42.06
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
CPU family: 6
Model: 106
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 16
Stepping: 6
BogoMIPS: 5187.80
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq md_clear arch_capabilities
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 384 KiB (8 instances)
L1i cache: 256 KiB (8 instances)
L2 cache: 10 MiB (8 instances)
L3 cache: 48 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] triton==3.0.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] torch 2.4.0 pypi_0 pypi
[conda] torchvision 0.19.0 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi
```

DarkLight1337 commented 3 weeks ago

What does your GPU topology look like? It should be further down the logs.
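
For reference, the interconnect topology that vLLM normally logs can also be printed directly on the host; this is a generic `nvidia-smi` invocation rather than output taken from this issue:

```bash
# Show the GPU-to-GPU interconnect matrix (NVLink, PIX, PHB, NODE, SYS, ...).
# On a KVM guest, the links between GPUs typically show up as PHB or SYS.
nvidia-smi topo -m
```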

jjyyds commented 3 weeks ago

> What does your GPU topology look like? It should be further down the logs.

This is all of the logs; I can't see the GPU topology.

Could it be because my machine is a virtual machine?

DarkLight1337 commented 3 weeks ago

@youkaichao any ideas?

youkaichao commented 3 weeks ago

The messages are quite long and I don't get the key points.

DarkLight1337 commented 3 weeks ago

The error is this:

```
INFO 11-05 04:16:47 model_runner.py:1025] Loading model weights took 10.0714 GB
(VllmWorkerProcess pid=67) INFO 11-05 04:16:47 model_runner.py:1025] Loading model weights took 10.0714 GB
INFO 11-05 04:16:47 enc_dec_model_runner.py:297] Starting profile run for multi-modal models.
(VllmWorkerProcess pid=67) INFO 11-05 04:16:47 enc_dec_model_runner.py:297] Starting profile run for multi-modal models.
(VllmWorkerProcess pid=67) INFO 11-05 04:16:48 multiproc_worker_utils.py:244] Worker exiting
Process SpawnProcess-1:
INFO 11-05 04:16:48 multiproc_worker_utils.py:124] Killing local vLLM worker processes
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_args
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in init
self.engine = LLMEngine(*args,
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 339, in init
self._initialize_kv_caches()
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 474, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 39, in determine_num_available_blocks
num_blocks = self._run_workers("determine_num_available_blocks", )
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 185, in _run_workers
driver_worker_output = driver_worker_method(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
self.model_runner.profile_run()
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/enc_dec_model_runner.py", line 348, in profile_run
self.execute_model(model_input, kv_caches, intermediate_tensors)
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/enc_dec_model_runner.py", line 201, in execute_model
hidden_or_intermediate_states = model_executable(
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 1084, in forward
cross_attention_states = self.vision_model(pixel_values,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 511, in forward
hidden_state = ps.get_tp_group().all_gather(hidden_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 407, in all_gather
torch.distributed.all_gather_into_tensor(output_tensor,
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 3200, in all_gather_into_tensor
work = group._allgather_base(output_tensor, input_tensor, opts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
Error while creating shared memory segment /dev/shm/nccl-rimw0p (size 9637888)
[rank0]:[W1105 04:16:49.175113364 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
```

youkaichao commented 3 weeks ago

Error while creating shared memory segment /dev/shm/nccl-rimw0p (size 9637888)

This is the error. If you are running Docker, it means the shared memory size is not enough; see https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html
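
A minimal sketch of the fix, reusing the image and model path from the original command; either a larger `--shm-size` or `--ipc=host` gives NCCL enough room in /dev/shm for tensor-parallel communication (the exact size chosen here is an assumption):

```bash
# Option 1: enlarge /dev/shm inside the container (Docker's default is only 64 MB).
docker run --runtime nvidia --gpus '"device=0,1"' -d \
  --shm-size=16g \
  -v /data/model/llama:/data/model/llama -p 8001:8000 \
  vllm/vllm-openai:v0.6.2 \
  --model /data/model/llama --tensor-parallel-size 2

# Option 2: share the host's IPC namespace instead of picking a fixed size.
# docker run --runtime nvidia --gpus '"device=0,1"' --ipc=host ... vllm/vllm-openai:v0.6.2 ...
```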

jjyyds commented 3 weeks ago

I used the following command and it worked:

```bash
docker run --runtime nvidia --gpus '"device=0,1"' -d \
  -v /data/model/llama:/data/model/llama -p 8001:8000 \
  -e CUDA_LAUNCH_BLOCKING=1 --shm-size=24g \
  vllm/vllm-openai:v0.6.3 \
  --model /data/model/llama --served_model_name Llama-3.2-11B-Vision-Instruct \
  --tensor-parallel-size 2 --gpu_memory_utilization 0.7 \
  --max-model-len 4096 --max-num-seqs 1 --enforce-eager
```
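
As a quick sanity check (not part of the original report), the enlarged shared memory can be verified from inside the running container; `<container-id>` is a placeholder:

```bash
# Should report the 24G passed via --shm-size instead of Docker's 64 MB default.
docker exec <container-id> df -h /dev/shm
```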

Thanks @DarkLight1337 and @youkaichao. Regards