vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Producer process has been terminated before all shared CUDA tensors released (v0.5.0.post1, v0.4.3) #6025

Open yaronr opened 4 months ago

yaronr commented 4 months ago

Your current environment

Docker image: vllm/vllm-openai:v0.4.3, as well as v0.5.0.post1

Params:

--model=microsoft/Phi-3-medium-4k-instruct 
--tensor-parallel-size=2
--disable-log-requests
--trust-remote-code
--max-model-len=2048
--gpu-memory-utilization=0.9

The container freezes (does nothing) after printing the following exception in the log.

🐛 Describe the bug

Original exception was:
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
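
For reference, roughly the same engine configuration can be expressed through vLLM's offline LLM API. The sketch below is an illustration only, using keyword arguments that mirror the CLI flags above (--disable-log-requests is server-only and has no offline equivalent); the reporter actually ran the vllm/vllm-openai server container.

```python
# Sketch only: mirrors the server CLI flags above using vLLM's offline LLM API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-medium-4k-instruct",
    tensor_parallel_size=2,       # --tensor-parallel-size=2
    trust_remote_code=True,       # --trust-remote-code
    max_model_len=2048,           # --max-model-len=2048
    gpu_memory_utilization=0.9,   # --gpu-memory-utilization=0.9
)

# A single short generation is enough to exercise the tensor-parallel workers.
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```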
youkaichao commented 4 months ago

can you follow https://docs.vllm.ai/en/latest/getting_started/debugging.html to figure out what is happening here?
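
For anyone arriving here from search: that guide recommends enabling more verbose logging (for example via the VLLM_LOGGING_LEVEL=DEBUG, CUDA_LAUNCH_BLOCKING=1, NCCL_DEBUG=TRACE and VLLM_TRACE_FUNCTION=1 environment variables) and verifying that multi-GPU communication works outside vLLM. Below is a minimal sketch of such a sanity check, assuming two GPUs and a `torchrun --nproc-per-node=2 test.py` launch; the exact script in the guide may differ.

```python
# test.py -- sketch of a multi-GPU NCCL sanity check, run with:
#   torchrun --nproc-per-node=2 test.py
# If this hangs or errors, the problem is in the environment (drivers, NCCL,
# GPU interconnect), not in vLLM itself.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

data = torch.ones(128, device=f"cuda:{local_rank}")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()

# After a SUM all-reduce of all-ones tensors, each element equals the world size.
assert data.mean().item() == dist.get_world_size()
print(f"rank {dist.get_rank()}: all_reduce OK")
```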

sayakpaul commented 2 months ago

@youkaichao I am seeing the same log message as well. What is the general recommendation to remedy it if it's a critical issue? In my case, the program runs fine despite the message.

youkaichao commented 2 months ago

In my case, the program runs fine despite the message.

then it's just a warning you can ignore.

OsaCode commented 5 days ago

Same problem. Running:

```python
model_name = "allenai/Molmo-7B-D-0924"

llm = LLM(
    model=model_name,
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=2,
)
```

Getting:

```
INFO 11-05 15:35:36 config.py:1704] Downcasting torch.float32 to torch.bfloat16.
INFO 11-05 15:35:41 config.py:944] Defaulting to use mp for distributed inference
INFO 11-05 15:35:41 llm_engine.py:242] Initializing an LLM engine (v0.6.3.post2.dev127+g2adb4409) with config: model='allenai/Molmo-7B-D-0924', speculative_config=None, tokenizer='allenai/Molmo-7B-D-0924', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=allenai/Molmo-7B-D-0924, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, chat_template_text_format=string, mm_processor_kwargs=None)
WARNING 11-05 15:35:42 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 12 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 11-05 15:35:42 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=9216) INFO 11-05 15:35:42 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 11-05 15:35:43 utils.py:976] Found nccl from library libnccl.so.2
INFO 11-05 15:35:43 pynccl.py:63] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=9216) INFO 11-05 15:35:43 utils.py:976] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=9216) INFO 11-05 15:35:43 pynccl.py:63] vLLM is using nccl==2.21.5
INFO 11-05 15:35:43 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /home/jupyter/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=9216) INFO 11-05 15:35:43 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /home/jupyter/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:336 'invalid argument'
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:336 'invalid argument'
[rank0]:[W1105 15:35:43.442228048 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[rank1]:[W1105 15:35:43.442235220 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
```

This is on an instance with 2 NVIDIA L4 GPUs. It kills my kernel.
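
The `Cuda error /workspace/csrc/custom_all_reduce.cuh:336` lines point at vLLM's custom all-reduce kernel, and the config dump above shows `disable_custom_all_reduce=False`. A hedged workaround sketch, not verified in this thread, is to disable that kernel and fall back to NCCL for the all-reduce:

```python
# Sketch only: disables vLLM's custom all-reduce kernel, which is where the
# 'Cuda error ... custom_all_reduce.cuh:336' above is raised. Whether this
# avoids the crash on 2x L4 GPUs is an assumption, not verified in the thread.
from vllm import LLM

llm = LLM(
    model="allenai/Molmo-7B-D-0924",
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=2,
    disable_custom_all_reduce=True,  # overrides the disable_custom_all_reduce=False seen in the log
)
```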

OsaCode commented 5 days ago

I upgraded my vLLM after reading https://github.com/vllm-project/vllm/issues/9774 and it fixed this issue, although it still crashes for another reason.
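
As a quick sanity check after such an upgrade, printing the installed version confirms the new build is actually the one being imported (a sketch; the specific fixed version is not stated in this thread):

```python
# Confirm which vLLM build is actually imported in the current environment.
import vllm
print(vllm.__version__)
```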