vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Tensor Parallelism performs poorly #9373

Open DanielViglione opened 6 days ago

DanielViglione commented 6 days ago

Your current environment

This issue is easy to reproduce in AWS:

1. Spin up an EC2 instance.
2. Use the Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.3.1 (Ubuntu 20.04).
3. Select the g5.12xlarge instance type (4 A10G GPUs, each with 24 GiB of GDDR6 memory).

That's the current environment
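For anyone scripting the reproduction, here is a hedged boto3 sketch of the launch step; the AMI ID, key pair, and region are placeholders rather than values from this report:

```python
import boto3

# Launch a g5.12xlarge (4x A10G) from the Deep Learning OSS Nvidia Driver AMI.
# The AMI ID, key pair, and region below are placeholders -- look up the current
# AMI ID for "Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.3.1 (Ubuntu 20.04)".
ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",   # placeholder
    InstanceType="g5.12xlarge",
    KeyName="my-key-pair",             # placeholder
    MinCount=1,
    MaxCount=1,
)
print(resp["Instances"][0]["InstanceId"])
```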

Model Input Dumps

No response

🐛 Describe the bug

I have 4 A10Gs, each with 24GiB of GDDR6 Memory:

nvidia-smi on ip-172-31-64-123:

Tue Oct 15 12:00:52 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    On  |   00000000:00:1B.0 Off |                    0 |
|  0%   23C    P8             28W /  300W |      1MiB /  23028MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A10G                    On  |   00000000:00:1C.0 Off |                    0 |
|  0%   23C    P8             28W /  300W |      1MiB /  23028MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A10G                    On  |   00000000:00:1D.0 Off |                    0 |
|  0%   23C    P0             26W /  300W |      1MiB /  23028MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
|  0%   22C    P0             41W /  300W |      1MiB /  23028MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

That is a total of 96 GiB of GPU memory. I try to run Meta Llama 3.2 11B Vision-Instruct (only the 11B version), which should require no more than 26, maybe 27 GiB of memory. With vLLM it fails:

docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=hf_token" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model meta-llama/Llama-3.2-11B-Vision-Instruct --trust-remote-code --max-model-len 30000 --gpu-memory-utilization 0.90 --dtype float16 --disable-custom-all-reduce --tensor-parallel-size 4

Unable to find image 'vllm/vllm-openai:latest' locally
latest: Pulling from vllm/vllm-openai
3c645031de29: Pull complete
0d6448aff889: Pull complete
0a7674e3e8fe: Pull complete
b71b637b97c5: Pull complete
56dc85502937: Pull complete
380ca03515b9: Pull complete
d160b2f7d269: Pull complete
2e12f762aa31: Pull complete
634df421988e: Pull complete
9e08baaeb617: Pull complete
97b9bc234e00: Pull complete
9a4663973952: Pull complete
657b98de7a0c: Pull complete
Digest: sha256:b8374cee0a1acaec8b64525ff77560f30443f67bd0fc1956a3529504a89f823b
Status: Downloaded newer image for vllm/vllm-openai:latest
/usr/local/lib/python3.12/dist-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
INFO 10-15 03:58:10 api_server.py:528] vLLM API server version dev
INFO 10-15 03:58:10 api_server.py:529] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Llama-3.2-11B-Vision-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', config_format='auto', dtype='float16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=30000, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=True, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 10-15 03:58:10 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/afd4372a-c223-47eb-8533-c47003797121 for IPC Path.
INFO 10-15 03:58:10 api_server.py:179] Started engine process with PID 60
WARNING 10-15 03:58:10 config.py:1674] Casting torch.bfloat16 to torch.float16.
/usr/local/lib/python3.12/dist-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
WARNING 10-15 03:58:15 config.py:1674] Casting torch.bfloat16 to torch.float16.
INFO 10-15 03:58:18 config.py:887] Defaulting to use mp for distributed inference
INFO 10-15 03:58:22 config.py:887] Defaulting to use mp for distributed inference
INFO 10-15 03:58:22 llm_engine.py:237] Initializing an LLM engine (vdev) with config: model='meta-llama/Llama-3.2-11B-Vision-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.2-11B-Vision-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=30000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Llama-3.2-11B-Vision-Instruct, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
WARNING 10-15 03:58:23 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 24 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 10-15 03:58:23 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=272) INFO 10-15 03:58:23 enc_dec_model_runner.py:141] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
(VllmWorkerProcess pid=273) INFO 10-15 03:58:23 enc_dec_model_runner.py:141] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
(VllmWorkerProcess pid=272) INFO 10-15 03:58:23 selector.py:115] Using XFormers backend.
(VllmWorkerProcess pid=273) INFO 10-15 03:58:23 selector.py:115] Using XFormers backend.
INFO 10-15 03:58:23 enc_dec_model_runner.py:141] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
(VllmWorkerProcess pid=274) INFO 10-15 03:58:23 enc_dec_model_runner.py:141] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
INFO 10-15 03:58:23 selector.py:115] Using XFormers backend.
(VllmWorkerProcess pid=274) INFO 10-15 03:58:23 selector.py:115] Using XFormers backend.
(VllmWorkerProcess pid=272) /usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=272)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=273) /usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=273)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=274) /usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=274)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=272) /usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=272)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=273) /usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=273)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=274) /usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=274)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=272) INFO 10-15 03:58:26 multiproc_worker_utils.py:216] Worker ready; awaiting tasks
(VllmWorkerProcess pid=274) INFO 10-15 03:58:26 multiproc_worker_utils.py:216] Worker ready; awaiting tasks
(VllmWorkerProcess pid=273) INFO 10-15 03:58:26 multiproc_worker_utils.py:216] Worker ready; awaiting tasks
(VllmWorkerProcess pid=272) INFO 10-15 03:58:27 utils.py:1008] Found nccl from library libnccl.so.2
INFO 10-15 03:58:27 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=272) INFO 10-15 03:58:27 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 10-15 03:58:27 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=274) INFO 10-15 03:58:27 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=273) INFO 10-15 03:58:27 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=273) INFO 10-15 03:58:27 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=274) INFO 10-15 03:58:27 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 10-15 03:58:28 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7f68f2cd6060>, local_subscribe_port=33711, remote_subscribe_port=None)
INFO 10-15 03:58:28 model_runner.py:1060] Starting to load model meta-llama/Llama-3.2-11B-Vision-Instruct...
(VllmWorkerProcess pid=272) INFO 10-15 03:58:28 model_runner.py:1060] Starting to load model meta-llama/Llama-3.2-11B-Vision-Instruct...
(VllmWorkerProcess pid=274) INFO 10-15 03:58:28 model_runner.py:1060] Starting to load model meta-llama/Llama-3.2-11B-Vision-Instruct...
(VllmWorkerProcess pid=273) INFO 10-15 03:58:28 model_runner.py:1060] Starting to load model meta-llama/Llama-3.2-11B-Vision-Instruct...
(VllmWorkerProcess pid=272) INFO 10-15 03:58:28 selector.py:115] Using XFormers backend.
(VllmWorkerProcess pid=274) INFO 10-15 03:58:28 selector.py:115] Using XFormers backend.
INFO 10-15 03:58:28 selector.py:115] Using XFormers backend.
(VllmWorkerProcess pid=273) INFO 10-15 03:58:28 selector.py:115] Using XFormers backend.
(VllmWorkerProcess pid=272) INFO 10-15 03:58:28 weight_utils.py:243] Using model weights format ['*.safetensors']
INFO 10-15 03:58:28 weight_utils.py:243] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=273) INFO 10-15 03:58:28 weight_utils.py:243] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=274) INFO 10-15 03:58:28 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:02,  1.34it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:02,  1.03it/s]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:02<00:01,  1.33it/s]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:03<00:00,  1.35it/s]
(VllmWorkerProcess pid=272) INFO 10-15 03:59:00 model_runner.py:1071] Loading model weights took 5.1560 GB
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00,  1.28it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00,  1.27it/s]

INFO 10-15 03:59:01 model_runner.py:1071] Loading model weights took 5.1560 GB
(VllmWorkerProcess pid=273) INFO 10-15 03:59:01 model_runner.py:1071] Loading model weights took 5.1560 GB
(VllmWorkerProcess pid=274) INFO 10-15 03:59:01 model_runner.py:1071] Loading model weights took 5.1560 GB
INFO 10-15 03:59:01 enc_dec_model_runner.py:301] Starting profile run for multi-modal models.
(VllmWorkerProcess pid=273) INFO 10-15 03:59:01 enc_dec_model_runner.py:301] Starting profile run for multi-modal models.
(VllmWorkerProcess pid=274) INFO 10-15 03:59:01 enc_dec_model_runner.py:301] Starting profile run for multi-modal models.
(VllmWorkerProcess pid=272) INFO 10-15 03:59:01 enc_dec_model_runner.py:301] Starting profile run for multi-modal models.
(VllmWorkerProcess pid=274) INFO 10-15 03:59:14 multiproc_worker_utils.py:242] Worker exiting
(VllmWorkerProcess pid=272) INFO 10-15 03:59:14 multiproc_worker_utils.py:242] Worker exiting
(VllmWorkerProcess pid=273) INFO 10-15 03:59:14 multiproc_worker_utils.py:242] Worker exiting
INFO 10-15 03:59:15 multiproc_worker_utils.py:121] Killing local vLLM worker processes
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 392, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 141, in from_engine_args
    return cls(
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
    self.engine = LLMEngine(*args,
                  ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 349, in __init__
    self._initialize_kv_caches()
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 484, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 39, in determine_num_available_blocks
    num_blocks = self._run_workers("determine_num_available_blocks", )
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
    driver_worker_output = driver_worker_method(*args, **kwargs)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/enc_dec_model_runner.py", line 359, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/enc_dec_model_runner.py", line 203, in execute_model
    hidden_or_intermediate_states = model_executable(
                                    ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 1084, in forward
    cross_attention_states = self.vision_model(pixel_values,
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 529, in forward
    hidden_state = self.gated_positional_embedding(hidden_state,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 291, in forward
    tile_position_embedding = self.tile_embedding(aspect_ratio_ids)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/sparse.py", line 164, in forward
    return F.embedding(
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/functional.py", line 2267, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.91 GiB. GPU 0 has a total capacity of 21.98 GiB of which 338.44 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 21.18 GiB is allocated by PyTorch, and 42.94 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 585, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 552, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start
/usr/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

It works to a degree, in the sense that the load is indeed distributed across the 4 GPUs. But after some time, 3 of them are no longer used and one of them spikes to 100 percent until the container crashes. I would expect --tensor-parallel-size to handle this.
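For reference, a rough back-of-the-envelope estimate of where the memory should go under tensor parallelism (a sketch only; the ~11B parameter count is approximate, and the 5.156 GB per-rank weight figure and the 21.98 GiB capacity come from the log above):

```python
# Rough estimate of per-GPU memory under tensor parallelism (approximate, not vLLM's accounting).
params_total    = 10.7e9         # ~10.7B parameters for Llama-3.2-11B-Vision-Instruct (approximate)
bytes_per_param = 2              # float16
tp_size         = 4

weights_total_gib    = params_total * bytes_per_param / 2**30
weights_per_rank_gib = weights_total_gib / tp_size    # ideal case, if every layer were sharded

gpu_gib    = 21.98               # usable capacity reported by the OOM message
budget_gib = gpu_gib * 0.90      # --gpu-memory-utilization 0.90

print(f"total fp16 weights      ~{weights_total_gib:.1f} GiB")
print(f"ideal per-rank weights  ~{weights_per_rank_gib:.1f} GiB (log reports 5.156 GB per rank)")
print(f"per-GPU budget          ~{budget_gib:.1f} GiB")
print(f"headroom for KV cache + activations ~{budget_gib - weights_per_rank_gib:.1f} GiB per GPU")
# The OOM above happens during the multi-modal profile run, with ~21 GiB already allocated
# on GPU 0 -- far more than the sharded weights alone would explain.
```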


simon-mo commented 5 days ago

This might be because the encoder model is not tensor-parallel sharded. @DarkLight1337 @ywang96 would you agree?

DarkLight1337 commented 5 days ago

> This might be because the encoder model is not tensor-parallel sharded. @DarkLight1337 @ywang96 would you agree?

From my understanding, the modules in the encoder already have parallelizable layers such as *ParallelLinear, just like the other vision encoders.
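As a reference for what "parallelizable" means here, a minimal single-process sketch of the column-/row-parallel pattern that layers like ColumnParallelLinear and RowParallelLinear implement (plain PyTorch, with tensor slices standing in for the per-rank shards; not vLLM's actual classes):

```python
import torch

torch.manual_seed(0)
tp = 4                        # simulated tensor-parallel world size
x  = torch.randn(8, 64)       # [batch, hidden]
w1 = torch.randn(256, 64)     # up-projection weight   [out, in]
w2 = torch.randn(64, 256)     # down-projection weight [out, in]

# Column-parallel: shard w1 along its output dim; each "rank" computes a slice of the output.
col_shards   = w1.chunk(tp, dim=0)
partial_outs = [x @ w.T for w in col_shards]                          # each: [8, 256 // tp]

# Row-parallel: shard w2 along its input dim; partial results are summed (the all-reduce).
row_shards   = w2.chunk(tp, dim=1)
partial_sums = [h @ w.T for h, w in zip(partial_outs, row_shards)]    # each: [8, 64]
y_tp = sum(partial_sums)                                              # stands in for the all-reduce

# Reference: the unsharded computation gives the same result (up to float error).
y_ref = (x @ w1.T) @ w2.T
print(torch.allclose(y_tp, y_ref, atol=1e-3))
```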

DarkLight1337 commented 5 days ago

That being said, I still see some individual layers not being parallelized, such as the embedding layers and multi-modal projector inside MllamaForConditionalGeneration. Not sure whether it's worth parallelizing them though. cc @heheda12345

heheda12345 commented 5 days ago

Yes, the image encoder is not fully sharded. The logic of this function is quite complex, so I only implemented TP on the standard transformer layers. Help with providing full TP support for the image encoder is very welcome! I'm not sure whether TP for the multi-modal projector will be helpful, because the full output tensor needs to be on all GPUs before the attention execution, but it's still worth a try if you want.
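To illustrate that constraint, here is a single-process sketch (not vLLM code, and the tensor sizes are illustrative): if the multi-modal projector were column-parallel, each rank would hold only a slice of the projected vision features, and a gather would be needed to reassemble the full cross-attention states before attention runs, which limits how much a sharded projector can save.

```python
import torch

torch.manual_seed(0)
tp = 4
vision_feats = torch.randn(1, 1601, 1280)       # [batch, vision tokens, vision hidden] (illustrative)
proj_w       = torch.randn(4096, 1280)          # projector weight [text hidden, vision hidden]

# Column-parallel projector: each "rank" produces a slice of the text-hidden dimension.
shards  = proj_w.chunk(tp, dim=0)
partial = [vision_feats @ w.T for w in shards]  # each: [1, 1601, 4096 // tp]

# Before cross-attention, every rank needs the *full* cross-attention states,
# so the slices have to be gathered (an all-gather in the real distributed setting).
cross_attn_states = torch.cat(partial, dim=-1)  # [1, 1601, 4096]

# Reference: the unsharded projector produces the same tensor.
print(torch.allclose(cross_attn_states, vision_feats @ proj_w.T, atol=1e-3))
```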