vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: vLLM does not support virtual GPU #5328

Open youkaichao opened 5 months ago

youkaichao commented 5 months ago

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

Error reported in https://github.com/vllm-project/vllm/issues/4587 .

We need to avoid initializing NCCL when the world size is 1 (see the sketch below).
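A minimal sketch of the intended guard, assuming a plain `torch.distributed` setup; the function name and signature are hypothetical and not vLLM's internal API:

```python
# Illustrative only, not vLLM's actual code path: initialize the NCCL
# process group only when more than one rank is involved, so a single-GPU
# (or vGPU) setup never touches NCCL at all.
import torch
import torch.distributed as dist

def maybe_init_distributed(world_size: int, rank: int = 0) -> None:
    """Initialize the process group only when there is more than one rank."""
    if world_size == 1:
        # Nothing to synchronize across ranks; skipping NCCL avoids failures
        # on devices where NCCL operations are unsupported.
        return
    # Requires MASTER_ADDR / MASTER_PORT to be set when using the default
    # env:// rendezvous.
    dist.init_process_group(
        backend="nccl" if torch.cuda.is_available() else "gloo",
        world_size=world_size,
        rank=rank,
    )
```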

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

YFrendo commented 3 days ago

Hi!

There is still an issue with vGPU; the error does not occur on a bare-metal GPU with the same driver and CUDA version:

NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4

INFO 11-14 10:50:38 api_server.py:526] vLLM API server version 0.6.1.dev238+ge2c6e0a82
INFO 11-14 10:50:38 api_server.py:527] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/mnt/model_prod', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8096, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 11-14 10:50:38 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/df6bd7b4-e7e4-44a9-910e-e74b7a204f88 for IPC Path.
INFO 11-14 10:50:38 api_server.py:177] Started engine process with PID 31
INFO 11-14 10:50:38 config.py:1652] Downcasting torch.float32 to torch.float16.
INFO 11-14 10:50:40 config.py:1652] Downcasting torch.float32 to torch.float16.
INFO 11-14 10:50:40 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='/mnt/model_prod', speculative_config=None, tokenizer='/mnt/model_prod', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/mnt/model_prod, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
[W1114 10:50:41.117341070 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
INFO 11-14 10:50:41 model_runner.py:1014] Starting to load model /mnt/model_prod...
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/[process.py](http://process.py/)", line 314, in _bootstrap
[self.run](http://self.run/)()
File "/usr/lib/python3.12/multiprocessing/[process.py](http://process.py/)", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/[engine.py](http://engine.py/)", line 388, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/[engine.py](http://engine.py/)", line 138, in from_engine_args
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/[engine.py](http://engine.py/)", line 78, in init
self.engine = LLMEngine(*args,
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_[engine.py](http://engine.py/)", line 325, in init
self.model_executor = executor_class(
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_[base.py](http://base.py/)", line 47, in init
self._init_executor()
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_[executor.py](http://executor.py/)", line 40, in _init_executor
self.driver_worker.load_model()
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/[worker.py](http://worker.py/)", line 183, in load_model
self.model_runner.load_model()
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_[runner.py](http://runner.py/)", line 1016, in load_model
self.model = get_model(model_config=self.model_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/init.py", line 19, in get_model
return loader.load_model(model_config=model_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/[loader.py](http://loader.py/)", line 399, in load_model
model = _initialize_model(model_config, self.load_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/[loader.py](http://loader.py/)", line 176, in _initialize_model
return build_model(
^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/[loader.py](http://loader.py/)", line 161, in build_model
return model_class(config=hf_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/[llama.py](http://llama.py/)", line 410, in init
self.model = LlamaModel(config,
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/[llama.py](http://llama.py/)", line 284, in init
self.embed_tokens = VocabParallelEmbedding(
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/vocab_parallel_[embedding.py](http://embedding.py/)", line 260, in init
self.linear_method.create_weights(self,
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/vocab_parallel_[embedding.py](http://embedding.py/)", line 28, in create_weights
weight = Parameter(torch.empty(sum(output_partition_sizes),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_[device.py](http://device.py/)", line 79, in torch_function
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA driver error: operation not supported
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_[server.py](http://server.py/)", line 571, in <module>
[uvloop.run](http://uvloop.run/)(run_server(args))
File "/usr/local/lib/python3.12/dist-packages/uvloop/init.py", line 109, in run
return __[asyncio.run](http://asyncio.run/)(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/[runners.py](http://runners.py/)", line 194, in run
return [runner.run](http://runner.run/)(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/[runners.py](http://runners.py/)", line 118, in run
return self._[loop.run](http://loop.run/)_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1517, in [uvloop.loop.Loop.run](http://uvloop.loop.loop.run/)_until_complete
File "/usr/local/lib/python3.12/dist-packages/uvloop/init.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_[server.py](http://server.py/)", line 538, in run_server
async with build_async_engine_client(args) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/[contextlib.py](http://contextlib.py/)", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_[server.py](http://server.py/)", line 105, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/[contextlib.py](http://contextlib.py/)", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_[server.py](http://server.py/)", line 192, in build_async_engine_client_from_engine_args
raise RuntimeError(
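The failure surfaces in a plain `torch.empty(...)` call inside the model loader, so one way to narrow it down is to check whether a basic PyTorch allocation succeeds on the vGPU outside of vLLM. A minimal check, assuming the same container and CUDA environment as above:

```python
# Minimal check, independent of vLLM: if this also raises
# "CUDA driver error: operation not supported" on the vGPU,
# the problem sits in the driver/PyTorch layer rather than in vLLM itself.
import torch

print(torch.cuda.is_available(), torch.cuda.get_device_name(0))
x = torch.empty(1024, 1024, dtype=torch.float16, device="cuda")
print(x.shape, x.device)
```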