vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Unable to run any model with tensor_parallel_size>1 on AWS SageMaker notebooks #2084

Open samarthsarin opened 8 months ago

samarthsarin commented 8 months ago

I am running my code on AWS SageMaker notebooks on a machine with 4 GPUs. Whenever I set tensor_parallel_size>1, I get the following error.

INFO 12-13 13:07:31 llm_engine.py:72] Initializing an LLM engine with config: model='mistralai/Mistral-7B-Instruct-v0.1', tokenizer='mistralai/Mistral-7B-Instruct-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=4, quantization=None, seed=0)
(RayWorker pid=14391) pytorch-2-0-1-gpu-ml-g4dn-12xlarge-a1771cff5b2706f02b86883798ff:14391:14391 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^lo,docker,veth
(RayWorker pid=14391)
(RayWorker pid=14391) pytorch-2-0-1-gpu-ml-g4dn-12xlarge-a1771cff5b2706f02b86883798ff:14391:14391 [0] bootstrap.cc:45 NCCL WARN Bootstrap : no socket interface found
(RayWorker pid=14391) pytorch-2-0-1-gpu-ml-g4dn-12xlarge-a1771cff5b2706f02b86883798ff:14391:14391 [0] NCCL INFO init.cc:82 -> 3
(RayWorker pid=14391) pytorch-2-0-1-gpu-ml-g4dn-12xlarge-a1771cff5b2706f02b86883798ff:14391:14391 [0] NCCL INFO init.cc:101 -> 3

RayTaskError(DistBackendError)            Traceback (most recent call last)
Cell In[1], line 3
      1 from vllm import LLM
      2 import torch
----> 3 llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1",dtype=torch.float16,tensor_parallel_size=4)

File /opt/conda/lib/python3.10/site-packages/vllm/entrypoints/llm.py:93, in LLM.__init__(self, model, tokenizer, tokenizer_mode, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, **kwargs)
     77 kwargs["disable_log_stats"] = True
     78 engine_args = EngineArgs(
     79     model=model,
     80     tokenizer=tokenizer,
   (...)
     91     **kwargs,
     92 )
---> 93 self.llm_engine = LLMEngine.from_engine_args(engine_args)
     94 self.request_counter = Counter()

File /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py:231, in LLMEngine.from_engine_args(cls, engine_args)
    228 distributed_init_method, placement_group = initialize_cluster(
    229     parallel_config)
    230 # Create the LLM engine.
--> 231 engine = cls(*engine_configs,
    232              distributed_init_method,
    233              placement_group,
    234              log_stats=not engine_args.disable_log_stats)
    235 return engine

File /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py:108, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, distributed_init_method, placement_group, log_stats)
    106 # Create the parallel GPU workers.
    107 if self.parallel_config.worker_use_ray:
--> 108     self._init_workers_ray(placement_group)
    109 else:
    110     self._init_workers(distributed_init_method)

File /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py:181, in LLMEngine._init_workers_ray(self, placement_group, **ray_remote_kwargs)
    171 scheduler_config = copy.deepcopy(self.scheduler_config)
    172 self._run_workers("init_worker",
    173                   get_all_outputs=True,
    174                   worker_init_fn=lambda: Worker(
   (...)
    179                       None,
    180                   ))
--> 181 self._run_workers(
    182     "init_model",
    183     get_all_outputs=True,
    184 )

File /opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py:704, in LLMEngine._run_workers(self, method, get_all_outputs, *args, **kwargs)
    701     all_outputs.append(output)
    703 if self.parallel_config.worker_use_ray:
--> 704     all_outputs = ray.get(all_outputs)
    706 if get_all_outputs:
    707     return all_outputs

File /opt/conda/lib/python3.10/site-packages/ray/_private/auto_init_hook.py:24, in wrap_auto_init.<locals>.auto_init_wrapper(*args, **kwargs)
     21 @wraps(fn)
     22 def auto_init_wrapper(*args, **kwargs):
     23     auto_init_ray()
---> 24     return fn(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:103, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
    101 if func.__name__ != "init" or is_client_mode_enabled_by_default:
    102     return getattr(ray, func.__name__)(*args, **kwargs)
--> 103 return func(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/ray/_private/worker.py:2563, in get(object_refs, timeout)
   2561     worker.core_worker.dump_object_store_memory_usage()
   2562 if isinstance(value, RayTaskError):
-> 2563     raise value.as_instanceof_cause()
   2564 else:
   2565     raise value

RayTaskError(DistBackendError): ray::RayWorker.execute_method() (pid=14391, ip=169.255.254.1, actor_id=ca4a3293072d7aac67ca68fc01000000, repr=<vllm.engine.ray_utils.RayWorker object at 0x7fb24e75d4b0>)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/ray_utils.py", line 32, in execute_method
    return executor(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 65, in init_model
    _init_distributed_environment(self.parallel_config, self.rank,
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 406, in _init_distributed_environment
    torch.distributed.all_reduce(torch.zeros(1).cuda())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
    work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1249, internal error - please report this issue to the NCCL developers, NCCL version 2.18.5
ncclInternalError: Internal check failed. Last error: Bootstrap : no socket interface found
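
Note that the NCCL log above shows NCCL_SOCKET_IFNAME is set to ^lo,docker,veth in the SageMaker environment, and NCCL then fails with "no socket interface found". A possible workaround is to point NCCL at the instance's actual network interface before the engine and its Ray workers start. This is only a sketch and is not verified here; "eth0" is an assumption, so check the real interface name with ip addr first:

import os

# Sketch of a possible workaround: tell NCCL which network interface to use.
# "eth0" is an assumption -- replace it with the interface reported by `ip addr`
# on the SageMaker instance. It must be set before vLLM/Ray spawn the workers.
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"

import torch
from vllm import LLM

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    dtype=torch.float16,
    tensor_parallel_size=4,
)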

YihengLiu1996 commented 8 months ago

same problem

nooodles2023 commented 7 months ago

same problem

ashwinkumarm commented 5 months ago

@samarthsarin @YihengLiu1996 @nooodles2023 Were you able to fix this?

samarthsarin commented 5 months ago

Yes, with some recent changes in the vllm library, installing the latest version and setting tensor_parallel_size=4 solved this for me, and it is working fine now.
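
For reference, roughly what I'm running now (a sketch; same model and 4-GPU setup as in the original report, after upgrading vllm):

import torch
from vllm import LLM

# Sketch: the call from the original report, after `pip install --upgrade vllm`.
# One shard per GPU on the 4-GPU instance.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    dtype=torch.float16,
    tensor_parallel_size=4,
)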

ashwinkumarm commented 5 months ago

@samarthsarin I'm facing the same issue. Can you please share your vllm, torch, xformers, and CUDA versions? I'm trying vllm 0.3.3 (latest) with CUDA 12.2 on a g5.12xlarge AWS instance.
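
In case it helps, a small snippet like this prints the versions in question (a sketch; xformers may not be installed in every environment):

# Sketch: report installed vllm / torch / xformers versions and the CUDA
# version torch was built against.
from importlib.metadata import PackageNotFoundError, version

import torch

for pkg in ("vllm", "torch", "xformers"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")

print("CUDA (torch build):", torch.version.cuda)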

youkaichao commented 5 months ago

@ashwinkumarm can you try installing from source with the latest main branch? We recently made some improvements for tp>1, but they are not in v0.3.3 yet.

RomanKoshkin commented 3 months ago

@youkaichao Could you please share a minimal working example for offline inference with tensor-parallel-size > 1?

youkaichao commented 3 months ago

@RomanKoshkin With the code from the latest main branch, the following works for me:

from vllm import LLM, SamplingParams

prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0)

# Shard the model across two GPUs.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
outputs = llm.generate(prompts, sampling_params)

# Print the generated text for each prompt.
for output in outputs:
    generated_text = output.outputs[0].text
    print(generated_text)

RomanKoshkin commented 3 months ago

@youkaichao let me try and see if it works for me. By the way, can you check whether llama3-8b works? And what hardware / CUDA are you using?