vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

No `device_map` option. #196

Closed beratcmn closed 1 year ago

beratcmn commented 1 year ago

Currently there is no way to use large models, because there is no support for 8-bit quantization and, more importantly, no support for device mapping.

As you can see, the first GPU is filled but the second GPU is left unallocated. [image]

Here is the error message:

OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB (GPU 0; 23.70 GiB total capacity; 22.40 GiB already allocated; 247.50 MiB free; 22.40 GiB reserved in total by PyTorch)
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
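Editor's note: the hint at the end of the message applies only when reserved memory is much larger than allocated memory. A minimal sketch, assuming the message format shown above, that pulls out the figures and compares them (the helper name `gib` is hypothetical and used only for illustration):

```python
import re

# The figures below are copied from the error message in this comment.
OOM_MESSAGE = (
    "Tried to allocate 270.00 MiB (GPU 0; 23.70 GiB total capacity; "
    "22.40 GiB already allocated; 247.50 MiB free; "
    "22.40 GiB reserved in total by PyTorch)"
)


def gib(label: str, message: str) -> float:
    """Return the GiB figure that precedes `label` in the message."""
    match = re.search(r"([\d.]+) GiB " + re.escape(label), message)
    if match is None:
        raise ValueError(f"no GiB figure found for {label!r}")
    return float(match.group(1))


total = gib("total capacity", OOM_MESSAGE)
allocated = gib("already allocated", OOM_MESSAGE)
reserved = gib("reserved in total", OOM_MESSAGE)

# Memory reserved by the allocator but not handed out to tensors.
slack = reserved - allocated
print(f"GPU 0: {allocated} of {total} GiB allocated, slack {slack} GiB")
```

Here reserved and allocated are the same figure, so fragmentation is probably not the cause and tuning max_split_size_mb is unlikely to help. GPU 0 is simply full while the second GPU is unused.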

beratcmn commented 1 year ago

Also, using tensor_parallel_size=2 raises an error.

Error message:


ValueError                                Traceback (most recent call last)
Cell In[3], line 1
----> 1 llm = LLM(model="TheBloke/wizardLM-7B-HF", download_dir="./models/", dtype="half", tensor_parallel_size=2)

File ~/repo/local-agent/.venv/lib/python3.10/site-packages/vllm/entrypoints/llm.py:55, in LLM.__init__(self, model, tensor_parallel_size, dtype, seed, **kwargs)
     47     kwargs["disable_log_stats"] = True
     48 engine_args = EngineArgs(
     49     model=model,
     50     tensor_parallel_size=tensor_parallel_size,
   (...)
     53     **kwargs,
     54 )
---> 55 self.llm_engine = LLMEngine.from_engine_args(engine_args)
     56 self.request_counter = Counter()

File ~/repo/local-agent/.venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py:145, in LLMEngine.from_engine_args(cls, engine_args)
    143 distributed_init_method, devices = initialize_cluster(parallel_config)
    144 # Create the LLM engine.
--> 145 engine = cls(*engine_configs, distributed_init_method, devices,
    146              log_stats=not engine_args.disable_log_stats)
    147 return engine

File ~/repo/local-agent/.venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py:87, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, distributed_init_method, stage_devices, log_stats)
     85 worker_cls = Worker
     86 if self.parallel_config.worker_use_ray:
---> 87     worker_cls = ray.remote(
     88         num_cpus=0,
     89         num_gpus=1,
     90         resources={node_resource: 1e-5},
     91     )(worker_cls).remote
     93 worker = worker_cls(
     94     model_config,
     95     parallel_config,
   (...)
     98     distributed_init_method,
     99 )
    100 self.workers.append(worker)

File ~/repo/local-agent/.venv/lib/python3.10/site-packages/ray/_private/worker.py:2879, in _make_remote(function_or_class, options)
   2871     return ray.remote_function.RemoteFunction(
   2872         Language.PYTHON,
   2873         function_or_class,
   2874         None,
   2875         options,
   2876     )
   2878 if inspect.isclass(function_or_class):
-> 2879     ray_option_utils.validate_actor_options(options, in_options=False)
   2880     return ray.actor._make_actor(function_or_class, options)
   2882 raise TypeError(
   2883     "The @ray.remote decorator must be applied to either a function or a class."
   2884 )

File ~/repo/local-agent/.venv/lib/python3.10/site-packages/ray/_private/ray_option_utils.py:308, in validate_actor_options(options, in_options)
    303     if k not in actor_options:
    304         raise ValueError(
    305             f"Invalid option keyword {k} for actors. "
    306             f"Valid ones are {list(actor_options)}."
    307         )
--> 308     actor_options[k].validate(k, v)
    310 if in_options and "concurrency_groups" in options:
    311     raise ValueError(
    312         "Setting 'concurrency_groups' is not supported in '.options()'."
    313     )

File ~/repo/local-agent/.venv/lib/python3.10/site-packages/ray/_private/ray_option_utils.py:38, in Option.validate(self, keyword, value)
     36 possible_error_message = self.value_constraint(value)
     37 if possible_error_message:
---> 38     raise ValueError(possible_error_message)

ValueError: The precision of the fractional quantity of resource node:192.168.1.200 cannot go beyond 0.0001
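Editor's note: the traceback shows the worker being created with `resources={node_resource: 1e-5}`, while the final error says fractional resource quantities cannot be finer than 0.0001. A minimal sketch of that kind of check, assuming the 0.0001 granularity stated in the error; `check_fractional_resource` is a hypothetical helper, not Ray's actual implementation:

```python
from decimal import Decimal

# Assumed granularity, taken from the error message above.
RESOURCE_PRECISION = Decimal("0.0001")


def check_fractional_resource(name: str, quantity: float) -> None:
    """Hypothetical stand-in for the validation that raised the error."""
    # Go through str() so that 1e-5 becomes exactly Decimal("0.00001").
    q = Decimal(str(quantity))
    # Accept only whole multiples of the granularity.
    if q % RESOURCE_PRECISION != 0:
        raise ValueError(
            f"The precision of the fractional quantity of resource {name} "
            f"cannot go beyond {RESOURCE_PRECISION}"
        )


check_fractional_resource("node:192.168.1.200", 0.001)  # passes silently

try:
    check_fractional_resource("node:192.168.1.200", 1e-5)
except ValueError as err:
    print(err)
```

Under this assumption, a request of 1e-5 is ten times finer than the allowed granularity and is rejected, whereas 0.001 is accepted.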

WoosukKwon commented 1 year ago

Hi @beratcmn, thanks for reporting the bug. It was fixed in a recent PR, #193, but we haven't updated our PyPI package yet. Could you either install vLLM from source or downgrade Ray as follows?

$ pip uninstall ray
$ pip install ray==2.4.0
$ ray start --head
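Editor's note: before retrying, it may help to confirm which Ray version is actually installed in the environment. A small sketch using only the standard library; the constant and function names are hypothetical, and 2.4.0 is the version named in the commands above:

```python
from importlib import metadata
from typing import Optional

# Version suggested as a workaround in the comment above.
WORKAROUND_RAY_VERSION = "2.4.0"


def installed_ray_version() -> Optional[str]:
    """Return the installed Ray version, or None if Ray is not installed."""
    try:
        return metadata.version("ray")
    except metadata.PackageNotFoundError:
        return None


def matches_workaround(version: Optional[str]) -> bool:
    """True only if the given version is exactly the suggested one."""
    return version == WORKAROUND_RAY_VERSION


current = installed_ray_version()
if current is None:
    print("Ray is not installed in this environment")
elif matches_workaround(current):
    print(f"Ray {current} matches the suggested workaround version")
else:
    print(f"Ray {current} is installed; the workaround suggests {WORKAROUND_RAY_VERSION}")
```

This only reports the installed version. It does not check whether the installed vLLM already contains the fix from #193.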

zhuohan123 commented 1 year ago

We have updated our PyPI package, which fixes this issue. Please upgrade and check again. Feel free to re-open this issue if you still get the error.

beratcmn commented 1 year ago

> We have updated our PyPI package, which fixed this issue. Please upgrade and check again. Feel free to re-open this issue if you still get the error.

Sorry for the late reply, it's been a long week. I'll test as soon as possible and reopen this issue if I get a related error. Thanks in advance.