Closed (beratcmn closed this issue 1 year ago)
Also, using tensor_parallel_size=2 raises an error.
Error message:
ValueError Traceback (most recent call last)
Cell In[3], line 1
----> 1 llm = LLM(model="TheBloke/wizardLM-7B-HF", download_dir="./models/", dtype="half", tensor_parallel_size=2)
File ~/repo/local-agent/.venv/lib/python3.10/site-packages/vllm/entrypoints/llm.py:55, in LLM.__init__(self, model, tensor_parallel_size, dtype, seed, **kwargs)
47 kwargs["disable_log_stats"] = True
48 engine_args = EngineArgs(
49 model=model,
50 tensor_parallel_size=tensor_parallel_size,
(...)
53 **kwargs,
54 )
---> 55 self.llm_engine = LLMEngine.from_engine_args(engine_args)
56 self.request_counter = Counter()
File ~/repo/local-agent/.venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py:145, in LLMEngine.from_engine_args(cls, engine_args)
143 distributed_init_method, devices = initialize_cluster(parallel_config)
144 # Create the LLM engine.
--> 145 engine = cls(*engine_configs, distributed_init_method, devices,
146 log_stats=not engine_args.disable_log_stats)
147 return engine
File ~/repo/local-agent/.venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py:87, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, distributed_init_method, stage_devices, log_stats)
85 worker_cls = Worker
86 if self.parallel_config.worker_use_ray:
---> 87 worker_cls = ray.remote(
88 num_cpus=0,
89 num_gpus=1,
90 resources={node_resource: 1e-5},
91 )(worker_cls).remote
93 worker = worker_cls(
94 model_config,
95 parallel_config,
(...)
98 distributed_init_method,
99 )
100 self.workers.append(worker)
File ~/repo/local-agent/.venv/lib/python3.10/site-packages/ray/_private/worker.py:2879, in _make_remote(function_or_class, options)
2871 return ray.remote_function.RemoteFunction(
2872 Language.PYTHON,
2873 function_or_class,
2874 None,
2875 options,
2876 )
2878 if inspect.isclass(function_or_class):
-> 2879 ray_option_utils.validate_actor_options(options, in_options=False)
2880 return ray.actor._make_actor(function_or_class, options)
2882 raise TypeError(
2883 "The @ray.remote decorator must be applied to either a function or a class."
2884 )
File ~/repo/local-agent/.venv/lib/python3.10/site-packages/ray/_private/ray_option_utils.py:308, in validate_actor_options(options, in_options)
303 if k not in actor_options:
304 raise ValueError(
305 f"Invalid option keyword {k} for actors. "
306 f"Valid ones are {list(actor_options)}."
307 )
--> 308 actor_options[k].validate(k, v)
310 if in_options and "concurrency_groups" in options:
311 raise ValueError(
312 "Setting 'concurrency_groups' is not supported in '.options()'."
313 )
File ~/repo/local-agent/.venv/lib/python3.10/site-packages/ray/_private/ray_option_utils.py:38, in Option.validate(self, keyword, value)
36 possible_error_message = self.value_constraint(value)
37 if possible_error_message:
---> 38 raise ValueError(possible_error_message)
ValueError: The precision of the fractional quantity of resource node:192.168.1.200 cannot go beyond 0.0001
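For context, the failing value is visible in the traceback: vLLM requests `resources={node_resource: 1e-5}`, and Ray rejects fractional resource quantities finer than 0.0001. The sketch below is a simplified, standalone illustration of that kind of precision check. It is not Ray's actual code, and the names `RESOURCE_UNIT_SCALING` and `exceeds_precision` are assumptions chosen for this example.

```python
# Hypothetical illustration of a precision check on fractional resources.
# Assumption: quantities are tracked in units of 1/10000, matching the
# 0.0001 limit quoted in the error message.
RESOURCE_UNIT_SCALING = 10_000


def exceeds_precision(quantity):
    """Return True if a nonzero float quantity is smaller than one unit (0.0001)."""
    return (
        isinstance(quantity, float)
        and quantity != 0.0
        and int(quantity * RESOURCE_UNIT_SCALING) == 0
    )


if __name__ == "__main__":
    # 1e-5 is the value from the traceback; the others are for comparison.
    for q in (1e-5, 1e-3, 1):
        print(q, "rejected" if exceeds_precision(q) else "accepted")
```

Under this assumption, 1e-5 truncates to zero units and is rejected, which is consistent with the ValueError above.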
Hi @beratcmn, thanks for reporting the bug. It was fixed in a recent PR (#193), but we haven't updated our PyPI package yet. Could you either install vLLM from source or downgrade Ray as follows?
$ pip uninstall ray
$ pip install ray==2.4.0
$ ray start --head
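After running the commands above, it can help to confirm which versions are actually installed in the active environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is an assumption for this example, and "2.4.0" is simply the Ray version suggested in the reply.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "2.4.0" is the Ray version suggested in the reply above.
    found = installed_version("ray")
    note = "matches 2.4.0" if found == "2.4.0" else "differs from 2.4.0"
    print("ray:", found, f"({note})")
    print("vllm:", installed_version("vllm"))
```

Run it inside the same virtual environment as the notebook kernel, since a different environment may report different versions.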
We have updated our PyPI package, which fixes this issue. Please upgrade and check again. Feel free to re-open this issue if you still get the error.
Sorry for the late answer, it's been a long week. I'll try to test as soon as possible. I'll reopen this issue if I get a related error. Thanks in advance.
Currently there is no way to use large models, because there is no support for 8-bit quantization and, more importantly, no support for device mapping.
The first GPU is filled, but the second GPU is left unallocated.
Here is the error message:
OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB (GPU 0; 23.70 GiB total capacity; 22.40 GiB already allocated; 247.50 MiB free; 22.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
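As a rough back-of-the-envelope check for errors like this one, the sketch below estimates how much memory the model weights alone need per GPU. The 13-billion-parameter count is a hypothetical example, since the post does not name the model; the 23.70 GiB capacity comes from the error message. The estimate ignores the KV cache, activations, and allocator overhead, and assumes tensor parallelism splits the weights evenly across GPUs.

```python
GIB = 2**30  # bytes per GiB


def weight_gib_per_gpu(num_params, bytes_per_param=2, tensor_parallel_size=1):
    """Estimate weight memory per GPU in GiB, assuming an even split.

    bytes_per_param=2 corresponds to half precision (dtype="half").
    """
    return num_params * bytes_per_param / tensor_parallel_size / GIB


if __name__ == "__main__":
    capacity_gib = 23.70  # total capacity reported in the error message
    num_params = 13e9     # hypothetical model size, not from the post
    for tp in (1, 2):
        need = weight_gib_per_gpu(num_params, tensor_parallel_size=tp)
        verdict = "fits" if need < capacity_gib else "does not fit"
        print(f"tensor_parallel_size={tp}: {need:.2f} GiB per GPU, {verdict}")
```

Because real usage is higher than the weights alone, a result close to the card's capacity should be treated as not fitting.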