Also happens when I just set `logprobs=1`.
Output:
WARNING 12-11 02:34:15 config.py:140] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 12-11 02:34:15 llm_engine.py:72] Initializing an LLM engine with config: model='TheBloke/Xwin-LM-13B-V0.2-AWQ', tokenizer='TheBloke/Xwin-LM-13B-V0.2-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 12-11 02:34:21 llm_engine.py:207] # GPU blocks: 2178, # CPU blocks: 327
Processed prompts: 100%|█████████████████████████████████| 1/1 [00:00<00:00, 14.32it/s]
Batch size: 1, Time taken: 0.09 seconds
Processed prompts: 100%|████████████████████████████| 101/101 [00:00<00:00, 107.95it/s]
Batch size: 101, Time taken: 0.94 seconds
Processed prompts: 100%|████████████████████████████| 201/201 [00:01<00:00, 114.76it/s]
Batch size: 201, Time taken: 1.76 seconds
Processed prompts: 100%|████████████████████████████| 301/301 [00:02<00:00, 105.77it/s]
Batch size: 301, Time taken: 2.86 seconds
Processed prompts: 100%|████████████████████████████| 401/401 [00:03<00:00, 110.28it/s]
Batch size: 401, Time taken: 3.66 seconds
Processed prompts: 100%|████████████████████████████| 501/501 [00:04<00:00, 111.58it/s]
Batch size: 501, Time taken: 4.51 seconds
Processed prompts: 100%|████████████████████████████| 601/601 [00:05<00:00, 112.03it/s]
Batch size: 601, Time taken: 5.48 seconds
Processed prompts: 100%|████████████████████████████| 701/701 [00:06<00:00, 108.99it/s]
Batch size: 701, Time taken: 6.46 seconds
Processed prompts: 100%|████████████████████████████| 801/801 [00:07<00:00, 109.55it/s]
Batch size: 801, Time taken: 7.36 seconds
Processed prompts: 100%|████████████████████████████| 901/901 [00:08<00:00, 108.90it/s]
Batch size: 901, Time taken: 8.35 seconds
WARNING 12-11 02:35:05 config.py:140] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 12-11 02:35:05 llm_engine.py:72] Initializing an LLM engine with config: model='TheBloke/Xwin-LM-13B-V0.2-AWQ', tokenizer='TheBloke/Xwin-LM-13B-V0.2-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
Traceback (most recent call last):
File "/home/conic/llm_experimentation/./experimentation.py", line 143, in <module>
llm = LLM(model=model_name_or_path, quantization="awq", dtype="auto",
File "/home/conic/.local/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 93, in __init__
self.llm_engine = LLMEngine.from_engine_args(engine_args)
File "/home/conic/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 231, in from_engine_args
engine = cls(*engine_configs,
File "/home/conic/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 110, in __init__
self._init_workers(distributed_init_method)
File "/home/conic/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 142, in _init_workers
self._run_workers(
File "/home/conic/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 700, in _run_workers
output = executor(*args, **kwargs)
File "/home/conic/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 65, in init_model
_init_distributed_environment(self.parallel_config, self.rank,
File "/home/conic/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 407, in _init_distributed_environment
initialize_model_parallel(parallel_config.tensor_parallel_size,
File "/home/conic/.local/lib/python3.10/site-packages/vllm/model_executor/parallel_utils/parallel_state.py", line 64, in initialize_model_parallel
assert _TENSOR_MODEL_PARALLEL_GROUP is None, (
AssertionError: tensor model parallel group is already initialized
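For context, a minimal sketch of the kind of script the log and traceback suggest: a batch-size sweep with `prompt_logprobs` set, followed by a second `LLM(...)` construction in the same process, which is the call that trips the assertion. The model name and constructor arguments come from the log above; the loop bounds, prompt, and `max_tokens` are assumptions.

```python
import time
from vllm import LLM, SamplingParams

model_name_or_path = "TheBloke/Xwin-LM-13B-V0.2-AWQ"
sampling_params = SamplingParams(max_tokens=16, prompt_logprobs=1)

# First engine construction succeeds and the sweep completes (batch sizes 1..901).
llm = LLM(model=model_name_or_path, quantization="awq", dtype="auto")
for batch_size in range(1, 1000, 100):  # 1, 101, 201, ..., 901
    prompts = ["Hello"] * batch_size
    start = time.time()
    llm.generate(prompts, sampling_params)
    print(f"Batch size: {batch_size}, Time taken: {time.time() - start:.2f} seconds")

# Constructing a second engine in the same process re-runs
# initialize_model_parallel() (see the traceback above) and raises
# "AssertionError: tensor model parallel group is already initialized".
llm = LLM(model=model_name_or_path, quantization="awq", dtype="auto")
```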
Also fails at batch size 901 without either `logprobs` or `prompt_logprobs` set. Is that expected behavior?
Fixed it by setting a GPU memory limit: `gpu_memory_utilization=0.6`.
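For reference, a sketch of where that flag goes, based on the constructor call shown in the traceback (`gpu_memory_utilization` is the fraction of GPU memory vLLM may reserve and defaults to 0.9; 0.6 leaves more headroom):

```python
from vllm import LLM

llm = LLM(
    model="TheBloke/Xwin-LM-13B-V0.2-AWQ",
    quantization="awq",
    dtype="auto",
    gpu_memory_utilization=0.6,  # default is 0.9
)
```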
Getting this even though `gpu_memory_utilization=0.6` is set.
@nlpkiddo-2001 Maybe try other values, such as 0.8
Thanks @tom-doerr, that works. I have another doubt: I am new to vLLM and want to load and serve the Mistral 7B model with it. Here is my brief understanding of vLLM:
LLM Engine => handles offline batching (i.e. a list of prompts).
Async LLM Engine => wraps the LLM Engine and serves async calls individually, but only through online serving (api_server.py).
Now I need to process batch calls (i.e. a list of prompts) through an API server batch request, e.g. prompts = ["Give me a haiku poem"] * 10.
From the same machine I could not send 10 requests in an async manner.
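One way to fan out requests from a single machine, sketched under the assumption that the demo `api_server.py` is running locally on port 8000 and exposes a `/generate` endpoint; the payload field names are assumptions, and the offline `LLM.generate` path remains the simpler option for pure batch jobs:

```python
import asyncio
import aiohttp

API_URL = "http://localhost:8000/generate"  # assumed address of the demo api_server.py

async def generate(session: aiohttp.ClientSession, prompt: str) -> dict:
    # Payload fields are an assumption based on typical sampling parameters.
    payload = {"prompt": prompt, "max_tokens": 64, "temperature": 0.7}
    async with session.post(API_URL, json=payload) as resp:
        return await resp.json()

async def main() -> None:
    prompts = ["Give me a haiku poem"] * 10
    async with aiohttp.ClientSession() as session:
        # Fire all 10 requests concurrently; the async engine behind the
        # server can then schedule them together via continuous batching.
        results = await asyncio.gather(*(generate(session, p) for p in prompts))
    for result in results:
        print(result)

if __name__ == "__main__":
    asyncio.run(main())
```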
@nlpkiddo-2001 You might be able to do that using Huggingface TGI
How do we solve this? I am using Google Colab.
If I set `prompt_logprobs`, I get `AssertionError: tensor model parallel group is already initialized`.
Output: