vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

AssertionError: tensor model parallel group is already initialized #2007

Closed · tom-doerr closed this 10 months ago

tom-doerr commented 11 months ago

If I set prompt_logprobs, I get AssertionError: tensor model parallel group is already initialized.

import time
from vllm import LLM, SamplingParams

prompts = [
    "write a 10000 word essay on the topic of ai",
]

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=int(2**0),  # i.e. generate a single token
    prompt_logprobs=1,     # the setting this issue is about
)

model_name_or_path = "TheBloke/Xwin-LM-13B-V0.2-AWQ"
llm = LLM(model=model_name_or_path, quantization="awq", dtype="auto")

# Sweep batch sizes 1, 101, 201, ... and time each generate() call until one fails.
for batch_size in range(1, 1000, 100):
    try:
        start = time.time()
        outputs = llm.generate(prompts * batch_size, sampling_params)
        print(f'Batch size: {batch_size}, Time taken: {time.time() - start:.2f} seconds')
    except Exception as e:
        print(e)
        break

Output:


WARNING 12-11 02:28:43 config.py:140] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 12-11 02:28:43 llm_engine.py:72] Initializing an LLM engine with config: model='TheBloke/Xwin-LM-13B-V0.2-AWQ', tokenizer='TheBloke/Xwin-LM-13B-V0.2-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 12-11 02:28:50 llm_engine.py:207] # GPU blocks: 2178, # CPU blocks: 327
Processed prompts: 100%|█████████████████████████████████| 1/1 [00:00<00:00, 14.36it/s]
Batch size: 1, Time taken: 0.08 seconds
Processed prompts: 100%|████████████████████████████| 101/101 [00:01<00:00, 100.92it/s]
Batch size: 101, Time taken: 1.01 seconds
Processed prompts:   0%|                                       | 0/201 [00:00<?, ?it/s]CUDA out of memory. Tried to allocate 1.63 GiB. GPU 0 has a total capacty of 39.39 GiB of which 704.88 MiB is free. Including non-PyTorch memory, this process has 38.68 GiB memory in use. Of the allocated memory 37.93 GiB is allocated by PyTorch, and 110.13 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Processed prompts:   0%|                                       | 0/201 [00:02<?, ?it/s]
WARNING 12-11 02:28:55 config.py:140] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 12-11 02:28:55 llm_engine.py:72] Initializing an LLM engine with config: model='TheBloke/Xwin-LM-13B-V0.2-AWQ', tokenizer='TheBloke/Xwin-LM-13B-V0.2-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
Traceback (most recent call last):
  File "/home/conic/llm_experimentation/./experimentation.py", line 142, in <module>
    llm = LLM(model=model_name_or_path, quantization="awq", dtype="auto",
  File "/home/conic/.local/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 93, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/home/conic/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 231, in from_engine_args
    engine = cls(*engine_configs,
  File "/home/conic/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 110, in __init__
    self._init_workers(distributed_init_method)
  File "/home/conic/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 142, in _init_workers
    self._run_workers(
  File "/home/conic/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 700, in _run_workers
    output = executor(*args, **kwargs)
  File "/home/conic/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 65, in init_model
    _init_distributed_environment(self.parallel_config, self.rank,
  File "/home/conic/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 407, in _init_distributed_environment
    initialize_model_parallel(parallel_config.tensor_parallel_size,
  File "/home/conic/.local/lib/python3.10/site-packages/vllm/model_executor/parallel_utils/parallel_state.py", line 64, in initialize_model_parallel
    assert _TENSOR_MODEL_PARALLEL_GROUP is None, (
AssertionError: tensor model parallel group is already initialized
conic@conicserver:~/llm_experimentation$ 
tom-doerr commented 11 months ago

Also happens when I just set logprobs=1. Output:


WARNING 12-11 02:34:15 config.py:140] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 12-11 02:34:15 llm_engine.py:72] Initializing an LLM engine with config: model='TheBloke/Xwin-LM-13B-V0.2-AWQ', tokenizer='TheBloke/Xwin-LM-13B-V0.2-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 12-11 02:34:21 llm_engine.py:207] # GPU blocks: 2178, # CPU blocks: 327
Processed prompts: 100%|█████████████████████████████████| 1/1 [00:00<00:00, 14.32it/s]
Batch size: 1, Time taken: 0.09 seconds
Processed prompts: 100%|████████████████████████████| 101/101 [00:00<00:00, 107.95it/s]
Batch size: 101, Time taken: 0.94 seconds
Processed prompts: 100%|████████████████████████████| 201/201 [00:01<00:00, 114.76it/s]
Batch size: 201, Time taken: 1.76 seconds
Processed prompts: 100%|████████████████████████████| 301/301 [00:02<00:00, 105.77it/s]
Batch size: 301, Time taken: 2.86 seconds
Processed prompts: 100%|████████████████████████████| 401/401 [00:03<00:00, 110.28it/s]
Batch size: 401, Time taken: 3.66 seconds
Processed prompts: 100%|████████████████████████████| 501/501 [00:04<00:00, 111.58it/s]
Batch size: 501, Time taken: 4.51 seconds
Processed prompts: 100%|████████████████████████████| 601/601 [00:05<00:00, 112.03it/s]
Batch size: 601, Time taken: 5.48 seconds
Processed prompts: 100%|████████████████████████████| 701/701 [00:06<00:00, 108.99it/s]
Batch size: 701, Time taken: 6.46 seconds
Processed prompts: 100%|████████████████████████████| 801/801 [00:07<00:00, 109.55it/s]
Batch size: 801, Time taken: 7.36 seconds
Processed prompts: 100%|████████████████████████████| 901/901 [00:08<00:00, 108.90it/s]
Batch size: 901, Time taken: 8.35 seconds
WARNING 12-11 02:35:05 config.py:140] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 12-11 02:35:05 llm_engine.py:72] Initializing an LLM engine with config: model='TheBloke/Xwin-LM-13B-V0.2-AWQ', tokenizer='TheBloke/Xwin-LM-13B-V0.2-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
Traceback (most recent call last):
  File "/home/conic/llm_experimentation/./experimentation.py", line 143, in <module>
    llm = LLM(model=model_name_or_path, quantization="awq", dtype="auto",
  File "/home/conic/.local/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 93, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/home/conic/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 231, in from_engine_args
    engine = cls(*engine_configs,
  File "/home/conic/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 110, in __init__
    self._init_workers(distributed_init_method)
  File "/home/conic/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 142, in _init_workers
    self._run_workers(
  File "/home/conic/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 700, in _run_workers
    output = executor(*args, **kwargs)
  File "/home/conic/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 65, in init_model
    _init_distributed_environment(self.parallel_config, self.rank,
  File "/home/conic/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 407, in _init_distributed_environment
    initialize_model_parallel(parallel_config.tensor_parallel_size,
  File "/home/conic/.local/lib/python3.10/site-packages/vllm/model_executor/parallel_utils/parallel_state.py", line 64, in initialize_model_parallel
    assert _TENSOR_MODEL_PARALLEL_GROUP is None, (
AssertionError: tensor model parallel group is already initialized
tom-doerr commented 11 months ago

Also fails at batch size 901 without either logprobs or prompt_logprobs set. Is that expected behavior?

tom-doerr commented 10 months ago

Fixed it by setting a GPU memory limit:

            gpu_memory_utilization=0.6,
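
For reference, a minimal sketch of how that fits into the constructor from the repro above (gpu_memory_utilization caps the fraction of GPU memory vLLM reserves for itself; the library default is 0.9, and 0.6 is simply the value that worked here):

from vllm import LLM

# Same model as in the repro; a lower gpu_memory_utilization leaves headroom
# during large batches, which is the workaround reported in this thread.
llm = LLM(
    model="TheBloke/Xwin-LM-13B-V0.2-AWQ",
    quantization="awq",
    dtype="auto",
    gpu_memory_utilization=0.6,
)
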
nlpkiddo-2001 commented 9 months ago

I'm still getting this even with gpu_memory_utilization=0.6.

tom-doerr commented 9 months ago

@nlpkiddo-2001 Maybe try other values, such as 0.8.

nlpkiddo-2001 commented 9 months ago

Thanks @tom-doerr, that works. I have another question.

I am new to vLLM and want to load and serve the Mistral 7B model with it. Here is my brief understanding of vLLM:

LLM Engine => handles offline batching (i.e. a list of prompts).
Async LLM Engine => wraps LLM Engine and serves async calls individually, but only through online serving (api_server.py).

Now I need to process batch calls (i.e. a list of prompts) through an API server batch request, e.g. prompts = ["Give me a haiku poem"] * 10. From the same machine I could not send 10 requests in an async manner.

tom-doerr commented 9 months ago

@nlpkiddo-2001 You might be able to do that using Hugging Face TGI.
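
For completeness, several requests can also be sent concurrently from one machine to vLLM's demo API server with plain asyncio. A minimal sketch, assuming python -m vllm.entrypoints.api_server is running locally on port 8000 with its /generate endpoint, and using aiohttp as the HTTP client (an extra dependency, not part of vLLM):

import asyncio
import aiohttp

API_URL = "http://localhost:8000/generate"  # assumed local demo server
prompts = ["Give me a haiku poem"] * 10

async def query(session, prompt):
    # The demo server takes the prompt plus sampling parameters in one JSON body
    # and (as of this thread's vLLM version) responds with {"text": [...]}.
    payload = {"prompt": prompt, "max_tokens": 64, "temperature": 0.8}
    async with session.post(API_URL, json=payload) as resp:
        return await resp.json()

async def main():
    async with aiohttp.ClientSession() as session:
        # Fire all 10 requests at once; the engine batches concurrent requests internally.
        results = await asyncio.gather(*(query(session, p) for p in prompts))
    for result in results:
        print(result["text"][0])

asyncio.run(main())
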

g-i-o-r-g-i-o commented 9 months ago

How do we solve this when using Google Colab?
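
In a Colab or Jupyter session this assertion usually means the engine was constructed twice in the same process, which is what the tracebacks above show (a second LLM(...) call after the first OOM). A minimal sketch of one way to avoid that, combined with the memory-limit workaround from earlier in the thread; the try/except NameError guard is just a notebook idiom, not a vLLM API:

from vllm import LLM

try:
    llm  # reuse the existing engine if this cell already ran in this runtime
except NameError:
    llm = LLM(
        model="TheBloke/Xwin-LM-13B-V0.2-AWQ",  # model from this thread; swap in your own
        quantization="awq",
        gpu_memory_utilization=0.6,  # headroom, per the workaround above
    )
# If the assertion still appears, restart the runtime so the process starts clean.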