KimMinSang96 opened 2 months ago
It works fine with the online mode - you just create multiple servers (even reusing the same GPUs!), but indeed it doesn't work with the offline mode. Here is an example on an 8x H100 node:
from vllm import LLM, SamplingParams
import multiprocessing

def main():
    llm1 = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        tensor_parallel_size=8,
        gpu_memory_utilization=0.65,
    )
    llm2 = LLM(
        model="microsoft/phi-1_5",
        tensor_parallel_size=8,
        gpu_memory_utilization=0.25,
    )

    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    outputs = llm1.generate(prompts, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    outputs = llm2.generate(prompts, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

if __name__ == '__main__':
    multiprocessing.freeze_support()
    main()
and then:
VLLM_WORKER_MULTIPROC_METHOD=spawn python offline-2models-2.py
and it hangs while initializing the 2nd model:
INFO 08-01 01:10:37 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='microsoft/phi-1_5', speculative_config=None, tokenizer='microsoft/phi-1_5', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=microsoft/phi-1_5, use_v2_block_manager=False, enable_prefix_caching=False)
(VllmWorkerProcess pid=3445429) INFO 08-01 01:10:45 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3445426) INFO 08-01 01:10:45 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3445431) INFO 08-01 01:10:45 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3445427) INFO 08-01 01:10:45 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3445432) INFO 08-01 01:10:45 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3445428) INFO 08-01 01:10:45 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3445430) INFO 08-01 01:10:45 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
The problem seems to be in some internal state that is not being isolated, even if I do:
llm1 = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=8,
    gpu_memory_utilization=0.65,
)
del llm1
llm2 = LLM(
    model="microsoft/phi-1_5",
    tensor_parallel_size=8,
    gpu_memory_utilization=0.25,
)
it still hangs during the init of the 2nd model. While this del would be impractical for what we are trying to do, it demonstrates that vLLM isn't capable of handling multiple models in offline mode, which is a pity.
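As an aside, a more thorough teardown between the two instantiations could look like the sketch below. This is only an illustration of the kind of cleanup one could try, not a confirmed fix; the destroy_model_parallel import path is an assumption based on recent vLLM versions and may differ in others.

import gc

import torch
from vllm import LLM
from vllm.distributed.parallel_state import destroy_model_parallel

llm1 = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=8,
    gpu_memory_utilization=0.65,
)

# Tear down llm1 as completely as possible before creating llm2.
destroy_model_parallel()   # drop the tensor-parallel process groups
del llm1                   # release the engine reference
gc.collect()               # let Python reclaim the engine objects
torch.cuda.empty_cache()   # return cached GPU memory to the driver

llm2 = LLM(
    model="microsoft/phi-1_5",
    tensor_parallel_size=8,
    gpu_memory_utilization=0.25,
)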
@stas00 I have been debugging at least the latter case and will open a fix today. I can check whether it also works with concurrent LLMs, but I expect there may be additional isolation changes needed for that.
Thanks a lot for working on that, @njhill - that will help with disaggregation-type offline use of vLLM.
@stas00 I wonder if it's possible to create multiple servers on the same GPU if GPU memory is not an issue?
With the online setup, yes, that would work, but this is an offline recipe. Please read https://github.com/vllm-project/vllm/issues/6155#issuecomment-2261755228
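For illustration, the online setup amounts to launching two OpenAI-compatible servers that share the same GPUs; this is a sketch, and the ports and memory fractions below are just examples, not values from this thread:

# Two API servers sharing the same 8 GPUs, split by memory fraction.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --tensor-parallel-size 8 --gpu-memory-utilization 0.65 --port 8000 &
python -m vllm.entrypoints.openai.api_server \
    --model microsoft/phi-1_5 \
    --tensor-parallel-size 8 --gpu-memory-utilization 0.25 --port 8001 &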
I would like to use techniques such as the multi-instance support provided by the tensorrt-llm backend. In its documentation, I can see that multiple models are served using modes like Leader mode and Orchestrator mode. Does vLLM support this functionality separately, or should I implement it the way the tensorrt-llm backend does?
For reference: https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#leader-mode