KimMinSang96 opened 2 months ago
It works fine with the online mode - you just create multiple servers (even reusing the same GPUs!), but indeed it doesn't work with the offline mode. Here is an example on an 8x H100 node:
from vllm import LLM, SamplingParams
import multiprocessing

def main():
    llm1 = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        tensor_parallel_size=8,
        gpu_memory_utilization=0.65,
    )
    llm2 = LLM(
        model="microsoft/phi-1_5",
        tensor_parallel_size=8,
        gpu_memory_utilization=0.25,
    )

    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    outputs = llm1.generate(prompts, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    outputs = llm2.generate(prompts, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

if __name__ == '__main__':
    multiprocessing.freeze_support()
    main()
and then:
VLLM_WORKER_MULTIPROC_METHOD=spawn python offline-2models-2.py
and it hangs while initializing the 2nd model:
INFO 08-01 01:10:37 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='microsoft/phi-1_5', speculative_config=None, tokenizer='microsoft/phi-1_5', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=microsoft/phi-1_5, use_v2_block_manager=False, enable_prefix_caching=False)
(VllmWorkerProcess pid=3445429) INFO 08-01 01:10:45 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3445426) INFO 08-01 01:10:45 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3445431) INFO 08-01 01:10:45 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3445427) INFO 08-01 01:10:45 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3445432) INFO 08-01 01:10:45 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3445428) INFO 08-01 01:10:45 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3445430) INFO 08-01 01:10:45 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
The problem seems to be in some internal state that is not being isolated, even if I do:
llm1 = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=8,
    gpu_memory_utilization=0.65,
)
del llm1
llm2 = LLM(
    model="microsoft/phi-1_5",
    tensor_parallel_size=8,
    gpu_memory_utilization=0.25,
)
it still hangs during the init of the 2nd model. While this del would be impractical for what we are trying to do, it demonstrates that vLLM isn't capable of handling multiple models in offline mode, which is a pity.
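As an aside, a more thorough teardown between the two instantiations could look like the sketch below. This is only an illustration of the kind of cleanup one could try, not a confirmed fix; the destroy_model_parallel import path is an assumption based on recent vLLM versions and may differ in others.

import gc

import torch
from vllm import LLM
from vllm.distributed.parallel_state import destroy_model_parallel

llm1 = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=8,
    gpu_memory_utilization=0.65,
)

# Tear down llm1 as completely as possible before creating llm2.
destroy_model_parallel()   # drop the tensor-parallel process groups
del llm1                   # release the engine reference
gc.collect()               # let Python reclaim the engine objects
torch.cuda.empty_cache()   # return cached GPU memory to the driver

llm2 = LLM(
    model="microsoft/phi-1_5",
    tensor_parallel_size=8,
    gpu_memory_utilization=0.25,
)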
@stas00 I have been debugging at least the latter case and will open a fix today. I can check whether it also works with concurrent LLMs, but I expect there may be additional isolation changes needed for that.
Thanks a lot for working on that, @njhill - that will help with disaggregation-type offline use of vLLM.
@stas00 I wonder if it's possible to create multiple servers on the same GPU if GPU memory is not an issue?
With the online setup, yes, that would work, but this is an offline recipe. Please read https://github.com/vllm-project/vllm/issues/6155#issuecomment-2261755228
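For illustration, the online setup amounts to launching two OpenAI-compatible servers that share the same GPUs; this is a sketch, and the ports and memory fractions below are just examples, not values from this thread:

# Two API servers sharing the same 8 GPUs, split by memory fraction.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --tensor-parallel-size 8 --gpu-memory-utilization 0.65 --port 8000 &
python -m vllm.entrypoints.openai.api_server \
    --model microsoft/phi-1_5 \
    --tensor-parallel-size 8 --gpu-memory-utilization 0.25 --port 8001 &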
I would like to use techniques such as the multi-instance support provided by the tensorrt-llm backend. In its documentation, I can see that multiple models are served using modes like Leader mode and Orchestrator mode. Does vLLM support this functionality separately, or should I implement it the way the tensorrt-llm backend does?
For reference: https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#leader-mode