vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

lots of blank before each running step #3030

Open Eutenacity opened 7 months ago

Eutenacity commented 7 months ago

I used torch.profiler.profile() to profile Mixtral running on vLLM, and I found lots of blank time before each running step.

[screenshot: profiler trace showing the gaps before each step]

When I compare the time cost of vLLM with that of TensorRT-LLM, I find that TensorRT-LLM is 1.5x faster than vLLM. But when comparing the time cost of each component, including attention, the experts, and all-reduce, vLLM and TensorRT-LLM perform nearly the same.

So I suppose that the blank before each running step in vLLM causes the slower performance, but I can find nothing that explains why the blank occurs.

Can you give me some help?

Here is the code I used to profile Mixtral:

from vllm import LLM, SamplingParams
import argparse
import evaluate
from datasets import load_dataset
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--dataset_path', type=str, default='')
    parser.add_argument(
        '--eval_task',
        type=str,
        default='code_completion',
        choices=['summarize', 'summarize_long', 'code_completion'])
    parser.add_argument('--batch_size', type=int, default=1)
    parser.add_argument('--max_ite', type=int, default=20)
    args = parser.parse_args()

    if args.eval_task == 'code_completion':
        dataset_name = "openai_humaneval"
        dataset_revision = None
        dataset_input_key = 'prompt'
        dataset_output_key = 'canonical_solution'
        dataset_split = 'test'
    elif args.eval_task == 'summarize':
        dataset_name = "ccdv/cnn_dailymail"
        dataset_revision = "3.0.0"
        dataset_input_key = 'article'
        dataset_output_key = 'highlights'
        dataset_split = 'test'
    elif args.eval_task == 'summarize_long':
        dataset_name = "tau/zero_scrolls"
        dataset_revision = 'squality'
        dataset_input_key = 'input'
        dataset_output_key = 'output'
        dataset_split = 'validation'  # only this split contains reference strings
    dataset = load_dataset(dataset_name,
                            dataset_revision,
                            cache_dir=args.dataset_path,
                            split=dataset_split)

    # Create a sampling params object.
    sampling_params = SamplingParams(max_tokens=100,top_k=1,top_p=1e-5,temperature=1)

    # Create an LLM.
    llm = LLM(model="/home/.cache/huggingface/hub/models--mistralai--Mixtral-8x7B-v0.1/snapshots/985aa055896a8f943d4a9f2572e6ea1341823841", tensor_parallel_size=8)

    # Generate texts from the prompts. The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    ite_count = 0
    data_point_idx = 0
    max_batch_size=args.batch_size

    print(len(dataset))
    metric_vllm=evaluate.load("rouge")
    import torch
    def decorate_trace_handler(rank):
        def trace_handler(prof):
            if rank in [0]:
                prof.export_chrome_trace("test"+str(rank)+".json")
        return trace_handler

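    # Profiler schedule: skip the first 5 steps, warm up for the next 5, then record 2 active
    # steps; only those 2 active iterations end up in the exported chrome trace (test0.json).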
    prof = torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        record_shapes=True,
        with_stack=False,
        schedule=torch.profiler.schedule(
            wait=5,
            warmup=5,
            active=2),
        on_trace_ready=decorate_trace_handler(0)
    )

    with prof:
        while (data_point_idx < len(dataset)) and (ite_count < args.max_ite):

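            # Note: dataset[0:1] always takes the first entry, so every profiled iteration runs the same prompt.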
            datapoint=dataset[0:1]
            batch_input_texts=datapoint[dataset_input_key]

            batch_size=len(batch_input_texts)
            append_str =' TL;DR: ' if args.eval_task == 'summarize' else ''
            prompts=[]
            for i in range(batch_size):
                curr_text = batch_input_texts[i] + append_str
                curr_text = curr_text.strip().replace(" n't", "n't")
                prompts.append(curr_text)

            outputs = llm.generate(prompts,sampling_params,use_tqdm=False)

            # print(prompts)
            for i in range(batch_size):
                metric_vllm.add_batch(
                                    predictions=[
                                        outputs[i].outputs[0].text
                                    ],
                                    references=[
                                        datapoint[dataset_output_key][i]
                                    ])
            data_point_idx += max_batch_size
            if ite_count==0:
                print(f"Prompt: {outputs[0].prompt!r}\n Generated text: {outputs[0].outputs[0].text!r}")

            ite_count += 1
            prof.step()

    computed_metrics_vllm = metric_vllm.compute()
    for key in computed_metrics_vllm.keys():
        print(f'  {key} : {computed_metrics_vllm[key]*100}')

Eutenacity commented 7 months ago

When I dig deeper into vLLM, I find that the blank mainly comes from the Ray worker in vllm/engine/llm_engine.py.
So I want to know: what is the Ray worker used for?

simon-mo commented 7 months ago

The Ray worker is used to host workers on different GPUs for tensor-parallel inference. The gap you are observing is probably kernel launch overhead. Which version are you using here? Newer vLLM versions have enabled CUDA graph capture, which reduces this overhead.
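
On a recent release, CUDA graph capture should already be active unless it was explicitly disabled. A minimal sketch, assuming the enforce_eager flag exposed by vllm.LLM in recent versions (the model path below is only a placeholder), of checking the installed version and keeping CUDA graphs enabled:

import vllm
from vllm import LLM

print(vllm.__version__)  # confirm which release is actually installed

llm = LLM(
    model="mistralai/Mixtral-8x7B-v0.1",  # placeholder; use your local snapshot path
    tensor_parallel_size=8,
    enforce_eager=False,  # False (the default) keeps CUDA graph capture on; True forces eager mode and brings back the per-step launch gaps
)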

Eutenacity commented 7 months ago

> The Ray worker is used to host workers on different GPUs for tensor-parallel inference. The gap you are observing is probably kernel launch overhead. Which version are you using here? Newer vLLM versions have enabled CUDA graph capture, which reduces this overhead.

0.3.0