vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: How to get the latency of each request with benchmark_serving.py #4145

Closed: wanzhenchn closed this issue 6 months ago

wanzhenchn commented 6 months ago

Proposal to improve performance

I have run 3 cases with benchmark_serving.py to conduct benchmark tests.

The server is launched with --max-num-seqs 1 to test the performance at batch_size = 1:

python3 -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 --port 8014 \
--model /data/models/vicuna-13b-v1.5 \
--dtype auto -tp 2 --max-model-len 4096 \
--max-num-seqs 1 \
--gpu-memory-utilization 0.85 \
--enable-prefix-caching \
--swap-space 16

However, the latency of each request gradually increases, which seems abnormal.

So what does the latency at https://github.com/vllm-project/vllm/blob/main/benchmarks/backend_request_func.py#L254 refer to?

How to get the latency of each request?

[screenshot of the benchmark output]

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

Collecting environment information...
PyTorch version: 2.2.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.27.6
Libc version: glibc-2.35

Python version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-52-shopee-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A30
GPU 1: NVIDIA A30

Nvidia driver version: 535.104.12
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          144
On-line CPU(s) list:             0-143
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz
CPU family:                      6
Model:                           106
Thread(s) per core:              2
Core(s) per socket:              36
Socket(s):                       2
Stepping:                        6
CPU max MHz:                     3500.0000
CPU min MHz:                     800.0000
BogoMIPS:                        4200.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities
Virtualization:                  VT-x
L1d cache:                       3.4 MiB (72 instances)
L1i cache:                       2.3 MiB (72 instances)
L2 cache:                        90 MiB (72 instances)
L3 cache:                        108 MiB (2 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126,128,130,132,134,136,138,140,142
NUMA node1 CPU(s):               1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95,97,99,101,103,105,107,109,111,113,115,117,119,121,123,125,127,129,131,133,135,137,139,141,143
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.22.2
[pip3] onnx==1.14.0
[pip3] pytorch-quantization==2.1.2
[pip3] torch==2.2.1
[pip3] torch-tensorrt==0.0.0
[pip3] torchdata==0.7.0a0
[pip3] torchtext==0.16.0a0
[pip3] torchvision==0.16.0a0
[pip3] triton==2.2.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.0.post1
vLLM Build Flags:
CUDA Archs: 8.0; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  SYS 0,2,4,6,8,10    0       N/A
GPU1    SYS  X  1,3,5,7,9,11    1       N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
simon-mo commented 6 months ago

@ywang96

ywang96 commented 6 months ago

@wanzhenchn

Latency for each request depends on the input and especially the output lengths - which is unclear in this case, since it seems that you have modified the script yourself (we never had flags such as --dataset_path or --request_output_len), so I cannot see what is going on here.

If you use the version from the main branch and specify --save-result, we save a result JSON that keeps information from the benchmark: https://github.com/vllm-project/vllm/blob/a53222544c6385ee314a26fdf42eb14f5b4e5ad9/benchmarks/benchmark_serving.py#L313-L333

I didn't add the e2e latency of each request to this result JSON previously because it is less informative (it's generally dominated by the output length), but if you really need it, it can be calculated as ttft + sum(itl) for each entry.

Please feel free to also make a PR to save output.latency into this JSON; I'm not particularly against it.
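
For example, something along these lines should work on the saved file (a rough sketch; the exact key names for the per-request TTFT and inter-token-latency lists are assumed here, so check what --save-result actually writes in your version):

import json

# Rough sketch: reconstruct per-request e2e latency as ttft + sum(itl).
# The key names "ttfts" and "itls" are assumptions; adjust them to match
# the JSON produced by --save-result in your benchmark_serving.py.
with open("benchmark_result.json") as f:
    result = json.load(f)

e2e_latencies = [
    ttft + sum(itl)
    for ttft, itl in zip(result["ttfts"], result["itls"])
]
print(e2e_latencies)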

wanzhenchn commented 6 months ago

@wanzhenchn

Latency for each request depends on the input and especially the output lengths - which is unclear in this case, since it seems that you have modified the script yourself (we never had flags such as --dataset_path or --request_output_len), so I cannot see what is going on here.

If you use the version from the main branch and specify --save-result, we save a result JSON that keeps information from the benchmark:

https://github.com/vllm-project/vllm/blob/a53222544c6385ee314a26fdf42eb14f5b4e5ad9/benchmarks/benchmark_serving.py#L313-L333

I didn't add the e2e latency of each request to this result JSON previously because it is less informative (it's generally dominated by the output length), but if you really need it, it can be calculated as ttft + sum(itl) for each entry.

Please feel free to also make a PR to save output.latency into this JSON; I'm not particularly against it.

@ywang96

Many thanks for your response.

I just modified the function async_request_openai_completions() in backend_request_func.py to return the usage info for each request as follows:

import json
import os
import sys
import time
import traceback
from dataclasses import dataclass, field
from typing import List, Optional

import aiohttp
from tqdm.asyncio import tqdm

# Timeout and helper mirrored from benchmarks/backend_request_func.py
# (the exact timeout value is not critical here).
AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=6 * 60 * 60)


def remove_prefix(text: str, prefix: str) -> str:
    if text.startswith(prefix):
        return text[len(prefix):]
    return text


@dataclass
class RequestFuncInput:
    prompt: str
    api_url: str
    model: str
    request_output_len: int
    top_p: float
    top_k: int
    repetition_penalty: float
    temperature: float
    best_of: int = 1
    use_beam_search: bool = False

@dataclass
class RequestFuncOutput:
    generated_text: str = ""
    success: bool = False
    latency: float = 0.0
    ttft: float = 0.0  # Time to first token
    itl: List[float] = field(default_factory=list)  # List of inter-token latencies
    prompt_len: int = 0
    output_len: int = 0
    error: str = ""

async def async_request_openai_completions(
    request_func_input: RequestFuncInput,
    pbar: Optional[tqdm] = None,
) -> RequestFuncOutput:
    api_url = request_func_input.api_url
    assert api_url.endswith(
        "v1/completions"
    ), "OpenAI Completions API URL must end with 'v1/completions'."

    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
        assert not request_func_input.use_beam_search
        payload = {
            "model": request_func_input.model,
            "prompt": request_func_input.prompt,
            "max_tokens": request_func_input.request_output_len,
            "stream": True,
            "temperature": request_func_input.temperature,
            "top_p": request_func_input.top_p,
            "top_k": request_func_input.top_k,
            "repetition_penalty": request_func_input.repetition_penalty,
            "best_of": request_func_input.best_of,
        }
        headers = {
            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"
        }

        output = RequestFuncOutput()

        generated_text = ""
        ttft = 0.0
        st = time.perf_counter()
        most_recent_timestamp = st
        try:
            async with session.post(url=api_url, json=payload,
                                    headers=headers) as response:
                if response.status == 200:
                    async for chunk in response.content:
                        chunk = chunk.strip()
                        if not chunk:
                            continue

                        chunk = remove_prefix(chunk.decode("utf-8"), "data: ")
                        if chunk == "[DONE]":
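                            # End-to-end latency of this request: wall-clock
                            # time from sending the request to receiving the
                            # final "[DONE]" chunk.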
                            latency = time.perf_counter() - st
                        else:
                            data = json.loads(chunk)

                            if data["choices"][0]["text"]:
                                timestamp = time.perf_counter()
                                # First token
                                if ttft == 0.0:
                                    ttft = time.perf_counter() - st
                                    output.ttft = ttft

                                # Decoding phase
                                # NOTE: Some completion API might have a last
                                # usage summary response without a token so we
                                # do not want to include as inter-token-latency
                                elif data.get("usage", None) is None:
                                    output.itl.append(timestamp -
                                                      most_recent_timestamp)

                                most_recent_timestamp = timestamp
                                generated_text += data["choices"][0]["text"]

                    # get usage summary response
                    output.prompt_len = data["usage"]["prompt_tokens"]
                    output.output_len = data["usage"]["completion_tokens"]
                    output.generated_text = generated_text
                    output.success = True
                    output.latency = latency
        except Exception:
            output.success = False
            exc_info = sys.exc_info()
            output.error = "".join(traceback.format_exception(*exc_info))

    if pbar:
        pbar.update(1)
    return output

The prompt_len, output_len and latency can then be returned through RequestFuncOutput to calculate_metrics():

from typing import List, Tuple

import numpy as np

# BenchmarkMetrics is the (modified) metrics dataclass defined in
# benchmark_serving.py; RequestFuncOutput comes from backend_request_func.py.


def calculate_metrics(outputs: List[RequestFuncOutput],
                      dur_s: float) -> Tuple[BenchmarkMetrics, List[int]]:
    actual_output_lens = []
    total_input = 0
    completed = 0
    tpots = []
    ttfts = []
    res_latency = []

    for i in range(len(outputs)):
        if outputs[i].success:
            output_len = outputs[i].output_len
            actual_output_lens.append(output_len)

            # return latency for each request
            res_latency.append(outputs[i].latency)

            total_input += outputs[i].prompt_len
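            # TPOT (time per output token) excludes the first token:
            # (latency - ttft) / (output_len - 1).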
            if output_len > 1:
                tpots.append(
                    (outputs[i].latency - outputs[i].ttft) / (output_len - 1))
            ttfts.append(outputs[i].ttft)
            completed += 1
        else:
            actual_output_lens.append(0)
            res_latency.append(0)

    metrics = BenchmarkMetrics(
        completed=completed,
        total_input=total_input,
        input_token_avg=total_input / completed,
        total_output=sum(actual_output_lens),
        output_token_avg=sum(actual_output_lens) / completed,
        elapsed_time=dur_s,
        request_throughput=completed / dur_s,
        input_throughput=total_input / dur_s,
        output_throughput=sum(actual_output_lens) / dur_s,
        # ttfts is empty if streaming is not supported by backend
        mean_ttft=np.mean(ttfts or 0),
        median_ttft=np.median(ttfts or 0),
        p99_ttft=np.percentile(ttfts or 0, 99),
        mean_tpot=np.mean(tpots),
        median_tpot=np.median(tpots),
        p99_tpot=np.percentile(tpots, 99),
        p90_latency=np.percentile(res_latency, 90),
        p95_latency=np.percentile(res_latency, 95),
        p99_latency=np.percentile(res_latency, 99),
        avg_latency=np.mean(res_latency),
    )

    # for debug
    import ipdb; ipdb.set_trace()

    return metrics, actual_output_lens
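
For reference, a minimal hypothetical sanity check of this function with made-up outputs could look like the following (the ipdb breakpoint above would need to be commented out first; all numbers are invented, not real measurements):

# Hypothetical sanity check with made-up numbers, just to verify the
# latency statistics are wired up correctly.
fake_outputs = [
    RequestFuncOutput(success=True, latency=1.2, ttft=0.1,
                      prompt_len=30, output_len=64),
    RequestFuncOutput(success=True, latency=2.5, ttft=0.1,
                      prompt_len=120, output_len=128),
    RequestFuncOutput(success=False),
]
metrics, output_lens = calculate_metrics(fake_outputs, dur_s=4.0)
print(metrics.avg_latency, metrics.p95_latency, metrics.p99_latency)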

The prompt_len, output_len and res_latency for the 3 requests are shown in the screenshot below.

[screenshot: prompt_len, output_len and res_latency for 3 requests]

The ttft and latency seem clearly abnormal for the 3 cases. What's the matter?

BTW, the server is launched with --max-num-seqs 1 to test the performance at batch_size = 1.

wanzhenchn commented 6 months ago

I have also set a breakpoint in the official benchmark_serving.py and ran the following command:

python benchmark_serving.py \
--model /data/models/vicuna-13b-v1.5 \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--port 8014 \
--num-prompts 5

The ttft seems incorrect for prompt_len=30: it took 3.9547242913395166 s. The other cases show the same phenomenon. @ywang96

[screenshot of the values at the breakpoint]

wanzhenchn commented 6 months ago

Do you have any verification for the above-mentioned issue? @ywang96

mces89 commented 6 months ago

@wanzhenchn did you figure out the reason, since you closed this issue?

wanzhenchn commented 6 months ago

@wanzhenchn did you figure out the reason, since you closed this issue?

I have found the reason; you can review https://github.com/vllm-project/vllm/issues/4252 for the details. @mces89

George-ao commented 6 months ago

@wanzhenchn

Latency for each request depends on the input and especially the output lengths - which is unclear in this case, since it seems that you have modified the script yourself (we never had flags such as --dataset_path or --request_output_len), so I cannot see what is going on here.

If you use the version from the main branch and specify --save-result, we save a result JSON that keeps information from the benchmark:

https://github.com/vllm-project/vllm/blob/a53222544c6385ee314a26fdf42eb14f5b4e5ad9/benchmarks/benchmark_serving.py#L313-L333

I didn't add the e2e latency of each request to this result JSON previously because it is less informative (it's generally dominated by the output length), but if you really need it, it can be calculated as ttft + sum(itl) for each entry.

Please feel free to also make a PR to save output.latency into this JSON; I'm not particularly against it.

Do we have an easy way to get latency results like ttft and tbt if we are using offline inference?