Closed: wanzhenchn closed this issue 6 months ago
@ywang96
@wanzhenchn
Latency for each request depends on the input and especially the output lengths, which is unclear in this case since it seems you have modified the script yourself (we never had flags such as --dataset_path or --request_output_len), so I cannot tell what is going on here.
If you use the version from the main branch and specify --save-result, we have a result json that keeps information from the benchmark:
https://github.com/vllm-project/vllm/blob/a53222544c6385ee314a26fdf42eb14f5b4e5ad9/benchmarks/benchmark_serving.py#L313-L333
I didn't previously add the e2e latency of each request to this result json because it gives less useful information (it is generally dominated by the output length), but if you really need it, it can be calculated as ttft + sum(itl) for each entry.
Please feel free to also make a PR to save output.latency into this json, I'm not particularly against it.
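For instance, the per-request e2e latency can be reconstructed from the saved file roughly like this (a sketch only; the "ttfts"/"itls" field names and the file name are assumptions here, so check the json your version actually writes):

import json

# Sketch: rebuild per-request e2e latency from a --save-result json.
# The "ttfts" and "itls" keys (and the file name) are assumed, not guaranteed.
with open("benchmark_result.json") as f:
    result = json.load(f)

e2e_latencies = [
    ttft + sum(itl)  # e2e latency = ttft + sum of inter-token latencies
    for ttft, itl in zip(result["ttfts"], result["itls"])
]
print(e2e_latencies[:5])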
@ywang96
Many thanks for your response.
I just modified the function async_request_openai_completions() in backend_request_func.py to return the usage info for each request as follows:
@dataclass
class RequestFuncInput:
    prompt: str
    api_url: str
    model: str
    request_output_len: int
    top_p: float
    top_k: int
    repetition_penalty: float
    temperature: float
    best_of: int = 1
    use_beam_search: bool = False


@dataclass
class RequestFuncOutput:
    generated_text: str = ""
    success: bool = False
    latency: float = 0.0
    ttft: float = 0.0  # Time to first token
    itl: List[float] = field(default_factory=list)  # List of inter-token latencies
    prompt_len: int = 0
    output_len: int = 0
    error: str = ""
async def async_request_openai_completions(
    request_func_input: RequestFuncInput,
    pbar: Optional[tqdm] = None,
) -> RequestFuncOutput:
    api_url = request_func_input.api_url
    assert api_url.endswith(
        "v1/completions"
    ), "OpenAI Completions API URL must end with 'v1/completions'."

    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
        assert not request_func_input.use_beam_search
        payload = {
            "model": request_func_input.model,
            "prompt": request_func_input.prompt,
            "max_tokens": request_func_input.request_output_len,
            "stream": True,
            "temperature": request_func_input.temperature,
            "top_p": request_func_input.top_p,
            "top_k": request_func_input.top_k,
            "repetition_penalty": request_func_input.repetition_penalty,
            "best_of": request_func_input.best_of,
        }
        headers = {
            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"
        }

        output = RequestFuncOutput()

        generated_text = ""
        ttft = 0.0
        st = time.perf_counter()
        most_recent_timestamp = st
        try:
            async with session.post(url=api_url, json=payload,
                                    headers=headers) as response:
                if response.status == 200:
                    async for chunk in response.content:
                        chunk = chunk.strip()
                        if not chunk:
                            continue

                        chunk = remove_prefix(chunk.decode("utf-8"), "data: ")
                        if chunk == "[DONE]":
                            latency = time.perf_counter() - st
                        else:
                            data = json.loads(chunk)

                            if data["choices"][0]["text"]:
                                timestamp = time.perf_counter()
                                # First token
                                if ttft == 0.0:
                                    ttft = time.perf_counter() - st
                                    output.ttft = ttft

                                # Decoding phase
                                # NOTE: Some completion API might have a last
                                # usage summary response without a token so we
                                # do not want to include as inter-token-latency
                                elif data.get("usage", None) is None:
                                    output.itl.append(timestamp -
                                                      most_recent_timestamp)

                                most_recent_timestamp = timestamp
                                generated_text += data["choices"][0]["text"]

                    # get usage summary response
                    output.prompt_len = data["usage"]["prompt_tokens"]
                    output.output_len = data["usage"]["completion_tokens"]
                    output.generated_text = generated_text
                    output.success = True
                    output.latency = latency
        except Exception:
            output.success = False
            exc_info = sys.exc_info()
            output.error = "".join(traceback.format_exception(*exc_info))

    if pbar:
        pbar.update(1)
    return output
Then the prompt_len, output_len and latency can be returned through RequestFuncOutput to calculate_metrics():
def calculate_metrics(outputs: List[RequestFuncOutput],
                      dur_s: float) -> Tuple[BenchmarkMetrics, List[int]]:
    actual_output_lens = []
    total_input = 0
    completed = 0
    tpots = []
    ttfts = []
    res_latency = []
    for i in range(len(outputs)):
        if outputs[i].success:
            output_len = outputs[i].output_len
            actual_output_lens.append(output_len)
            # return latency for each request
            res_latency.append(outputs[i].latency)
            total_input += outputs[i].prompt_len
            if output_len > 1:
                tpots.append(
                    (outputs[i].latency - outputs[i].ttft) / (output_len - 1))
            ttfts.append(outputs[i].ttft)
            completed += 1
        else:
            actual_output_lens.append(0)
            res_latency.append(0)

    metrics = BenchmarkMetrics(
        completed=completed,
        total_input=total_input,
        input_token_avg=total_input / completed,
        total_output=sum(actual_output_lens),
        output_token_avg=sum(actual_output_lens) / completed,
        elapsed_time=dur_s,
        request_throughput=completed / dur_s,
        input_throughput=total_input / dur_s,
        output_throughput=sum(actual_output_lens) / dur_s,
        # ttfts is empty if streaming is not supported by backend
        mean_ttft=np.mean(ttfts or 0),
        median_ttft=np.median(ttfts or 0),
        p99_ttft=np.percentile(ttfts or 0, 99),
        mean_tpot=np.mean(tpots),
        median_tpot=np.median(tpots),
        p99_tpot=np.percentile(tpots, 99),
        p90_latency=np.percentile(res_latency, 90),
        p95_latency=np.percentile(res_latency, 95),
        p99_latency=np.percentile(res_latency, 99),
        avg_latency=np.mean(res_latency),
    )

    # for debug
    import ipdb; ipdb.set_trace()

    return metrics, actual_output_lens
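As a quick sanity check (just a sketch, not part of the script), each successful request's latency should be close to ttft + sum(itl); a large gap would point to where the extra time is being spent:

# sanity-check sketch: for a streaming request, e2e latency should roughly
# equal ttft + sum(itl); `outputs` is the List[RequestFuncOutput] from above
for out in outputs:
    if out.success:
        reconstructed = out.ttft + sum(out.itl)
        print(f"latency={out.latency:.3f}s  ttft={out.ttft:.3f}s  "
              f"ttft+sum(itl)={reconstructed:.3f}s")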
The prompt_len, output_len and res_latency for the 3 requests are shown in the screenshot.
The ttft and latency look obviously abnormal in all 3 cases. What is going on?
BTW, the server is launched with --max-num-seqs 1 to test the performance for batch_size = 1.
I have also set a breakpoint in the official benchmark_serving.py and ran the following command:
python benchmark_serving.py \
--model /data/models/vicuna-13b-v1.5 \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--port 8014 \
--num-prompts 5
The ttft seems incorrect for prompt_len=30, which took 3.9547242913395166 s. The other cases show the same phenomenon. @ywang96
Have you been able to verify the issue described above? @ywang96
@wanzhenchn did you figure out the reason before you closed this issue?
I have found the reason; you can review https://github.com/vllm-project/vllm/issues/4252 to see the details. @mces89
Do we have an easy way to get latency results like ttft and tbt when using offline inference?
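One possible direction, as a sketch only: it assumes a vLLM version whose RequestOutput exposes a metrics object with arrival/first-token/finish timestamps, so verify the attribute names against your installed version before relying on them.

from vllm import LLM, SamplingParams

# Sketch, not an official recipe: output.metrics and its fields
# (arrival_time, first_token_time, finished_time) are assumed to exist
# in your vLLM version; they may be None or named differently.
llm = LLM(model="/data/models/vicuna-13b-v1.5")
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(max_tokens=128))

for output in outputs:
    m = output.metrics
    if m is not None and m.first_token_time is not None:
        ttft = m.first_token_time - m.arrival_time
        e2e = m.finished_time - m.arrival_time
        n_tokens = len(output.outputs[0].token_ids)
        tbt = (e2e - ttft) / max(n_tokens - 1, 1)  # mean time between tokens
        print(f"ttft={ttft:.3f}s  e2e={e2e:.3f}s  mean_tbt={tbt:.4f}s")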
Proposal to improve performance
I have run 3 cases with benchmark_serving.py to conduct a benchmark test.
The server is launched with --max-num-seqs 1 to test the performance for batch_size = 1. However, the latency of each request gradually increases, which seems abnormal.
So what does the latency at https://github.com/vllm-project/vllm/blob/main/benchmarks/backend_request_func.py#L254 refer to? How can I get the latency of each request?
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)