mlcommons / inference

Reference implementations of MLPerf™ inference benchmarks
https://mlcommons.org/en/groups/inference
Apache License 2.0
1.18k stars 518 forks

MoE Server ignores setting TTFT/TPOT latencies from user.conf #1787

Open psyhtest opened 1 month ago

psyhtest commented 1 month ago

For an Open submission, we have tried setting TTFT/TPOT latencies in user.conf, e.g.:

mixtral-8x7b.Server.ttft_latency = 3000
mixtral-8x7b.Server.tpot_latency = 300

However, the resulting mlperf_log_summary.txt still showed the default values:

ttft_latency (ns): 2000000000
tpot_latency (ns): 200000000

Only manually modifying mlperf.conf would get us the right values.

This may not be specific to MoE, but to all LLM benchmarks.
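One quick way to see which latency constraints LoadGen actually applied is to grep the summary log. This is a sketch, not part of the reference harness; the file path and the sample contents below (reproducing the default values reported above) are assumptions for illustration:

```shell
# Recreate a summary file with the defaults reported in this issue
# (in a real run, inspect the mlperf_log_summary.txt your benchmark wrote).
cat > /tmp/mlperf_log_summary.txt <<'EOF'
ttft_latency (ns): 2000000000
tpot_latency (ns): 200000000
EOF

# If the user.conf override (3000 / 300, apparently in ms) had taken
# effect, these lines would read 3000000000 / 300000000 ns instead.
grep -E 'ttft_latency|tpot_latency' /tmp/mlperf_log_summary.txt
```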

arjunsuresh commented 1 month ago

@psyhtest Are we allowed to change the server scenario latency for the open division?

psyhtest commented 1 month ago

I can't see why not?

arjunsuresh commented 1 month ago

I don't recall seeing that in the rules 😇.

psyhtest commented 1 month ago

As usual, "anything goes" in Open.

arjunsuresh commented 1 month ago

I was wrong - this is allowed in the rules. We should get the user.conf working.

pgmpablo157321 commented 1 month ago

@psyhtest You also need to set the use_token_latencies flag:

mixtral-8x7b.*.use_token_latencies = 1
mixtral-8x7b.Server.ttft_latency = 3000
mixtral-8x7b.Server.tpot_latency = 300
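With the flag set, it is worth cross-checking that the summary reflects the overrides. A minimal sketch of such a check, assuming the user.conf values are in milliseconds and the summary reports nanoseconds (consistent with 2000 → 2000000000 above); the parsing helper is mine, not part of LoadGen:

```python
import re

# Hypothetical helper: parse "key = value" lines from a LoadGen-style conf.
def parse_conf(text):
    conf = {}
    for line in text.splitlines():
        m = re.match(r"\s*([\w.*-]+)\s*=\s*(\d+)\s*$", line)
        if m:
            conf[m.group(1)] = int(m.group(2))
    return conf

# The user.conf contents from this issue (values assumed to be in ms).
user_conf = parse_conf("""
mixtral-8x7b.*.use_token_latencies = 1
mixtral-8x7b.Server.ttft_latency = 3000
mixtral-8x7b.Server.tpot_latency = 300
""")

# Sample summary lines as they would look if the override took effect.
summary = "ttft_latency (ns): 3000000000\ntpot_latency (ns): 300000000"
reported = {k: int(v)
            for k, v in re.findall(r"(\w+_latency) \(ns\): (\d+)", summary)}

# Compare after converting ms -> ns (1 ms = 1_000_000 ns).
for key in ("ttft_latency", "tpot_latency"):
    expected_ns = user_conf[f"mixtral-8x7b.Server.{key}"] * 1_000_000
    assert reported[key] == expected_ns, f"{key}: override not applied"
print("user.conf latency overrides applied")
```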