mlcommons / inference

Reference implementations of MLPerf™ inference benchmarks
https://mlcommons.org/en/groups/inference
Apache License 2.0

[llama] Duplicated EOS tokens affect performance measurements #1545

Closed szutenberg closed 9 months ago

szutenberg commented 10 months ago

First issue discussed at today's taskforce meeting:

Reproducer: duplicated_eos_issue.py.txt

Output:

================================================
MLPerf Results Summary
================================================
SUT name : PySUT
Scenario : Offline
Mode     : PerformanceOnly
Samples per second: 455.085
Tokens per second: 2730.51
Result is : VALID
  Min duration satisfied : Yes
  Min queries satisfied : Yes
  Early stopping satisfied: Yes

Expected output:

================================================
MLPerf Results Summary
================================================
SUT name : PySUT
Scenario : Offline
Mode     : PerformanceOnly
Samples per second: 455.085
Tokens per second: 2275.43
Result is : VALID
  Min duration satisfied : Yes
  Min queries satisfied : Yes
  Early stopping satisfied: Yes

Why?

All responses are [10, 1, 5, 3, 2, 2], i.e. each one contains two EOS tokens (value 2). All tokens after the first EOS should be ignored for latency (TPOT in Server) and throughput (tokens per second in Offline) measurements.
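For reference, a minimal sketch of the truncation the SUT (not LoadGen) would apply before reporting a response, assuming EOS id 2 as in the reproducer; only the token count is shown, not the actual call into LoadGen:

```python
EOS_TOKEN_ID = 2  # Llama 2 </s>; benchmark-specific assumption


def truncate_at_first_eos(tokens, eos_id=EOS_TOKEN_ID):
    """Keep everything up to and including the first EOS; drop padded tokens."""
    if eos_id in tokens:
        return list(tokens[: tokens.index(eos_id) + 1])
    return list(tokens)


response = [10, 1, 5, 3, 2, 2]
trimmed = truncate_at_first_eos(response)
assert trimmed == [10, 1, 5, 3, 2]

# len(trimmed) == 5 is the token count the SUT would report to LoadGen,
# which gives the expected 2275.43 tok/s above (2730.51 * 5 / 6).
```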

szutenberg commented 10 months ago

@pgmpablo157321

pgmpablo157321 commented 10 months ago

@szutenberg I think this should be dealt with by the reference (and other) implementations. Loadgen should in general be agnostic to the output of the model.

nv-alicheng commented 10 months ago

@pgmpablo157321

In general LoadGen should be agnostic to the output of the model and not do any kind of post-processing. I think this should include dropping parts of the output

I think the issue here is that LoadGen needs to assert that the output provided to it does not end in EOS tokens, since without this assertion it can lead to abuse.

Let's say I have an accelerator and the true tok/sec inference rate is 50 tok/sec. For ease of math, let max_seq_len be 1000. Let's also make the assumption that when given an input sequence that ends in an EOS token, it will append another EOS token, but it will still do the inference computation.

In run A, I do not implement early stopping, and generate 1000 tokens in 20 seconds, 500 of which are EOS tokens. Loadgen, being agnostic, will count all these EOS tokens, and report 50 tok/sec.

In run B, I stop immediately upon seeing an EOS token, and artificially pad it with 500 EOS tokens. I actually generated 500 non-EOS tokens in 10 seconds, but Loadgen sees 1000, and then reports 100 tok/sec.

This can lead to a lot of abuse. EOS tokens are ignored by the accuracy checker since the tokens are decoded and retokenized for NLTK, so the accuracy test won't catch it. The way around this is that Loadgen should not count EOS tokens towards the tokens/sec metric. This is effectively just stripping away the EOS token from the input to Loadgen (the QueryResponse).

Now our issue is: who strips away the EOS tokens? If the SUT strips them, Loadgen still needs to assert that no EOS tokens were given to it, since there's no guarantee that an SUT removed them or didn't inject them. How is this any different from Loadgen stripping them?

Another point is: tok/sec should mean "non-special (readable) tokens / second". Is MLPerf trying to benchmark the ops/sec of an accelerator, or the performance of an accelerator in the context of consumer workloads? I always thought it was the latter. In run A, even though it did all the computations, if it ends up outputting 500 EOS tokens, run A's implementation should be penalized since those tokens don't actually matter to the consumer workload.

tl;dr: To prevent cheating and also to adhere to what is presumably the goal of MLPerf's benchmark, either LoadGen needs to assert that EOS tokens were removed, or remove them itself.

cc: @nvzhihanj
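A minimal sketch of the check being argued for, regardless of whether LoadGen or a compliance script ends up performing it. EOS id 2 is a benchmark-specific assumption; a single trailing EOS is allowed, since a legitimate generation stops there:

```python
EOS_TOKEN_ID = 2


def response_is_clean(tokens, eos_id=EOS_TOKEN_ID):
    """True if the response contains no EOS except possibly the final token."""
    eos_positions = [i for i, t in enumerate(tokens) if t == eos_id]
    return eos_positions in ([], [len(tokens) - 1])


assert response_is_clean([10, 1, 5, 3])            # no EOS at all
assert response_is_clean([10, 1, 5, 3, 2])         # single trailing EOS
assert not response_is_clean([10, 1, 5, 3, 2, 2])  # padded with extra EOS
```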

nv-alicheng commented 10 months ago

I should add that theoretically it is possible to write an audit test that:

  1. Runs a performance test and stores the SUT responses.
  2. Checks for the presence of EOS tokens.
  3. If none are present it passes immediately.
  4. If they are, it starts a second performance run where it does something like appending EOS tokens to the input to force the model to only output EOS tokens. This should also trigger any early-EOS detection mechanisms.
  5. Compare the tok/sec with the original performance run. If it was significantly faster the second time, then there might be cheating happening.

The above probably isn't completely sufficient for cheat detection, but even this seems like a lot of work and effort to write a test for a single model (maybe 2 with GPT-J?), simply to keep LoadGen completely 100% agnostic to the model, when you can guarantee it with a simple check on the responses.
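A rough sketch of that audit flow. run_performance_test and force_eos_inputs are hypothetical harness hooks, not existing APIs; the tolerance is an arbitrary illustrative threshold:

```python
EOS_TOKEN_ID = 2   # benchmark-specific assumption
TOLERANCE = 0.10   # allowed tok/sec increase before flagging; arbitrary choice


def audit_eos_padding(run_performance_test, force_eos_inputs, dataset):
    """Steps 1-5 above; both callables are hypothetical harness hooks."""
    # 1. Baseline performance run, keeping the SUT responses.
    baseline_tps, responses = run_performance_test(dataset)

    # 2-3. If no response contains an EOS token, pass immediately.
    if not any(EOS_TOKEN_ID in tokens for tokens in responses):
        return True

    # 4. Re-run with inputs crafted so the model emits (mostly) EOS tokens,
    #    which should also exercise any early-EOS stopping path.
    forced_tps, _ = run_performance_test(force_eos_inputs(dataset))

    # 5. A much faster second run suggests padded EOS tokens inflated tok/sec.
    return forced_tps <= baseline_tps * (1.0 + TOLERANCE)
```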

szutenberg commented 10 months ago

I think we want LoadGen to have as little overhead as possible. Also, I believe that llama2 is not the only benchmark that will use our new token_latencies feature. Therefore, I agree with @pgmpablo157321: LoadGen should not know what the EOS token id in Llama is or how to postprocess output tokens. Implementing a benchmark-specific test is necessary.

I don't understand the flow described by @nv-alicheng. The only thing to do would be to check whether there are duplicated EOS tokens in the mlperf accuracy file. As a quick solution, we could modify evaluate-accuracy.py to return 0 accuracy if duplicated EOS tokens are found (to be turned off by an extra flag which explicitly won't be allowed for submissions). In that case no extra compliance test would be required, so we'd save some effort.

In GPT-J this issue does not matter since we don't count performance in tokens per second.
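A sketch of that quick solution: a scan over mlperf_log_accuracy.json, assuming each entry carries the output token ids as a hex-encoded buffer (the same decoding evaluate-accuracy.py already performs); the dtype and the gating flag are assumptions:

```python
import json

import numpy as np

EOS_TOKEN_ID = 2  # Llama 2 </s>


def samples_with_duplicated_eos(accuracy_log_path, dtype=np.int64):
    """Return the qsl_idx of every sample whose output holds more than one EOS."""
    offenders = []
    with open(accuracy_log_path) as f:
        for entry in json.load(f):
            tokens = np.frombuffer(bytes.fromhex(entry["data"]), dtype)
            if np.count_nonzero(tokens == EOS_TOKEN_ID) > 1:
                offenders.append(entry["qsl_idx"])
    return offenders


# evaluate-accuracy.py could then zero out the reported scores, e.g.
# (--allow-duplicated-eos is a hypothetical flag, not an existing one):
#
#   if samples_with_duplicated_eos(accuracy_file) and not args.allow_duplicated_eos:
#       result = {metric: 0 for metric in result}
```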

pgmpablo157321 commented 10 months ago

The above probably isn't completely sufficient for cheat detection, but even just this seems like a lot of work and effort to write a test for, only for a single model (maybe 2 with GPT-J?), simply to be completely 100% agnostic to the model, when you can guarantee it with a simple check on the responses

@nv-alicheng I agree that that compliance test would require a lot of work and that we can simply check the responses. However, I would suggest checking them not directly in loadgen, but through a compliance test as well. We could take the results of the accuracy run or of the TEST01 run and write a verification script that checks the responses. That would be a new compliance test.

If we need to make a dedicated run for the test (to maintain consistency), it could be a run similar to TEST01.

szutenberg commented 9 months ago

The issue is solved by the new compliance test: https://github.com/mlcommons/inference/pull/1556. Closing the issue.