Closed: szutenberg closed this issue 9 months ago.
@pgmpablo157321
@szutenberg I think this should be dealt with by the reference (and other) implementations. LoadGen should in general be agnostic to the output of the model.
@pgmpablo157321
In general, LoadGen should be agnostic to the output of the model and not do any kind of post-processing. I think this should include dropping parts of the output.
I think the issue here is that LoadGen needs to assert that the output provided to it does not end in extra EOS tokens, since without this assertion the metric can be abused.
Let's say I have an accelerator whose true inference rate is 50 tok/sec. For ease of math, let max_seq_len be 1000. Let's also assume that when the model is given an input sequence that ends in an EOS token, it appends another EOS token but still does the inference computation.
In run A, I do not implement early stopping and generate 1000 tokens in 20 seconds, 500 of which are EOS tokens. LoadGen, being agnostic, counts all of these EOS tokens and reports 50 tok/sec.
In run B, I stop immediately upon seeing an EOS token and artificially pad the output with 500 EOS tokens. I actually generated 500 non-EOS tokens in 10 seconds, but LoadGen sees 1000 tokens and reports 100 tok/sec.
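To make the arithmetic concrete, here is a minimal sketch (the numbers are the hypothetical ones from the scenario above, not measurements from a real system):

```python
# Hypothetical numbers from the run A / run B scenario above.
MAX_SEQ_LEN = 1000

# Run A: no early stopping; all 1000 positions are generated (500 are EOS).
run_a_reported_tokens = 1000
run_a_seconds = 20.0
print(run_a_reported_tokens / run_a_seconds)   # 50.0 tok/sec reported

# Run B: stop at the first EOS, then pad the output with EOS up to max_seq_len.
run_b_useful_tokens = 500
run_b_seconds = 10.0
run_b_reported_tokens = MAX_SEQ_LEN            # padding is counted by LoadGen
print(run_b_reported_tokens / run_b_seconds)   # 100.0 tok/sec reported
print(run_b_useful_tokens / run_b_seconds)     # 50.0 tok/sec of useful output
```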
As the numbers show, this can lead to a lot of abuse. EOS tokens are ignored by the accuracy checker, since the tokens are decoded and retokenized for NLTK, so the accuracy test won't catch it. The way around this is for LoadGen not to count EOS tokens towards the tokens/sec metric. This is effectively just stripping the EOS tokens from the input to LoadGen (the QueryResponse).
Now our issue is: who strips away the EOS tokens? If the SUT strips them, LoadGen still needs to assert that no EOS tokens were given to it, since there's no guarantee that an SUT removed them or didn't inject them. How is this any different from LoadGen stripping them itself?
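For illustration, a minimal sketch of both options, assuming a hypothetical EOS token id of 2; the check could live either in the SUT before it builds its QueryResponse, or in LoadGen (or a verifier) as an assertion:

```python
EOS_TOKEN_ID = 2  # assumption: the EOS id of the model under test

def truncate_at_first_eos(tokens, eos_id=EOS_TOKEN_ID):
    """What an honest SUT would do: drop everything after the first EOS."""
    if eos_id in tokens:
        return tokens[: tokens.index(eos_id) + 1]
    return tokens

def assert_no_trailing_eos(tokens, eos_id=EOS_TOKEN_ID):
    """What LoadGen (or a verifier) would check: no tokens after the first EOS."""
    if eos_id in tokens and len(tokens) > tokens.index(eos_id) + 1:
        raise ValueError("response contains tokens after the first EOS")

# Example with a padded response:
assert truncate_at_first_eos([10, 1, 5, 3, 2, 2]) == [10, 1, 5, 3, 2]
```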
Another point is that tok/sec should mean "non-special (readable) tokens per second". Is MLPerf trying to benchmark the ops/sec of an accelerator, or the performance of an accelerator in the context of consumer workloads? I always thought it was the latter. In run A, even though it did all the computations, if it ends up outputting 500 EOS tokens, its implementation should be penalized, since those tokens don't actually matter to the consumer workload.
tl;dr: To prevent cheating and also to adhere to what is presumably the goal of MLPerf's benchmark, either LoadGen needs to assert that EOS tokens were removed, or remove them itself.
cc: @nvzhihanj
I should add that theoretically it is possible to write an audit test that:
The above probably isn't completely sufficient for cheat detection, and even just this seems like a lot of work and effort to write a test for, only for a single model (maybe two with GPT-J?), simply to stay completely agnostic to the model, when you can guarantee the same thing with a simple check on the responses.
I think we want LoadGen to have as little overhead as possible. Also, I believe that llama2 is not the only benchmark that will be using our new token_latencies feature. Therefore, I agree with @pgmpablo157321: LoadGen should not be aware of "what the EOS token id is in llama" or "how to post-process output tokens". Implementing a benchmark-specific test is necessary.
I don't understand the flow described by @nv-alicheng. The only thing to do would be to check whether there are duplicated EOS tokens in the mlperf accuracy file. As a quick solution, we can modify evaluate-accuracy.py to return 0 accuracy if duplicated EOS tokens are found (to be turned off by an extra flag that explicitly won't be allowed for submission). In this case no extra compliance test would be required, so we'd save some effort.
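A rough sketch of that quick solution (not the actual evaluate-accuracy.py code; it assumes the accuracy log entries carry the output token ids as a hex-encoded int32 buffer in their "data" field, and an EOS id of 2):

```python
import json
import numpy as np

EOS_TOKEN_ID = 2  # assumption

def has_duplicated_eos(log_path="mlperf_log_accuracy.json", dtype=np.int32):
    """Return True if any logged response contains tokens after its first EOS."""
    with open(log_path) as f:
        results = json.load(f)
    for entry in results:
        tokens = np.frombuffer(bytes.fromhex(entry["data"]), dtype).tolist()
        if EOS_TOKEN_ID in tokens and len(tokens) > tokens.index(EOS_TOKEN_ID) + 1:
            return True
    return False

# evaluate-accuracy.py could then report 0 accuracy when this returns True,
# unless the (non-submittable) override flag is set.
```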
In GPT-J this issue does not matter, since we don't measure performance in tokens per second.
@nv-alicheng I agree that such a compliance test would require a lot of work and that we can simply check the responses. However, I would suggest checking them not directly in LoadGen, but through a compliance test as well. We could take the results of the accuracy run or of the TEST01 run and write a verification script that checks the responses. That would be a new compliance test.
If we need a dedicated run for the test (to maintain consistency), it could be a run similar to TEST01.
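For reference, a standalone verification script for such a run could be as small as the sketch below (the file layout, EOS id, and exit-code convention are assumptions, not the actual compliance test):

```python
import json
import sys
import numpy as np

EOS_TOKEN_ID = 2  # assumption

def verify(log_path):
    """Fail if any response in the accuracy log has tokens after its first EOS."""
    with open(log_path) as f:
        results = json.load(f)
    for entry in results:
        tokens = np.frombuffer(bytes.fromhex(entry["data"]), np.int32).tolist()
        if EOS_TOKEN_ID in tokens and len(tokens) > tokens.index(EOS_TOKEN_ID) + 1:
            print(f"FAIL: qsl_idx {entry.get('qsl_idx')} has tokens after the first EOS")
            return 1
    print("PASS: no tokens after the first EOS in any response")
    return 0

if __name__ == "__main__":
    sys.exit(verify(sys.argv[1]))  # e.g. path to mlperf_log_accuracy.json
```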
The issue is solved by a new compliance test: https://github.com/mlcommons/inference/pull/1556. Closing the issue.
First issue discussed at today's taskforce meeting:
Reproducer: duplicated_eos_issue.py.txt
Output:
Expected output:
Why?
All responses are [10, 1, 5, 3, 2, 2] => there are two EOS tokens (token id 2). All tokens after the first EOS should be ignored for the latency (TPOT in the Server scenario) and throughput (tokens per second in the Offline scenario) measurements.
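A small sketch of what that means for the token count in this example (assuming, as above, that EOS has token id 2):

```python
EOS_TOKEN_ID = 2  # the EOS id used in the reproducer

response = [10, 1, 5, 3, 2, 2]

# Tokens that should count towards TPOT / tokens-per-second:
if EOS_TOKEN_ID in response:
    counted = response[: response.index(EOS_TOKEN_ID) + 1]
else:
    counted = response
print(len(counted))  # 5, not 6: the duplicated trailing EOS is ignored
```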