kingoflolz / mesh-transformer-jax

Model parallel transformers in JAX and Haiku

Discrepancy between results reported in this repo and in the NeoX paper #257

Closed · ghost closed this issue 1 year ago

ghost commented 1 year ago

Hello. I recently noticed that the downstream numbers reported in this repo (and on the Hugging Face page) don't quite match what I get when I run the eval myself with the lm-evaluation-harness. The numbers I get are consistent with those reported in the GPT-NeoX paper. For example, this repo reports a zero-shot HellaSwag score of 66.1, while I (and the NeoX authors) get 51.8. Since it seems you also use the same eval harness, I was hoping you could help me get to the bottom of the difference in evaluation methodology. I have already ruled out dataset contamination as a source of the discrepancy, as neither my evaluation nor the NeoX evaluation uses test-time decontamination. Thanks in advance for clarifying.
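
For reference, a run like the one described above can be reproduced through the harness's Python entry point. This is a minimal sketch, assuming a recent lm-evaluation-harness release and the `EleutherAI/gpt-j-6b` Hugging Face checkpoint; entry points and backend names have changed across harness versions, so the exact call may differ:

```python
# Minimal sketch of a zero-shot HellaSwag eval with lm-evaluation-harness.
# Assumes a recent harness release; entry points and backend names have
# changed across versions, so treat this as illustrative rather than exact.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face causal-LM backend
    model_args="pretrained=EleutherAI/gpt-j-6b",  # assumed GPT-J checkpoint
    tasks=["hellaswag"],
    num_fewshot=0,                                # zero-shot, as reported
)
print(results["results"]["hellaswag"])            # includes both acc and acc_norm
```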

kingoflolz commented 1 year ago

For multiple-choice evals, the eval harness ranks the choices either by the sum of the logprobs (reported as acc) or by the average logprob per token (reported as acc_norm). This matches the evaluation procedure in the GPT-3 paper: "For most tasks we compare the per-token likelihood (to normalize for length)". For each model and benchmark combination, I chose whichever evaluation method maximized the score. For most benchmarks the difference is very small, but it makes a large difference on HellaSwag.
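
To make the distinction concrete, here is a minimal sketch of the two ranking rules as described above. It is illustrative only: the data layout is made up for the example, and the harness's actual normalization details can differ by version (some releases normalize by byte length rather than token count).

```python
import numpy as np

def rank_choices(choice_token_logprobs):
    """Pick the best multiple-choice continuation under both scoring rules.

    choice_token_logprobs: one 1-D array per candidate answer, holding the
    model's logprob for each token of that continuation (an illustrative
    structure, not the harness's internal representation).
    """
    sums = np.array([lp.sum() for lp in choice_token_logprobs])
    means = np.array([lp.mean() for lp in choice_token_logprobs])
    return {
        "acc_pick": int(sums.argmax()),        # sum of logprobs -> acc
        "acc_norm_pick": int(means.argmax()),  # per-token average -> acc_norm
    }

# Toy case where the two rules disagree: the shorter answer wins on total
# logprob, while the longer answer wins once length is normalized away.
choices = [
    np.array([-0.5, -0.5]),                    # short answer: sum -1.0, mean -0.5
    np.array([-0.3, -0.3, -0.3, -0.3, -0.3]),  # longer answer: sum -1.5, mean -0.3
]
print(rank_choices(choices))  # {'acc_pick': 0, 'acc_norm_pick': 1}
```

On a benchmark like HellaSwag, where candidate endings vary in length, which rule you pick can move the score by many points, consistent with the gap described in this issue.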

ghost commented 1 year ago

@kingoflolz Thanks for the quick response - that clarifies things a lot.