Hi!
Thanks for releasing the code. I have one question about the evaluation: it seems the current version of the code only evaluates perplexity? For example, the metric in Table 1 of the paper should be accuracy for most of the QA tasks, but the current eval_harness.py only seems to compute perplexity.
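To make my question concrete, this is roughly what I mean by the difference between the two metrics, as a minimal sketch (not the repo's code): for multiple-choice QA, accuracy picks the answer choice with the highest log-likelihood, whereas perplexity just exponentiates the average negative log-likelihood per token. The `loglikelihood(context, continuation)` callable here is a hypothetical scoring function, not something from eval_harness.py.

```python
import math

def qa_accuracy(examples, loglikelihood):
    """Accuracy: choose the answer option with the highest log-likelihood."""
    correct = 0
    for ex in examples:
        # Score each candidate answer conditioned on the question.
        scores = [loglikelihood(ex["question"], choice) for choice in ex["choices"]]
        if scores.index(max(scores)) == ex["label"]:
            correct += 1
    return correct / len(examples)

def perplexity(token_logprobs):
    """Perplexity: exp of the average negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

Is there a flag or a separate script for reproducing the accuracy numbers in Table 1, or should I extend eval_harness.py along these lines myself?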