Open · abdulfatir opened 7 months ago
Thanks for the note Abdul!
The reported values are an arithmetic mean and you're correct that this is probably suboptimal. Genuine apologies for the error on my part.
I am planning to update the arXiv version with extended experiments from our NeurIPS camera-ready, and I'll include this correction as well.
Please let me know if you have any other comments.
Nate
@ngruver Thanks for your reply. It's an easy mistake to make. In fact, I only found out about the geometric mean idea very recently. Looking forward to the updated results.
Cheers!
Thank you for releasing the code! This is a very interesting piece of work. Congrats on the NeurIPS acceptance! 🎉
As I understand it, you aggregate normalized scores to report the final scaled score, and it looks like the aggregation uses the arithmetic mean. Please correct me if I am wrong.
The arithmetic mean is not a reliable way to summarize a normalized metric: the result depends on which baseline the scores were normalized by, so it can lead to misleading conclusions. A better way to aggregate normalized scores is the geometric mean. Please check this paper out for details:
Fleming, Philip J., and John J. Wallace. "How not to lie with statistics: the correct way to summarize benchmark results." Communications of the ACM 29.3 (1986): 218-221.
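To illustrate the concern, here is a minimal sketch with made-up numbers. The model names and error values are hypothetical and are not taken from the repository or from monash.csv. It normalizes two models' errors by each baseline in turn and compares which model looks best under each mean (lower is better).

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # exp of the mean log; requires strictly positive values
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical raw errors on two datasets (illustrative only).
errors = {"model_a": [1.0, 10.0], "model_b": [4.0, 4.0]}

def normalized(errors, baseline):
    # Divide every model's error by the baseline's error, per dataset.
    base = errors[baseline]
    return {name: [e / b for e, b in zip(vals, base)]
            for name, vals in errors.items()}

def best(errors, baseline, aggregate):
    # Name of the model with the lowest aggregated normalized error.
    scores = {name: aggregate(vals)
              for name, vals in normalized(errors, baseline).items()}
    return min(scores, key=scores.get)

for baseline in errors:
    print(baseline,
          "| arithmetic:", best(errors, baseline, arithmetic_mean),
          "| geometric:", best(errors, baseline, geometric_mean))
```

With these numbers, the arithmetic mean picks a different winner depending on which model is the baseline, while the geometric mean picks model_a both times.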
Based on the numbers in https://github.com/ngruver/llmtime/blob/main/precomputed_outputs/deterministic_csvs/monash.csv, here are the plots that I get using the arithmetic and geometric mean.