ngruver / llmtime

https://arxiv.org/abs/2310.07820
MIT License

How were the normalized scores aggregated? #16

Open abdulfatir opened 7 months ago

abdulfatir commented 7 months ago

Thank you for releasing the code! This is a very interesting piece of work. Congrats on the NeurIPS acceptance! 🎉

As per my understanding, you're aggregating per-dataset normalized scores to report the final scaled score, and it looks like you're using the arithmetic mean for that aggregation. Please correct me if I am wrong.

Using the arithmetic mean may not be the best way of summarizing a normalized metric and can lead to misleading conclusions: the arithmetic mean of baseline-normalized scores depends on which method is chosen as the normalization baseline, so it can even reverse the ranking of methods. The geometric mean does not suffer from this problem and is the better way to aggregate normalized scores. Please check this paper out for details:

Fleming, Philip J., and John J. Wallace. "How not to lie with statistics: the correct way to summarize benchmark results." Communications of the ACM 29.3 (1986): 218-221.
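To make the point concrete, here is a minimal sketch (numbers are made up for illustration, not taken from the repo or paper) showing how the arithmetic mean of normalized scores can flip a ranking depending on the normalization baseline, while the geometric mean stays consistent:

```python
# Illustrative sketch: aggregating baseline-normalized errors.
# The MAE values below are hypothetical, chosen only to show the effect.
import numpy as np
from scipy.stats import gmean

# Raw MAEs of two hypothetical methods on two datasets.
mae_a = np.array([1.0, 20.0])
mae_b = np.array([2.0, 10.0])

for name, baseline in [("A", mae_a), ("B", mae_b)]:
    norm_a = mae_a / baseline  # method A normalized by the chosen baseline
    norm_b = mae_b / baseline  # method B normalized by the chosen baseline
    print(f"baseline {name}: "
          f"arith A={norm_a.mean():.2f}, B={norm_b.mean():.2f} | "
          f"geom A={gmean(norm_a):.2f}, B={gmean(norm_b):.2f}")

# Output: with the arithmetic mean, A looks better when A is the baseline and
# B looks better when B is the baseline; the geometric mean gives the same
# relative ordering regardless of the baseline.
```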

Based on the numbers in https://github.com/ngruver/llmtime/blob/main/precomputed_outputs/deterministic_csvs/monash.csv, here are the plots that I get using the arithmetic and geometric mean.

[Plot: scaled scores aggregated with the arithmetic mean]

[Plot: scaled scores aggregated with the geometric mean]
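For completeness, the aggregates can be computed from a per-dataset CSV roughly along these lines. This is only a sketch: the column names (`dataset`, `baseline_mae`) are placeholders and may not match the actual layout of `monash.csv`.

```python
# Hedged sketch: aggregate per-dataset MAEs normalized against a baseline
# column. Column names here are assumptions, not the file's real schema.
import pandas as pd
from scipy.stats import gmean

df = pd.read_csv("precomputed_outputs/deterministic_csvs/monash.csv")

# Treat every column other than the dataset name and the baseline as a method.
method_cols = [c for c in df.columns if c not in ("dataset", "baseline_mae")]
normalized = df[method_cols].div(df["baseline_mae"], axis=0)

summary = pd.DataFrame({
    "arithmetic_mean": normalized.mean(),
    "geometric_mean": normalized.apply(gmean),
})
print(summary.sort_values("geometric_mean"))
```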

ngruver commented 7 months ago

Thanks for the note Abdul!

The reported values are an arithmetic mean, and you're correct that this is probably suboptimal. Genuine apologies for the error on my part.

I am planning to update the arXiv version with extended experiments from our NeurIPS camera-ready, and I'll include this correction as well.

Please let me know if you have any other comments.

Nate

abdulfatir commented 7 months ago

@ngruver Thanks for your reply. It's an easy mistake to make. In fact, I only found out about the geometric mean idea very recently. Looking forward to the updated results.

Cheers!