Closed lucasjinreal closed 1 year ago
Sometimes it does happen on individual tasks that a smaller model outperforms a larger one slightly, especially when the model sizes are close. We've also observed this before in our previous paper.
Then how can we properly evaluate the real performance across different parameter counts? On the ANLI tasks, the 3B model outperforms all the 7B models. Does vocabulary size also matter?
To reliably evaluate these models, we generally look at many tasks in aggregation instead of just individual ones.
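A minimal sketch of what that aggregation might look like: compute an unweighted (macro) average of per-task accuracies and compare models on that, rather than on any single benchmark. The task names and numbers below are made up for illustration, not results from the paper.

```python
# Hypothetical per-task accuracies for two model sizes (illustrative
# numbers only, not real results).
scores = {
    "3b": {"anli_r1": 0.36, "hellaswag": 0.58, "arc_easy": 0.62, "boolq": 0.71},
    "7b": {"anli_r1": 0.34, "hellaswag": 0.63, "arc_easy": 0.68, "boolq": 0.74},
}

def macro_average(task_scores):
    """Unweighted mean accuracy across tasks (macro-average)."""
    return sum(task_scores.values()) / len(task_scores)

for model, task_scores in scores.items():
    print(f"{model}: macro-avg accuracy = {macro_average(task_scores):.4f}")
```

With these made-up numbers, the 7B model loses to the 3B on `anli_r1` but still comes out ahead in aggregate, which is why a single task can be misleading.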
From the table, it seems the 7B accuracy is not higher than the 3B's. Why?