huggingface / cosmopedia

questions about evaluation like MMLU #27

Open ftgreat opened 3 months ago

ftgreat commented 3 months ago

Thank you for sharing.

Common benchmarks like MMLU typically use a 5-shot setting to measure a model's in-context learning capabilities.

Can you explain why your MMLU evaluation instead uses a zero-shot setting combined with the option content (rather than the option letters)?
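For concreteness, here is a minimal sketch of the two setups as I understand them. The function names and prompt templates are hypothetical and written only for illustration; they are not the exact templates used by any evaluation harness or by the blog. In both cases the model would be scored by comparing the log-likelihood of each candidate continuation.

```python
from typing import List, Optional, Tuple

LETTERS = "ABCD"

# A shot is (question, options, index of the correct option).
Shot = Tuple[str, List[str], int]


def format_mc_question(question: str, options: List[str],
                       answer_idx: Optional[int] = None) -> str:
    """Render one question with lettered options (hypothetical template)."""
    lines = [question]
    lines += [f"{letter}. {opt}" for letter, opt in zip(LETTERS, options)]
    # Solved examples carry the answer letter; the final query leaves it blank.
    answer = f" {LETTERS[answer_idx]}" if answer_idx is not None else ""
    lines.append("Answer:" + answer)
    return "\n".join(lines)


def five_shot_letter_prompt(shots: List[Shot], question: str,
                            options: List[str]) -> Tuple[str, List[str]]:
    """Few-shot setup: solved examples first, candidates are option letters."""
    blocks = [format_mc_question(q, o, a) for q, o, a in shots]
    blocks.append(format_mc_question(question, options))
    continuations = [f" {letter}" for letter in LETTERS[:len(options)]]
    return "\n\n".join(blocks), continuations


def zero_shot_cloze_prompt(question: str,
                           options: List[str]) -> Tuple[str, List[str]]:
    """Zero-shot setup: no examples, candidates are the option texts."""
    prompt = f"Question: {question}\nAnswer:"
    continuations = [f" {opt}" for opt in options]
    return prompt, continuations
```

In the first setup the model has to pick a letter after seeing the format demonstrated five times; in the second it only has to assign a higher likelihood to the correct answer text, with no examples and no letters involved.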

According to your blog, in the zero-shot setup your model's MMLU scores are higher than those of QWen1.5B and the Phi models, whereas in the 5-shot evaluation the ranking is reversed. Is this discrepancy reasonable? Thank you.