Metric Scores and NQ Evaluation

princeton-nlp / LLM-Shearing

[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

https://arxiv.org/abs/2310.06694

MIT License


Metric Scores and NQ Evaluation #41

Spico197 closed this issue 8 months ago

Spico197 commented 8 months ago

Hi there, thanks very much for such amazing work! I'm currently trying to reproduce the results from the paper, and I have two questions:

1. Metric scores: lm-eval-harness returns both acc and acc_norm for the same task. Which of the two did you report?
2. NQ evaluation: I've checked the evaluation codebase and your starter scripts, but I couldn't find how the Natural Questions (NQ) dataset is evaluated (it is not implemented in lm-eval-harness yet). Did you write your own evaluation script, or adopt an existing repo to get the scores?
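For readers unfamiliar with the two metrics in the first question, here is a minimal sketch of how they differ on a single multiple-choice example. It assumes acc picks the answer with the highest raw log-likelihood and acc_norm picks the answer with the highest log-likelihood divided by the answer's character length, which is my understanding of the harness's multiple-choice tasks. The function name `score_multiple_choice` and the numbers are made up for illustration, and every choice is assumed to be non-empty.

```python
def score_multiple_choice(loglikelihoods, choices, gold_index):
    """Return (acc, acc_norm) for one multiple-choice example.

    Illustrative helper, not part of lm-eval-harness.
    """
    indices = range(len(choices))

    # acc: the prediction is the choice with the highest raw log-likelihood.
    pred_raw = max(indices, key=lambda i: loglikelihoods[i])

    # acc_norm: divide each log-likelihood by the choice's character length
    # first, so longer answers are not penalized just for being long.
    normalized = [ll / len(c) for ll, c in zip(loglikelihoods, choices)]
    pred_norm = max(indices, key=lambda i: normalized[i])

    return float(pred_raw == gold_index), float(pred_norm == gold_index)


# Made-up numbers: the long (correct) answer has a lower raw log-likelihood
# but a better per-character log-likelihood.
choices = ["yes", "a much longer answer"]
loglikelihoods = [-3.0, -8.0]
acc, acc_norm = score_multiple_choice(loglikelihoods, choices, gold_index=1)
print(acc, acc_norm)
```

The two metrics can disagree on the same example, so a reproduction has to use whichever one the original numbers were reported with.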

Thank you so much for your time and response~

xiamengzhou commented 8 months ago

Hi! Sorry for the late reply:

We have updated the evaluation script, run_eval.sh, to cover nq, boolq, and gsm8k. For NQ, we use the nq_open task from the harness repo.

We have also updated the README.md file to list the metrics we use.
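As a rough sketch of how a per-task metric choice could be applied when collecting scores (this is not the script from the repo), the snippet below picks one metric per task from a dictionary shaped like `{task: {metric_name: value}}`. The helper `pick_metrics`, the example scores, and the task-to-metric mapping are all hypothetical; the README.md is the authoritative source for which metric belongs to which task, and the key names in real harness output may differ between versions.

```python
def pick_metrics(results, preferred, default="acc"):
    """Select one score per task.

    results:   {task: {metric_name: value}} (assumed shape)
    preferred: {task: metric_name}; tasks not listed fall back to `default`.
    """
    picked = {}
    for task, scores in results.items():
        name = preferred.get(task, default)
        if name not in scores:
            raise KeyError(
                f"{task!r} has no metric {name!r}; available: {sorted(scores)}"
            )
        picked[task] = scores[name]
    return picked


# Placeholder scores and a placeholder preference, for illustration only.
example_results = {
    "hellaswag": {"acc": 0.5, "acc_norm": 0.7},
    "boolq": {"acc": 0.6},
}
example_preference = {"hellaswag": "acc_norm"}

selected = pick_metrics(example_results, example_preference)
print(selected)
```

Raising on a missing metric, rather than silently falling back, makes a mismatch between the chosen metric and the harness output visible right away.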

Hope it helps and feel free to reach out again for any further questions!

Spico197 commented 8 months ago

Thank you very much for the reply and the code update~