huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

Evaluation of peft models using lm-eval-harness #2182

Closed JINO-ROHIT closed 2 days ago

JINO-ROHIT commented 1 month ago

Feature request

I would like to make an example notebook for evaluating PEFT models on reproducible tasks and metrics using the lm-eval harness, if possible.

Library here - https://github.com/EleutherAI/lm-evaluation-harness

Motivation

Evaluating LLMs often involves benchmark datasets, but minor implementation details can significantly affect results, making it difficult to compare outcomes across different codebases. The lm-evaluation-harness provides a standardized method of evaluating models, but I found very limited resources and documentation on how to apply it to PEFT models.

Your contribution

I'm happy to raise a PR for an example notebook.

BenjaminBossan commented 1 month ago

Thanks for offering to create a notebook to show LM evaluation harness applied to a PEFT model. Examples like this are always welcome.

Just for my understanding, as I don't have experience with this package: what would be involved? Skimming their README, I could see that they already provide a script to evaluate HF models:

https://github.com/EleutherAI/lm-evaluation-harness/tree/main?tab=readme-ov-file#hugging-face-transformers

Presumably that also works with local models, not just models hosted on the Hub. Would that already be sufficient to evaluate PEFT models, or are more steps needed that would be useful to have in a notebook?
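For reference, their README shows CLI invocations along these lines (copied pattern from their docs; the model and task here are just illustrative):

```sh
# Evaluate a Hugging Face model on a benchmark task
lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device cuda:0 \
    --batch_size 8
```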

JINO-ROHIT commented 1 month ago

Yep, it's more or less the same.

Evaluating with PEFT is covered in the advanced usage section - https://github.com/EleutherAI/lm-evaluation-harness?tab=readme-ov-file#advanced-usage-tips
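Per that section, the harness accepts a `peft=` key in `--model_args` to load an adapter on top of the base model. An invocation might look like this (a sketch; the adapter path is a placeholder):

```sh
# Evaluate base model + PEFT adapter; peft=... is documented in the harness README
lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B,peft=path/to/lora-adapter \
    --tasks hellaswag \
    --device cuda:0 \
    --batch_size 8
```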

I thought it'd be nice to have an example notebook with this workflow (sketched below):

  1. Take a base model and check metrics on some task.
  2. Fine-tune it with some PEFT method.
  3. Load the fine-tuned PEFT model, evaluate again on the same task, and compare performance.

Evaluating with lm-eval-harness is quite controlled and stable, and it produces consistent results. I wasn't able to find this type of workflow anywhere on the internet for PEFT models. If this workflow sits within the purview of the repo, we could add it. WDYT?
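In notebook form, the flow could look roughly like this (a sketch using the harness's Python API; the model, task, and LoRA hyperparameters are placeholders, and the training loop is elided):

```python
import lm_eval  # EleutherAI lm-evaluation-harness
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_id = "EleutherAI/gpt-j-6B"  # placeholder base model
tasks = ["hellaswag"]            # placeholder task

# 1. Baseline: metrics for the base model
base_results = lm_eval.simple_evaluate(
    model="hf",
    model_args=f"pretrained={base_id}",
    tasks=tasks,
)
print(base_results["results"])

# 2. Fine-tune with a PEFT method (LoRA here), then save the adapter
base_model = AutoModelForCausalLM.from_pretrained(base_id)
peft_model = get_peft_model(
    base_model,
    LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"),
)
# ... training loop elided ...
peft_model.save_pretrained("lora-adapter")

# 3. Evaluate base model + adapter on the same task and compare
peft_results = lm_eval.simple_evaluate(
    model="hf",
    model_args=f"pretrained={base_id},peft=lora-adapter",
    tasks=tasks,
)
print(peft_results["results"])
```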

BenjaminBossan commented 1 month ago

I see, thanks for providing more details. I agree that this would be a great addition to have in PEFT.

JINO-ROHIT commented 1 month ago

Cool, I will start working on this notebook.

github-actions[bot] commented 2 days ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

BenjaminBossan commented 2 days ago

Resolved via #2190.