gururise opened this issue 1 year ago
So I've compared two different alpaca 7b models on the Squad Dataset:
Training dataset | Model (Hugging Face) | Squad(Mini) F1
---|---|---
Original Alpaca | samwit/alpaca7B-lora | 34.63
Cleaned Alpaca | tloen/alpaca-lora-7b | 49.64
At least on the surface, it appears the cleaning & curation we've been doing has helped significantly.
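For context, the Squad(Mini) numbers are the standard SQuAD token-overlap F1. Below is a minimal sketch of that metric (assuming the usual SQuAD answer normalization; this is an illustration, not the exact code in our eval script):

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """Standard SQuAD normalization: lowercase, drop punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def squad_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between a predicted answer and one gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Per question, F1 is taken as the max over all gold answers; the benchmark reports the average.
```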
I have mentioned a few options previously in this issue: https://github.com/tloen/alpaca-lora/issues/147
Just FYI. I re-ran the SQUADmini bench on a model I fine-tuned on the March 31 release of the cleaned dataset and got an avg F1 score of 55.229.
Just added the PIQA benchmark and redid the scoring of the Squad bench:
Training dataset | Hugging Face model | Parameters | SquadMini (F1) | PIQA (acc)
---|---|---|---|---
Original Alpaca | samwit/alpaca7B-lora | 7b | 74.271 | |
Cleaned Alpaca (Mar 26) | tloen/alpaca-lora-7b | 7b | 75.629 | |
Cleaned Alpaca (Mar 31) | yahma/alpaca-7b-lora | 7b | 76.388 | |
GPT4All | nomic-ai/gpt4all-lora | 7b | 72.643 | |
Note: PIQA benchmark has issues. Do not use it yet.
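For anyone reproducing these numbers: the repos in the table are LoRA adapters, so they have to be applied on top of the base LLaMA-7B weights before running any benchmark. A minimal loading sketch with PEFT (the base checkpoint name and fp16 settings are assumptions; adjust to whatever base weights you trained against):

```python
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

BASE = "decapoda-research/llama-7b-hf"  # assumption: base LLaMA-7B weights
ADAPTER = "tloen/alpaca-lora-7b"        # any LoRA repo from the table above

tokenizer = LlamaTokenizer.from_pretrained(BASE)
model = LlamaForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER, torch_dtype=torch.float16)
model.eval()
```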
Decided to standardize by using the lm-eval-harness by EleutherAI instead. Here are the new results:
Training dataset | Model | Parameters | WikiText (ppl) | MNLI (acc) | PIQA (acc norm)
---|---|---|---|---|---
Original Alpaca | samwit/alpaca7B-lora | 7b (lora) | 9.5396 | 38.33 | 78.51
Cleaned Alpaca (Mar 26) | tloen/alpaca-lora-7b | 7b (lora) | 9.4885 | 51.6 | 79.33
GPT4All | nomic-ai/gpt4all-lora | 7b (lora) | 10.09 | 38.97 | 78.40
Not sure why the model trained on the cleaned dataset scored so high in the MNLI benchmark. I ran the test multiple times to confirm.
May I ask a question? Two weeks ago you used 'tloen/alpaca-lora-7b' and got a 49.64 Squad(Mini) F1, but last week the same model got 75.629. Why are the two results so different? I tried this model myself and got around 55.07 Squad(Mini) F1.
The SQUAD MINI score calculations were re-done in that time. Anyhow, going forward, we are ditching the benchmark eval.py and using the lm-evaluation-harness from EleutherAI. The scores reported in the main README are directly from the lm-evaluation-harness report.
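In case anyone wants to reproduce the harness runs, here is a rough sketch of its Python entry point. It assumes the LoRA weights have already been merged into a standalone checkpoint (the path below is a placeholder); depending on your harness revision you may prefer its CLI instead:

```python
from lm_eval import evaluator

# Assumption: LoRA weights already merged into a full checkpoint at this (placeholder) path.
results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=/path/to/merged-alpaca-7b",
    tasks=["wikitext", "mnli", "piqa"],
    num_fewshot=0,
    batch_size=4,
    device="cuda:0",
)
print(results["results"])  # per-task metrics, e.g. wikitext perplexity, mnli acc, piqa acc_norm
```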
Planning on adding an evaluation metric that can be used to benchmark trained alpaca models.
Going to focus on these two datasets for evaluation:

- WikiText (perplexity)
- SQuAD (F1)
I'm not so sure the WikiText perplexity score will give us much useful information, but it seems to be a popular metric for these foundation models. I'm more interested in the Squad F1 score, which will give us a standard benchmark for a Q/A task. Even though alpaca is not trained on a Squad-style dataset, I think with the right prompt it can be done.
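For example, one way to phrase an extractive-QA prompt is to drop the SQuAD context and question into the standard Alpaca instruction/input template; the exact instruction wording below is an assumption, not necessarily what the eval script will use:

```python
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that provides "
    "further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)

def squad_prompt(context: str, question: str) -> str:
    """Wrap a SQuAD context/question pair in the Alpaca prompt format."""
    instruction = (
        "Answer the question using only the provided context. "
        "Keep the answer as short as possible."
    )
    return ALPACA_TEMPLATE.format(
        instruction=instruction,
        input=f"{context}\n\nQuestion: {question}",
    )
```

The text the model generates after "### Response:" would then be scored against the gold answers with the F1 metric above.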