gururise / AlpacaDataCleaned

Alpaca dataset from Stanford, cleaned and curated
Apache License 2.0

Evaluation Metric #44

Open gururise opened 1 year ago

gururise commented 1 year ago

Planning on adding an evaluation metric that can be used to benchmark trained alpaca models.

Going to focus on these two datasets for evaluation:

  1. SQuAD Dataset - F1 Score
  2. WikiText Dataset - Perplexity

I'm not so sure the WikiText perplexity score will give us much useful information, but it seems to be a popular metric for these foundation models. I'm more interested in the SQuAD F1 score, which will give us a standard benchmark for a Q/A task. Even though Alpaca is not trained on a SQuAD-style dataset, I think with the right prompt it can be done.
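For reference, the SQuAD F1 I have in mind is the usual token-overlap score between the model's answer and the reference answer. A minimal sketch of that metric (the standard formulation, not necessarily the exact scoring code this repo will end up using):

```python
# Minimal sketch of the standard SQuAD token-overlap F1.
# Both strings are normalized (lowercased, punctuation and articles
# stripped), then precision/recall are computed over shared tokens.
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def squad_f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(squad_f1("the Eiffel Tower in Paris", "Eiffel Tower"))  # ~0.67
```

The dataset-level score is then the average per-question F1 (taking the max over reference answers when a question has several).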

gururise commented 1 year ago

So I've compared two different Alpaca 7B models on the SQuAD dataset:

| Dataset | Model | SQuAD (Mini) F1 |
| --- | --- | --- |
| Original Alpaca | samwit/alpaca7B-lora | 34.63 |
| Cleaned Alpaca | tloen/alpaca-lora-7b | 49.64 |

At least on the surface, it appears the cleaning & curation we've been doing has helped significantly.

claysauruswrecks commented 1 year ago

I have mentioned a few options previously in this issue: https://github.com/tloen/alpaca-lora/issues/147

gururise commented 1 year ago

Just FYI: I re-ran the SQuAD Mini bench on a model I fine-tuned on the March 31 release of the cleaned dataset and got an average F1 score of 55.229.

gururise commented 1 year ago

Just added a PIQA benchmark and also redid the scoring of the SQuAD bench:

| Dataset | Hugging Face model | Parameters | SQuAD Mini (F1) | PIQA (acc) |
| --- | --- | --- | --- | --- |
| Original Alpaca | samwit/alpaca7B-lora | 7B | 74.271 | 50.5 |
| Cleaned Alpaca (Mar 26) | tloen/alpaca-lora-7b | 7B | 75.629 | 54.0 |
| Cleaned Alpaca (Mar 31) | yahma/alpaca-7b-lora | 7B | 76.388 | 52.6 |
| GPT4All | nomic-ai/gpt4all-lora | 7B | 72.643 | 49.5 |

Note: PIQA benchmark has issues. Do not use it yet.
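For context, PIQA accuracy is usually computed without generation: each of the two candidate solutions is scored by the model's log-likelihood and the higher-scoring one is taken as the prediction. A rough sketch under that assumption (the model path, prompt text, and continuation-length bookkeeping here are placeholders/simplifications, not the code behind the numbers above):

```python
# Rough sketch of PIQA-style multiple-choice scoring with a causal LM:
# score each candidate continuation by summed token log-likelihood and
# pick the higher one. The model path is a placeholder, and the
# continuation-length handling is approximate (real harnesses treat
# tokenizer boundary effects more carefully).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/alpaca-7b"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()


@torch.no_grad()
def loglik(prompt: str, continuation: str) -> float:
    """Summed log-probability of `continuation` given `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lp = logprobs[torch.arange(targets.shape[0]), targets]
    cont_len = full_ids.shape[1] - prompt_len
    return token_lp[-cont_len:].sum().item()


goal = "To remove sticker residue from glass,"
sol1 = " rub it with rubbing alcohol on a cloth."
sol2 = " rub it with a dry paper towel only."
prediction = 0 if loglik(goal, sol1) > loglik(goal, sol2) else 1
```

The acc_norm variant reported later by the EleutherAI harness additionally normalizes each candidate's log-likelihood by its length before comparing.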

gururise commented 1 year ago

Decided to standardize on EleutherAI's lm-evaluation-harness instead. Here are the new results:

| Dataset | Model | Parameters | WikiText (ppl) | MNLI (acc) | PIQA (acc_norm) |
| --- | --- | --- | --- | --- | --- |
| Original Alpaca | samwit/alpaca7B-lora | 7B (LoRA) | 9.5396 | 38.33 | 78.51 |
| Cleaned Alpaca (Mar 26) | tloen/alpaca-lora-7b | 7B (LoRA) | 9.4885 | 51.6 | 79.33 |
| GPT4All | nomic-ai/gpt4all-lora | 7B (LoRA) | 10.09 | 38.97 | 78.40 |

Not sure why the model trained on the cleaned dataset scored so high in the MNLI benchmark. I ran the test multiple times to confirm.
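For anyone reproducing these numbers, the harness can be driven from Python roughly like this. This is only an outline: the model type string, model_args format, and task names vary between harness versions, and the path below is a placeholder (the LoRA adapter needs to be merged into the base model or loaded via whatever PEFT support your harness version provides).

```python
# Outline of running EleutherAI's lm-evaluation-harness from Python.
# Exact arguments differ between harness versions; the model path is a
# placeholder for a merged Alpaca-LoRA checkpoint.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                               # Hugging Face causal-LM backend
    model_args="pretrained=path/to/merged-alpaca-7b",
    tasks=["wikitext", "mnli", "piqa"],
    batch_size=4,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```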

YukinoshitaKaren commented 1 year ago

May I ask a question? Two weeks ago you got a SQuAD (Mini) F1 of 49.64 with 'tloen/alpaca-lora-7b', but last week you got 75.629 with the same model. Why are the two results so different? I tried this model and got a SQuAD (Mini) F1 of around 55.07.

gururise commented 1 year ago

> May I ask a question? Two weeks ago you got a SQuAD (Mini) F1 of 49.64 with 'tloen/alpaca-lora-7b', but last week you got 75.629 with the same model. Why are the two results so different? I tried this model and got a SQuAD (Mini) F1 of around 55.07.

The SQuAD Mini score calculations were redone in that time. Anyhow, going forward, we are ditching the eval.py benchmark script and using the lm-evaluation-harness from EleutherAI. The scores reported in the main README come directly from the lm-evaluation-harness report.

claysauruswrecks commented 1 year ago


https://s3.amazonaws.com/static.nomic.ai/gpt4all/2023_GPT4All-J_Technical_Report_2.pdf