gururise / AlpacaDataCleaned

Alpaca dataset from Stanford, cleaned and curated
Apache License 2.0

Evaluation Metric #44

Open gururise opened 1 year ago

gururise commented 1 year ago

Planning on adding an evaluation metric that can be used to benchmark trained alpaca models.

Going to focus on these two datasets for evaluation:

  1. SQuAD Dataset - F1 Score
  2. WikiText Dataset - Perplexity

I'm not so sure the WikiText perplexity score will give us much useful information, but it seems to be a popular metric for these foundation models. I'm more interested in the SQuAD F1 score, which will give us a standard benchmark for a Q/A task. Even though Alpaca is not trained on a SQuAD-style dataset, I think with the right prompt it can be done.
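For reference, the SQuAD F1 I have in mind is the usual token-overlap score between the model's answer and the reference answer. A minimal sketch of that metric (the standard formulation, not necessarily the exact scoring code this repo will end up using):

```python
# Minimal sketch of the standard SQuAD token-overlap F1.
# Both strings are normalized (lowercased, punctuation and articles
# stripped), then precision/recall are computed over shared tokens.
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def squad_f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(squad_f1("the Eiffel Tower in Paris", "Eiffel Tower"))  # ~0.67
```

The dataset-level score is then the average per-question F1 (taking the max over reference answers when a question has several).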

gururise commented 1 year ago

So I've compared two different Alpaca 7B models on the SQuAD dataset:

| Dataset | Model | SQuAD (Mini) F1 |
| --- | --- | --- |
| Original Alpaca | samwit/alpaca7B-lora | 34.63 |
| Cleaned Alpaca | tloen/alpaca-lora-7b | 49.64 |

At least on the surface, it appears the cleaning & curation we've been doing has helped significantly.

claysauruswrecks commented 1 year ago

I have mentioned a few options previously in this issue: https://github.com/tloen/alpaca-lora/issues/147

gururise commented 1 year ago

Just FYI: I re-ran the SQuAD Mini bench on a model I fine-tuned on the March 31 release of the cleaned dataset and got an average F1 score of 55.229.

gururise commented 1 year ago

Just added a PIQA benchmark and also redid the scoring of the SQuAD bench:

| Dataset | Hugging Face model | Parameters | SQuAD Mini (F1) | PIQA (acc) |
| --- | --- | --- | --- | --- |
| Original Alpaca | samwit/alpaca7B-lora | 7B | 74.271 | 50.5 |
| Cleaned Alpaca (Mar 26) | tloen/alpaca-lora-7b | 7B | 75.629 | 54.0 |
| Cleaned Alpaca (Mar 31) | yahma/alpaca-7b-lora | 7B | 76.388 | 52.6 |
| GPT4All | nomic-ai/gpt4all-lora | 7B | 72.643 | 49.5 |

Note: PIQA benchmark has issues. Do not use it yet.
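For context, PIQA accuracy is usually computed without generation: each of the two candidate solutions is scored by the model's log-likelihood and the higher-scoring one is taken as the prediction. A rough sketch under that assumption (the model path, prompt text, and continuation-length bookkeeping here are placeholders/simplifications, not the code behind the numbers above):

```python
# Rough sketch of PIQA-style multiple-choice scoring with a causal LM:
# score each candidate continuation by summed token log-likelihood and
# pick the higher one. The model path is a placeholder, and the
# continuation-length handling is approximate (real harnesses treat
# tokenizer boundary effects more carefully).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/alpaca-7b"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()


@torch.no_grad()
def loglik(prompt: str, continuation: str) -> float:
    """Summed log-probability of `continuation` given `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lp = logprobs[torch.arange(targets.shape[0]), targets]
    cont_len = full_ids.shape[1] - prompt_len
    return token_lp[-cont_len:].sum().item()


goal = "To remove sticker residue from glass,"
sol1 = " rub it with rubbing alcohol on a cloth."
sol2 = " rub it with a dry paper towel only."
prediction = 0 if loglik(goal, sol1) > loglik(goal, sol2) else 1
```

The acc_norm variant reported later by the EleutherAI harness additionally normalizes each candidate's log-likelihood by its length before comparing.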

gururise commented 1 year ago

Decided to standardize on EleutherAI's lm-evaluation-harness instead. Here are the new results:

| Dataset | Model | Parameters | WikiText (ppl) | MNLI (acc) | PIQA (acc_norm) |
| --- | --- | --- | --- | --- | --- |
| Original Alpaca | samwit/alpaca7B-lora | 7B (LoRA) | 9.5396 | 38.33 | 78.51 |
| Cleaned Alpaca (Mar 26) | tloen/alpaca-lora-7b | 7B (LoRA) | 9.4885 | 51.6 | 79.33 |
| GPT4All | nomic-ai/gpt4all-lora | 7B (LoRA) | 10.09 | 38.97 | 78.40 |

Not sure why the model trained on the cleaned dataset scored so high in the MNLI benchmark. I ran the test multiple times to confirm.
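For anyone reproducing these numbers, the harness can be driven from Python roughly like this. This is only an outline: the model type string, model_args format, and task names vary between harness versions, and the path below is a placeholder (the LoRA adapter needs to be merged into the base model or loaded via whatever PEFT support your harness version provides).

```python
# Outline of running EleutherAI's lm-evaluation-harness from Python.
# Exact arguments differ between harness versions; the model path is a
# placeholder for a merged Alpaca-LoRA checkpoint.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                               # Hugging Face causal-LM backend
    model_args="pretrained=path/to/merged-alpaca-7b",
    tasks=["wikitext", "mnli", "piqa"],
    batch_size=4,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```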

YukinoshitaKaren commented 1 year ago

May I ask a question? Two weeks ago you got a SQuAD (Mini) F1 of 49.64 with 'tloen/alpaca-lora-7b', but last week you got 75.629 with the same model. Why are the two results so different? I tried this model and got a SQuAD (Mini) F1 of around 55.07.

gururise commented 1 year ago

> May I ask a question? Two weeks ago you got a SQuAD (Mini) F1 of 49.64 with 'tloen/alpaca-lora-7b', but last week you got 75.629 with the same model. Why are the two results so different? I tried this model and got a SQuAD (Mini) F1 of around 55.07.

The SQuAD Mini score calculations were redone in that time. Anyhow, going forward, we are ditching the eval.py benchmark script and using the lm-evaluation-harness from EleutherAI. The scores reported in the main README come directly from the lm-evaluation-harness report.

claysauruswrecks commented 1 year ago


https://s3.amazonaws.com/static.nomic.ai/gpt4all/2023_GPT4All-J_Technical_Report_2.pdf