gururise / AlpacaDataCleaned

Alpaca dataset from Stanford, cleaned and curated
Apache License 2.0

Program to evaluate models on Squad and Wikitext #46

Closed · gururise closed this issue 1 year ago

gururise commented 1 year ago

Benchmark llama models on Open Source Datasets

Usage

```
usage: eval.py [-h] -b BASE_MODEL [-l LORA_WEIGHTS] [-d {wikitext,squadmini,squad}] [-q]

options:
  -h, --help            show this help message and exit
  -b BASE_MODEL, --base-model BASE_MODEL
                        Choose the base model
  -l LORA_WEIGHTS, --lora-weights LORA_WEIGHTS
                        Choose the lora weights (optional)
  -d {wikitext,squadmini,squad}, --datasets {wikitext,squadmini,squad}
                        Choose Evaluation Dataset. [default = squadmini]
  -q, --use-8bit        Use 8-bit quant
```
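
For reference, here is a minimal sketch of what the loading path behind these options might look like, assuming Hugging Face transformers, peft, and bitsandbytes; the `load_model` helper is hypothetical and the actual eval.py may differ in its details:

```python
# Minimal sketch of loading a base LLaMA model plus optional LoRA weights,
# with optional 8-bit quantization. Assumes transformers, peft, and
# bitsandbytes; the actual eval.py may differ.
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

def load_model(base_model, lora_weights=None, use_8bit=False):
    tokenizer = LlamaTokenizer.from_pretrained(base_model)
    model = LlamaForCausalLM.from_pretrained(
        base_model,
        load_in_8bit=use_8bit,      # -q / --use-8bit
        torch_dtype=torch.float16,
        device_map="auto",
    )
    if lora_weights:                # -l / --lora-weights (optional)
        model = PeftModel.from_pretrained(
            model, lora_weights, torch_dtype=torch.float16
        )
    model.eval()
    return model, tokenizer
```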

Datasets

- wikitext: WikiText language-modeling dataset, typically scored by perplexity (see the sketch below)
- squad: SQuAD question-answering dataset, scored by F1 / exact match
- squadmini: a smaller subset of SQuAD (the default)
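
For the wikitext option, a common approach for causal LMs is sliding-window perplexity. A minimal sketch, assuming the Hugging Face datasets library and a (model, tokenizer) pair like the one loaded above; the repo's implementation may differ:

```python
# Sketch of stride-based perplexity on wikitext-2, following the
# standard Hugging Face recipe. Assumes `datasets`; eval.py may differ.
import torch
from datasets import load_dataset

def wikitext_perplexity(model, tokenizer, max_len=2048, stride=512):
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    enc = tokenizer("\n\n".join(test["text"]), return_tensors="pt")
    input_ids = enc.input_ids.to(model.device)
    seq_len = input_ids.size(1)

    nlls, n_scored, prev_end = [], 0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_len, seq_len)
        trg_len = end - prev_end        # tokens not scored in a prior window
        ids = input_ids[:, begin:end]
        labels = ids.clone()
        labels[:, :-trg_len] = -100     # mask the overlapping context
        with torch.no_grad():
            loss = model(ids, labels=labels).loss
        nlls.append(loss * trg_len)
        n_scored += trg_len
        prev_end = end
        if end == seq_len:
            break
    return torch.exp(torch.stack(nlls).sum() / n_scored).item()
```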

Example

```
eval.py --base-model decapoda-research/llama-7b-hf --lora-weights samwit/alpaca7B-lora --datasets squadmini
```

Comparison

As stated in issue #44, I compared different Alpaca 7B models on the SQuAD dataset:

| Fine-tuning dataset | Model | SQuAD (mini) F1 |
| --- | --- | --- |
| Original Alpaca | samwit/alpaca7B-lora | 34.63 |
| Cleaned Alpaca (Mar 27) | tloen/alpaca-lora-7b | 49.64 |
| Cleaned Alpaca (Mar 31) | yahma/alpaca-7b-lora | 55.229 |

On SQuAD "mini", the samwit model gets an average F1 score of 34.63, while the tloen model trained on the cleaned dataset gets an average F1 score of 49.04.
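
For context, the SQuAD F1 score is the token-overlap F1 between a predicted answer and the reference answers, averaged over questions. A minimal sketch of scoring with the standard SQuAD metric, assuming the Hugging Face evaluate library (the repo's script may compute the metric differently):

```python
# Sketch of scoring predictions with the standard SQuAD metric.
# Assumes the Hugging Face `evaluate` library; eval.py may differ.
# The question id and texts below are illustrative placeholders.
import evaluate

squad_metric = evaluate.load("squad")

predictions = [
    {"id": "q1", "prediction_text": "Denver Broncos"},
]
references = [
    {"id": "q1", "answers": {"text": ["Denver Broncos"], "answer_start": [177]}},
]

print(squad_metric.compute(predictions=predictions, references=references))
# -> {'exact_match': 100.0, 'f1': 100.0}
```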

At least on the surface, it appears the cleaning & curation we've been doing has helped significantly.

JACKHAHA363 commented 1 year ago

this is awesome!

ndvbd commented 11 months ago

That's great, though I believe the title should be changed to "Benchmark Alpaca-llama models" instead of "Benchmark llama models", since it can only be run on 'instruct' language models.