gururise / AlpacaDataCleaned

Alpaca dataset from Stanford, cleaned and curated
Apache License 2.0

Program to evaluate models on Squad and Wikitext #46

Closed · gururise closed this issue 1 year ago

gururise commented 1 year ago

Benchmark llama models on Open Source Datasets

Usage

```
usage: eval.py [-h] -b BASE_MODEL [-l LORA_WEIGHTS] [-d {wikitext,squadmini,squad}] [-q]

options:
  -h, --help            show this help message and exit
  -b BASE_MODEL, --base-model BASE_MODEL
                        Choose the base model
  -l LORA_WEIGHTS, --lora-weights LORA_WEIGHTS
                        Choose the lora weights (optional)
  -d {wikitext,squadmini,squad}, --datasets {wikitext,squadmini,squad}
                        Choose Evaluation Dataset. [default = squadmini]
  -q, --use-8bit        Use 8-bit quant
```
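
For reference, here is a minimal sketch of what the loading path behind these options might look like, assuming Hugging Face transformers, peft, and bitsandbytes; the `load_model` helper is hypothetical and the actual eval.py may differ in its details:

```python
# Minimal sketch of loading a base LLaMA model plus optional LoRA weights,
# with optional 8-bit quantization. Assumes transformers, peft, and
# bitsandbytes; the actual eval.py may differ.
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

def load_model(base_model, lora_weights=None, use_8bit=False):
    tokenizer = LlamaTokenizer.from_pretrained(base_model)
    model = LlamaForCausalLM.from_pretrained(
        base_model,
        load_in_8bit=use_8bit,      # -q / --use-8bit
        torch_dtype=torch.float16,
        device_map="auto",
    )
    if lora_weights:                # -l / --lora-weights (optional)
        model = PeftModel.from_pretrained(
            model, lora_weights, torch_dtype=torch.float16
        )
    model.eval()
    return model, tokenizer
```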

Datasets

- wikitext: WikiText language-modeling dataset, typically scored by perplexity (see the sketch below)
- squad: SQuAD question-answering dataset, scored by F1 / exact match
- squadmini: a smaller subset of SQuAD (the default)
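
For the wikitext option, a common approach for causal LMs is sliding-window perplexity. A minimal sketch, assuming the Hugging Face datasets library and a (model, tokenizer) pair like the one loaded above; the repo's implementation may differ:

```python
# Sketch of stride-based perplexity on wikitext-2, following the
# standard Hugging Face recipe. Assumes `datasets`; eval.py may differ.
import torch
from datasets import load_dataset

def wikitext_perplexity(model, tokenizer, max_len=2048, stride=512):
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    enc = tokenizer("\n\n".join(test["text"]), return_tensors="pt")
    input_ids = enc.input_ids.to(model.device)
    seq_len = input_ids.size(1)

    nlls, n_scored, prev_end = [], 0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_len, seq_len)
        trg_len = end - prev_end        # tokens not scored in a prior window
        ids = input_ids[:, begin:end]
        labels = ids.clone()
        labels[:, :-trg_len] = -100     # mask the overlapping context
        with torch.no_grad():
            loss = model(ids, labels=labels).loss
        nlls.append(loss * trg_len)
        n_scored += trg_len
        prev_end = end
        if end == seq_len:
            break
    return torch.exp(torch.stack(nlls).sum() / n_scored).item()
```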

Example

```
eval.py --base-model decapoda-research/llama-7b-hf --lora-weights samwit/alpaca7B-lora --datasets squadmini
```

Comparison

As stated in issue #44, I compared different Alpaca 7B models on the SQuAD dataset:

| Fine-tuning dataset | Model | SQuAD (mini) F1 |
| --- | --- | --- |
| Original Alpaca | samwit/alpaca7B-lora | 34.63 |
| Cleaned Alpaca (Mar 27) | tloen/alpaca-lora-7b | 49.64 |
| Cleaned Alpaca (Mar 31) | yahma/alpaca-7b-lora | 55.229 |

On SQuAD "mini", the samwit model gets an average F1 score of 34.63, while the tloen model trained on the cleaned dataset gets an average F1 score of 49.04.
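
For context, the SQuAD F1 score is the token-overlap F1 between a predicted answer and the reference answers, averaged over questions. A minimal sketch of scoring with the standard SQuAD metric, assuming the Hugging Face evaluate library (the repo's script may compute the metric differently):

```python
# Sketch of scoring predictions with the standard SQuAD metric.
# Assumes the Hugging Face `evaluate` library; eval.py may differ.
# The question id and texts below are illustrative placeholders.
import evaluate

squad_metric = evaluate.load("squad")

predictions = [
    {"id": "q1", "prediction_text": "Denver Broncos"},
]
references = [
    {"id": "q1", "answers": {"text": ["Denver Broncos"], "answer_start": [177]}},
]

print(squad_metric.compute(predictions=predictions, references=references))
# -> {'exact_match': 100.0, 'f1': 100.0}
```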

At least on the surface, it appears the cleaning & curation we've been doing has helped significantly.

JACKHAHA363 commented 1 year ago

this is awesome!

ndvbd commented 11 months ago

That's great, though I believe the title should be changed to "Benchmark Alpaca-llama models" instead of "Benchmark llama models", since it can only be run on 'instruct' language models.