Benchmark llama models on Open Source Datasets

Closed: gururise closed this 1 year ago.

Usage

usage: eval.py [-h] -b BASE_MODEL [-l LORA_WEIGHTS] [-d {wikitext,squadmini,squad}] [-q]

options:
  -h, --help            show this help message and exit
  -b BASE_MODEL, --base-model BASE_MODEL
                        Choose the base model
  -l LORA_WEIGHTS, --lora-weights LORA_WEIGHTS
                        Choose the lora weights (optional)
  -d {wikitext,squadmini,squad}, --datasets {wikitext,squadmini,squad}
                        Choose Evaluation Dataset. [default = squadmini]
  -q, --use-8bit        Use 8-bit quant
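For reference, a minimal argparse setup that would produce the interface above could look like the following sketch; the actual eval.py may be organized differently.

import argparse

# Sketch of the CLI implied by the help text above (an assumption, not the
# actual eval.py source).
parser = argparse.ArgumentParser()
parser.add_argument("-b", "--base-model", required=True,
                    help="Choose the base model")
parser.add_argument("-l", "--lora-weights",
                    help="Choose the lora weights (optional)")
parser.add_argument("-d", "--datasets", default="squadmini",
                    choices=["wikitext", "squadmini", "squad"],
                    help="Choose Evaluation Dataset. [default = squadmini]")
parser.add_argument("-q", "--use-8bit", action="store_true",
                    help="Use 8-bit quant")
args = parser.parse_args()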
Datasets

Supported evaluation datasets: wikitext, squadmini, squad.

Example

eval.py --base-model decapoda-research/llama-7b-hf --lora-weights samwit/alpaca7B-lora --datasets squadmini
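Under the hood, the base model and optional LoRA adapter in this example are presumably combined along these lines (a sketch using transformers and peft; the exact loading code in eval.py is an assumption):

import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

base_model = "decapoda-research/llama-7b-hf"   # from --base-model
lora_weights = "samwit/alpaca7B-lora"          # from --lora-weights

tokenizer = LlamaTokenizer.from_pretrained(base_model)
model = LlamaForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,        # what -q / --use-8bit would toggle
    torch_dtype=torch.float16,
    device_map="auto",
)
# Apply the LoRA adapter on top of the frozen base weights
# (skipped when -l / --lora-weights is not given).
model = PeftModel.from_pretrained(model, lora_weights)
model.eval()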
Comparison

As stated in issue #44, I compared two different Alpaca 7B models on the SQuAD dataset:
On SQuAD "mini": the samwit model gets an average F1 score of 34.63, while the tloen model trained on the cleaned dataset gets an average F1 score of 49.04.

At least on the surface, it appears the cleaning and curation we've been doing have helped significantly.
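For context, the F1 numbers above are presumably the standard SQuAD token-overlap F1, computed per question and averaged. A minimal sketch, assuming eval.py follows the official SQuAD answer normalization:

import collections
import re
import string

def normalize(text):
    # Standard SQuAD answer normalization: lowercase, drop punctuation
    # and articles, collapse whitespace.
    text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def f1_score(prediction, ground_truth):
    # Token-level F1 between a predicted answer and a gold answer.
    pred_tokens = normalize(prediction).split()
    gt_tokens = normalize(ground_truth).split()
    common = collections.Counter(pred_tokens) & collections.Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)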
this is awesome!
That's great! I just believe the title should be changed to "Benchmark Alpaca-llama models" instead of "Benchmark llama models", since it can only be run on 'instruct' language models.