Here's a tidied up result:
We evaluate Llama on 100 examples each from SQuAD and the other datasets with the Open-evals framework, which extends OpenAI's Evals to other language models. We take the sentence immediately following the prompt as Llama's output and use accuracy as the metric to measure its performance.
For a model completion `a` and a reference list of correct answers `B`, the `include` metric is: `any([(a in b) for b in B])`
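A minimal sketch of how that accuracy could be computed, assuming the definition above (the function and variable names here are illustrative, not the actual Open-evals implementation):

```python
def include(completion: str, references: list[str]) -> bool:
    """A completion `a` counts as correct if it appears inside any reference answer in `B`."""
    return any(completion in ref for ref in references)

def accuracy(samples: list[tuple[str, list[str]]]) -> float:
    """Fraction of (completion, references) pairs judged correct by `include`."""
    return sum(include(a, B) for a, B in samples) / len(samples)

# Made-up example data, just to show the shape of the computation:
samples = [
    ("Paris", ["The capital of France is Paris", "Paris"]),  # correct: contained in a reference
    ("Lyon", ["Paris"]),                                      # incorrect
]
print(accuracy(samples))  # 0.5
```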
| model | squad(100) | crepe(100) include | born-first(122) match | anagrams(357) match | balance-chemical-equation(100) match | bigrams(200) match |
|---|---|---|---|---|---|---|
| alpaca-lora-7b | 0.88 | 0.2 | 0.5 | 0 | 0 | 0 |
| llama-7b | 0.63 | | | | | |
| gpt-3.5-turbo | 0.9 | 0.45 | 0.64 | 0.29 | 0.31 | 0.18 |
| text-davinci-003 | 0.87 | 0.19 | 0.49 | 0.199 | 0.07 | 0.595 |
| text-davinci-002 | 0.66 | | | | | |
| text-davinci-001 | 0.58 | | | | | |
| ada | 0.35 | | | | | |
Nice.
Hi! We're interested in how different LLMs perform on different datasets. Here are some results based on alpaca-lora-7b and other public OpenAI models: https://github.com/juncongmoo/pyllama/issues/45