tloen / alpaca-lora

Instruct-tune LLaMA on consumer hardware

Sharing evaluation results #124

Closed Ethan-Wu-juniper closed 4 months ago

Ethan-Wu-juniper commented 1 year ago

Hi! We're interested in how different LLMs perform on different datasets. Here are some results based on alpaca-lora-7b and some of OpenAI's public models: https://github.com/juncongmoo/pyllama/issues/45

Ethan-Wu-juniper commented 1 year ago

Here's a tidied-up summary of the results:

We evaluate LLaMA on 100 examples each from SQuAD and the other datasets with the Open-evals framework, which extends OpenAI's Evals to other language models. We take the sentence immediately following the prompt as LLaMA's output and use *include* accuracy as the metric to measure its performance.

For a model completion `a` and a reference list of correct answers `B`, include is: `any([(a in b) for b in B])`
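For reference, here is a minimal sketch of how the output extraction and the include accuracy described above could be computed. This is not the actual Open-evals code; the `generate` callback, the sentence-splitting heuristic, and the function names are placeholders for illustration only.

```python
from typing import Callable, List, Sequence, Tuple

def first_sentence_after_prompt(full_text: str, prompt: str) -> str:
    """Take the sentence immediately following the prompt as the model's answer."""
    continuation = full_text[len(prompt):] if full_text.startswith(prompt) else full_text
    # Cut at the first sentence terminator; fall back to the whole continuation.
    for sep in (".", "\n"):
        idx = continuation.find(sep)
        if idx != -1:
            return continuation[: idx + 1].strip()
    return continuation.strip()

def include(a: str, B: Sequence[str]) -> bool:
    """Include metric as defined above: completion a counts as correct
    if it appears inside any reference answer b in B."""
    return any(a in b for b in B)

def include_accuracy(
    samples: Sequence[Tuple[str, List[str]]],
    generate: Callable[[str], str],  # placeholder: returns the model's full text for a prompt
) -> float:
    """Average the include metric over (prompt, reference answers) pairs."""
    hits = 0
    for prompt, references in samples:
        output = generate(prompt)
        answer = first_sentence_after_prompt(output, prompt)
        hits += include(answer, references)
    return hits / len(samples)
```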

| model | squad(100) | crepe(100) include | born-first(122) match | anagrams(357) match | balance-chemical-equation(100) match | bigrams(200) match |
| --- | --- | --- | --- | --- | --- | --- |
| alpaca-lora-7b | 0.88 | 0.2 | 0.5 | 0 | 0 | 0 |
| llama-7b | 0.63 | | | | | |
| gpt-3.5-turbo | 0.9 | 0.45 | 0.64 | 0.29 | 0.31 | 0.18 |
| text-davinci-003 | 0.87 | 0.19 | 0.49 | 0.199 | 0.07 | 0.595 |
| text-davinci-002 | 0.66 | | | | | |
| text-davinci-001 | 0.58 | | | | | |
| ada | 0.35 | | | | | |
ndvbd commented 1 year ago

Nice.

1. Just to be sure: you take the base models without any further tuning, right?
2. Since you don't finetune, the prompt is very important, so maybe you can describe how you engineered it.