juncongmoo / pyllama

LLaMA: Open and Efficient Foundation Language Models

Share your evaluation results #45

jeff3071 commented 1 year ago

We evaluated LLaMA on 100 examples from the SQuAD dataset using the Open-evals framework, which extends OpenAI's Evals to other language models. We take the sentence immediately following the prompt as LLaMA's output and use *include* accuracy as the metric.

For a model completion `a` and a list of reference answers `B`, an example counts as correct if any reference answer appears in the completion: `any([(b in a) for b in B])`.
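
In plain Python, the check looks roughly like this (a minimal sketch with made-up helper names; the actual evaluation runs through the Open-evals framework):

```python
# Illustrative sketch of the "include" accuracy described above; the real
# evaluation runs through Open-evals, and these helper names are invented
# for the example.

def first_sentence(completion: str) -> str:
    """Take the sentence immediately following the prompt as the model's answer."""
    return completion.split(".")[0].strip()

def include_correct(completion: str, references: list[str]) -> bool:
    """Correct if any reference answer appears in the answer sentence."""
    answer = first_sentence(completion)
    return any(ref in answer for ref in references)

def include_accuracy(examples: list[tuple[str, list[str]]]) -> float:
    """examples is a list of (model_completion, reference_answers) pairs."""
    return sum(include_correct(c, refs) for c, refs in examples) / len(examples)
```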

| model | squad (100) |
| --- | --- |
| alpaca-lora-7b | 0.88 |
| llama-7b | 0.63 |
| gpt-3.5-turbo | 0.9 |
| text-davinci-003 | 0.87 |
| text-davinci-002 | 0.66 |
| text-davinci-001 | 0.58 |
| ada | 0.35 |
vo2021 commented 1 year ago

Thanks for the results. @jeff3071, can you share your evaluation code?

jeff3071 commented 1 year ago

You can refer to my code at https://github.com/open-evals/evals.

Ethan-Wu-juniper commented 1 year ago

Hi @jeff3071! Here are some results on other datasets from OpenAI's Evals, for your information:

| model | crepe (100), include | born-first (122), match | anagrams (357), match | balance-chemical-equation (100), match | bigrams (200), match |
| --- | --- | --- | --- | --- | --- |
| alpaca-lora-7b | 0.2 | 0.5 | 0 | 0 | 0 |
| gpt-3.5-turbo | 0.45 | 0.64 | 0.29 | 0.31 | 0.18 |
| text-davinci-003 | 0.19 | 0.49 | 0.199 | 0.07 | 0.595 |
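
For readers of the table: *include* is the same substring check described earlier in the thread, while *match*, assuming it behaves like the basic Match eval in OpenAI's Evals, is stricter and roughly requires the completion to start with a reference answer. A minimal sketch under that assumption:

```python
# Sketch of the "match" column, assuming it behaves like the basic Match eval
# in OpenAI's Evals (the completion must start with a reference answer); see
# the evals code for the exact logic.

def match_correct(completion: str, references: list[str]) -> bool:
    return any(completion.strip().startswith(ref) for ref in references)
```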