juncongmoo / pyllama

LLaMA: Open and Efficient Foundation Language Models

Share your evaluation results #45

jeff3071 commented 1 year ago

We evaluated LLaMA on 100 examples from the SQuAD dataset using the Open-evals framework, which extends OpenAI's Evals to other language models. We take the sentence immediately following the prompt as LLaMA's output and use *include* accuracy as the metric.

For a model completion `a` and a list of reference answers `B`, an example counts as correct if any reference answer appears in the completion: `any([(b in a) for b in B])`.
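
In plain Python, the check looks roughly like this (a minimal sketch with made-up helper names; the actual evaluation runs through the Open-evals framework):

```python
# Illustrative sketch of the "include" accuracy described above; the real
# evaluation runs through Open-evals, and these helper names are invented
# for the example.

def first_sentence(completion: str) -> str:
    """Take the sentence immediately following the prompt as the model's answer."""
    return completion.split(".")[0].strip()

def include_correct(completion: str, references: list[str]) -> bool:
    """Correct if any reference answer appears in the answer sentence."""
    answer = first_sentence(completion)
    return any(ref in answer for ref in references)

def include_accuracy(examples: list[tuple[str, list[str]]]) -> float:
    """examples is a list of (model_completion, reference_answers) pairs."""
    return sum(include_correct(c, refs) for c, refs in examples) / len(examples)
```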

| model | squad (100) |
| --- | --- |
| alpaca-lora-7b | 0.88 |
| llama-7b | 0.63 |
| gpt-3.5-turbo | 0.9 |
| text-davinci-003 | 0.87 |
| text-davinci-002 | 0.66 |
| text-davinci-001 | 0.58 |
| ada | 0.35 |
vo2021 commented 1 year ago

Thanks for the results. @jeff3071, can you share your evaluation code?

jeff3071 commented 1 year ago

You can refer to my code at https://github.com/open-evals/evals.

Ethan-Wu-juniper commented 1 year ago

Hi @jeff3071! Here are some results on other datasets from OpenAI's Evals, for your information:

| model | crepe (100), include | born-first (122), match | anagrams (357), match | balance-chemical-equation (100), match | bigrams (200), match |
| --- | --- | --- | --- | --- | --- |
| alpaca-lora-7b | 0.2 | 0.5 | 0 | 0 | 0 |
| gpt-3.5-turbo | 0.45 | 0.64 | 0.29 | 0.31 | 0.18 |
| text-davinci-003 | 0.19 | 0.49 | 0.199 | 0.07 | 0.595 |
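
For readers of the table: *include* is the same substring check described earlier in the thread, while *match*, assuming it behaves like the basic Match eval in OpenAI's Evals, is stricter and roughly requires the completion to start with a reference answer. A minimal sketch under that assumption:

```python
# Sketch of the "match" column, assuming it behaves like the basic Match eval
# in OpenAI's Evals (the completion must start with a reference answer); see
# the evals code for the exact logic.

def match_correct(completion: str, references: list[str]) -> bool:
    return any(completion.strip().startswith(ref) for ref in references)
```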