jeff3071 opened 1 year ago
Thanks for the result. @jeff3071 can you share your evaluation code?
You can refer to my code at https://github.com/open-evals/evals.
Hi @jeff3071! Here are some results from other OpenAI evals datasets, for your information:

| model | crepe (100) include | born-first (122) match | anagrams (357) match | balance-chemical-equation (100) match | bigrams (200) match |
|---|---|---|---|---|---|
| alpaca-lora-7b | 0.2 | 0.5 | 0 | 0 | 0 |
| gpt-3.5-turbo | 0.45 | 0.64 | 0.29 | 0.31 | 0.18 |
| text-davinci-003 | 0.19 | 0.49 | 0.199 | 0.07 | 0.595 |
We evaluate LLaMA on 100 examples from the SQuAD dataset with the Open-evals framework, which extends OpenAI's Evals to other language models. We treat the sentence immediately following the prompt as LLaMA's output and use `include` accuracy as the metric to measure its performance.
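For reference, a minimal sketch of how the `include` metric described above could be computed: take the first sentence of each completion as the answer and count it correct if the expected string appears in it. The `first_sentence` and `include_accuracy` helpers and the sample data are illustrative assumptions, not the actual Open-evals implementation.

```python
import re

def first_sentence(completion: str) -> str:
    # Treat the sentence immediately following the prompt as the model's answer.
    parts = re.split(r"(?<=[.!?])\s", completion.strip(), maxsplit=1)
    return parts[0]

def include_accuracy(samples):
    # samples: list of (model_completion, expected_answer) pairs.
    # A sample counts as a hit if the expected answer appears (case-insensitively)
    # anywhere in the first sentence of the completion.
    hits = sum(
        1
        for completion, expected in samples
        if expected.lower() in first_sentence(completion).lower()
    )
    return hits / len(samples)

# Made-up examples for illustration only:
samples = [
    ("The capital of France is Paris. It is a large city.", "Paris"),
    ("Berlin is the answer.", "Paris"),
]
print(include_accuracy(samples))  # 0.5
```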