Here's a tidied up result:
We evaluate Llama on 100 examples each from SQuAD and the other datasets with the Open-evals framework, which extends OpenAI's Evals to other language models. We take the sentence immediately following the prompt as Llama's output and use accuracy as the metric to measure its performance.
For a model completion `a` and a reference list of correct answers `B`, the `include` metric is: `any([(a in b) for b in B])`
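A minimal sketch of how that accuracy could be computed, assuming the definition above (the function and variable names here are illustrative, not the actual Open-evals implementation):

```python
def include(completion: str, references: list[str]) -> bool:
    """A completion `a` counts as correct if it appears inside any reference answer in `B`."""
    return any(completion in ref for ref in references)

def accuracy(samples: list[tuple[str, list[str]]]) -> float:
    """Fraction of (completion, references) pairs judged correct by `include`."""
    return sum(include(a, B) for a, B in samples) / len(samples)

# Made-up example data, just to show the shape of the computation:
samples = [
    ("Paris", ["The capital of France is Paris", "Paris"]),  # correct: contained in a reference
    ("Lyon", ["Paris"]),                                      # incorrect
]
print(accuracy(samples))  # 0.5
```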
| model | squad(100) | crepe(100) include | born-first(122) match | anagrams(357) match | balance-chemical-equation(100) match | bigrams(200) match |
|---|---|---|---|---|---|---|
| alpaca-lora-7b | 0.88 | 0.2 | 0.5 | 0 | 0 | 0 |
| llama-7b | 0.63 | | | | | |
| gpt-3.5-turbo | 0.9 | 0.45 | 0.64 | 0.29 | 0.31 | 0.18 |
| text-davinci-003 | 0.87 | 0.19 | 0.49 | 0.199 | 0.07 | 0.595 |
| text-davinci-002 | 0.66 | | | | | |
| text-davinci-001 | 0.58 | | | | | |
| ada | 0.35 | | | | | |
Nice.
Hi! We're interested in how different LLMs perform on different datasets. Here are some results based on alpaca-lora-7b and other public OpenAI models: https://github.com/juncongmoo/pyllama/issues/45