HuangOwen opened this issue 1 year ago
Same issue here. Could you share your args for generate?
Hi, sorry for the late reply. My args are:
temperature = 0.8, top_p = 0.95, max_seq_len = 512, max_batch_size = 1
The few-shot prompt is from https://github.com/kojima-takeshi188/zero_shot_cot
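For context, a minimal sketch of how these settings map onto a generation call, assuming the generator API of facebookresearch/llama's example.py; generator, prompt, and the max_gen_len value are illustrative placeholders rather than code from this thread:

# Sketch only: `generator` is assumed to be loaded as in the reference example.py
# (max_seq_len=512, max_batch_size=1); `prompt` is the 8-shot CoT prompt built from
# the kojima-takeshi188/zero_shot_cot exemplars.
results = generator.generate(
    [prompt],            # batch of one question
    max_gen_len=256,     # leave enough room for the reasoning chain
    temperature=0.8,
    top_p=0.95,
)
print(results[0])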
Here are my args, but I only get 1.5% accuracy on GSM8K. (BTW, I am using the model from Hugging Face.)
response = self.model.generate(input_ids=input_ids,
                               attention_mask=attention_mask,
                               max_new_tokens=512,
                               top_p=0.95,
                               temperature=0.8)
response = self.tokenizer.decode(response[0][input_ids.shape[1]:],
                                 skip_special_tokens=True)
And there is a lot of repetition in the response:
Q: Alisa biked 12 miles per hour for 4.5 hours. Stanley biked at 10 miles per hour for 2.5 hours. How many miles did Alisa and Stanley bike in total?
A:Alisa biked 12 miles per hour for 4.5 hours. So she biked 12 * 4.5 = 54 miles. Stanley biked 10 miles per hour for 2.5 hours. So he biked 10 * 2.5 = 25 miles. So in total they biked 54 + 25 = 79 miles. The answer is 79.
Q: There are 100 students in the class. 20 students are absent. How many students are in the class?
A: There are 100 students originally. 20 students are absent. So 100 - 20 = 80. The answer is 80.
Q: There are 100 students in the class. 20 students are absent. How many students are in the class?
A: There are 100 students originally. 20 students are absent. So 100 - 20 = 80. The answer is 80.
Q: There are 100 students in the class. 20 students are absent. How many students are in the class?
A: There are 100 students originally. 20 students are absent. So 100 - 20 = 80. The answer is 80.
The Hugging Face model is no different from Meta's LLaMA model. Are you using few-shot CoT or zero-shot CoT? Repetition is not a problem as long as you have a correct answer extractor (see https://github.com/kojima-takeshi188/zero_shot_cot/blob/main/utils.py); I also noticed this repetition effect with LLaMA. Vanilla LLaMA has not been instruction-tuned and can only do completion.
Thanks for your quick response. I modified the answer_cleansing function:
def answer_cleansing(args, pred):
    print("pred_before : " + pred)
    if args.method in ("few_shot", "few_shot_cot"):
        pred = pred.lower()
        split = args.direct_answer_trigger_for_fewshot.lower()
        preds = pred.split(split)
        answer_flag = True if len(preds) > 1 else False
        if answer_flag:
            pred = preds[1]
        else:
            pred = preds[-1]
    ...
The accuracy now looks normal, at about 7%.
Yes, I am also getting ~7%, which is still a bit lower than Meta's reported number. If you have any ideas for improving the 8-shot result to 11%, please let me know.
You can modify the answer cleaning like this:
import re

# ANSWER_TRIGGER and INVALID_ANS are module-level constants defined elsewhere;
# the trigger must match the answer phrasing used in the few-shot exemplars.
def clean_answer(model_pred):
    model_pred = model_pred.lower()
    preds = model_pred.split(ANSWER_TRIGGER.lower())
    answer_flag = True if len(preds) > 1 else False
    if answer_flag:
        # Pick the first answer after the trigger
        pred = preds[1]
    else:
        # No trigger found: fall back to the last number
        pred = preds[-1]
    pred = pred.replace(",", "")
    pred = [s for s in re.findall(r'-?\d+\.?\d*', pred)]
    if len(pred) == 0:
        return INVALID_ANS
    if answer_flag:
        # choose the first element in the list
        pred = pred[0]
    else:
        # choose the last element in the list
        pred = pred[-1]
    # (For arithmetic tasks) if the number ends with a period, strip it
    if pred[-1] == ".":
        pred = pred[:-1]
    return pred
Taking the first answer after the trigger as the model's answer will help you get to ~11.41% accuracy. BTW, I used a modified tokenizer from the Alpaca team.
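As a quick sanity check, this is how the function behaves on an abridged version of the repetitive completion shown earlier; the ANSWER_TRIGGER and INVALID_ANS values below are assumptions, since only the names appear in the snippet:

ANSWER_TRIGGER = "The answer is"   # assumed trigger; must match the few-shot exemplars
INVALID_ANS = "[invalid]"          # assumed sentinel value

model_pred = (
    "Alisa biked 12 miles per hour for 4.5 hours. So she biked 12 * 4.5 = 54 miles. "
    "Stanley biked 10 miles per hour for 2.5 hours. So he biked 10 * 2.5 = 25 miles. "
    "So in total they biked 54 + 25 = 79 miles. The answer is 79.\n"
    "Q: There are 100 students in the class. 20 students are absent. How many students are in the class?\n"
    "A: There are 100 students originally. 20 students are absent. So 100 - 20 = 80. The answer is 80."
)

print(clean_answer(model_pred))  # -> "79": only the first answer after the trigger counts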
Thanks for the hints. I tried this new clean_answer() script, but the accuracy is still ~7% for llama-7b 8-shot. Could you please share more about the generation? Are you using https://github.com/kojima-takeshi188/zero_shot_cot ?
My generation args are:
generate_kwargs = dict(max_new_tokens=512, top_p=0.95, temperature=0.8)
Here is my evaluation code which is built in an FL framework: https://github.com/alibaba/FederatedScope/blob/dev/llm/federatedscope/llm/eval/eval_for_gsm8k/eval.py
Have you tried modifying the tokenizer following the Alpaca team?
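For reference, a minimal sketch of how those kwargs get wired into a Hugging Face generate call; model, tokenizer, and prompt are placeholders, and do_sample=True is an addition here, needed for temperature/top_p to take effect:

# Sketch, not the exact evaluation code: transformers ignores temperature/top_p
# unless do_sample=True is set explicitly.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
generate_kwargs = dict(max_new_tokens=512, top_p=0.95, temperature=0.8, do_sample=True)
output = model.generate(**inputs, **generate_kwargs)
completion = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)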
Same issue here. Thanks for your information! How about performance on 13b, 30b and 65b?
Hi @rayrayraykk, thanks for the information you shared!
But there is one thing I'm a little confused about. By the modified tokenizer from the Alpaca team, are you referring to tokenizer weights that were modified through Alpaca's instruction fine-tuning? I understand that if we use the parameters of the fine-tuned model, it wouldn't be the result of the original LLaMA model, right?
If we do not use the modified model, could the 8-shot accuracy on GSM8K be consistent with what's reported in the paper? Thx!
I use the LLaMA tokenizer with some special tokens added, rather than a tokenizer whose weights have been modified through the Alpaca model.
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    cache_dir=cache_dir,
    model_max_length=tok_len,
    padding_side="right",
    use_fast=False,
)
special_tokens = dict()
if tokenizer.pad_token is None:
    special_tokens["pad_token"] = DefaultToken.PAD_TOKEN.value
if tokenizer.eos_token is None:
    special_tokens["eos_token"] = DefaultToken.EOS_TOKEN.value
if tokenizer.bos_token is None:
    special_tokens["bos_token"] = DefaultToken.BOS_TOKEN.value
if tokenizer.unk_token is None:
    special_tokens["unk_token"] = DefaultToken.UNK_TOKEN.value
num_new_tokens = tokenizer.add_special_tokens(special_tokens)
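One caveat worth adding (not from the snippet above): if add_special_tokens actually introduces new tokens, the model's embedding matrix has to be resized to match. A minimal sketch, following what the Alpaca training code does; model is a placeholder for the loaded causal LM:

# Only needed when num_new_tokens > 0; otherwise the vocab already covers them.
if num_new_tokens > 0:
    model.resize_token_embeddings(len(tokenizer))
    # As in the Alpaca training code, initialize the new rows with the mean
    # of the existing input/output embeddings.
    input_emb = model.get_input_embeddings().weight.data
    output_emb = model.get_output_embeddings().weight.data
    input_emb[-num_new_tokens:] = input_emb[:-num_new_tokens].mean(dim=0, keepdim=True)
    output_emb[-num_new_tokens:] = output_emb[:-num_new_tokens].mean(dim=0, keepdim=True)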
Thx! Have you been able to align the results on the 13B model or other versions?
Sry, I've not tried the 13B model.
Hi @rayrayraykk @HuangOwen, I just followed this repo: https://github.com/kojima-takeshi188/zero_shot_cot to evaluate the model from Hugging Face, but I only got 5% accuracy with 7B in zero-shot mode. My param settings are:
- just take the maximum-probability output; no temperature or top_k set
- for zero-shot, I set max_new_tokens to 16; I don't think it needs a 512 sequence length for zero-shot. Is there anything that would help get the accuracy closer to the paper? Thanks for your time.
I think the problem is max_new_tokens, which should be at least 256 even in the zero-shot setting (LLaMA needs room to lay out and validate its reasoning). In addition, if you add a CoT prefix to the prompt ("Let's think step by step") and do not add any few-shot exemplars, ~5% accuracy seems OK for 7B.
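For reference, the two-stage zero-shot-CoT recipe from that repo looks roughly like this; generate() is a stand-in for whatever decoding call you use, and the extraction trigger is paraphrased from the zero_shot_cot code:

def zero_shot_cot(question, generate):
    # Stage 1: elicit the reasoning chain. This needs a few hundred new tokens,
    # not 16, otherwise the chain (and the final number) gets truncated.
    prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = generate(prompt, max_new_tokens=256)
    # Stage 2: extract a short final answer from the chain.
    prompt2 = prompt + reasoning + " Therefore, the answer (arabic numerals) is"
    return generate(prompt2, max_new_tokens=16)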
To be more specific, the setting in the original LLaMA paper is few-shot CoT (8-shot, I guess). You cannot reproduce the results or get the accuracy closer with a zero-shot setting.
@HuangOwen, so you can get 11% accuracy with 8-shot CoT? I will try it now. In the LLaMA paper there are two types of results, GSM8K and GSM8K+maj1@k, and the second uses k samples with k = 40. Does that mean 40-shot? Do you have any idea?
Thanks for your quick reply; I have another question in the other dialog box.
Yes, you can achieve ~11% accuracy with 8-shot CoT. Remember to add the special tokens of Alpaca, as mentioned above. maj1@k indicates "majority voting": you generate an answer for the same question k times and use majority voting to select the final answer.
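A rough sketch of what maj1@k amounts to in code, assuming the clean_answer/INVALID_ANS definitions above and a hypothetical sample_completion(question) helper that returns one sampled CoT completion:

from collections import Counter

def majority_vote_answer(question, k=40):
    # Sample k independent completions and let them vote on the final number.
    answers = []
    for _ in range(k):
        pred = clean_answer(sample_completion(question))  # sample_completion is hypothetical
        if pred != INVALID_ANS:
            answers.append(pred)
    if not answers:
        return INVALID_ANS
    return Counter(answers).most_common(1)[0][0]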
what is DefaultToken.PAD_TOKEN.value here?
Hi @HuangOwen, do you have the script you used to reproduce llama-1 results on GSM8k? If you have done something similar for llama-2, it would be great! Thanks
I'm also having issues with this. So far my performance on GSM8K using llama-7B is about 5%, very far from 11%. I'm using few-shot prompting with 4 exemplars (the same ones found in the repo above).
Model: decapoda-research/llama-7b-hf, temperature=0.8, num_beams=4
I'm using gpt-3.5-turbo to extract the numeric value from the model's response, and it works perfectly, so answer extraction is not the issue.
Also, I'm using Windows and getting the following warning: "do_sample is set to False. However, temperature is set to 0.8 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature."
- Turn off beam search and use do_sample = True.
- A lower temperature actually means a better result; you could try temperature = 0.1, for example.
- gpt-3.5-turbo for extracting the answer is unnecessary; you can just take the last number (fraction, int, or float) as the final output (see the sketch after this list).
- 8-shot looks like the classical setting; why do you use 4-shot?
- In my setting, zero-shot CoT with llama-1 on the GSM8K test set is approximately 7%, while 8-shot CoT is 12%.
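Here is a self-contained sketch of the "take the last number" extractor from point 3; the fraction-handling regex is a guess at what was meant:

import re

def last_number(text):
    # Matches fractions (e.g. 3/4), floats, and ints; returns the last one found.
    matches = re.findall(r"-?\d+/\d+|-?\d+\.?\d*", text.replace(",", ""))
    return matches[-1].rstrip(".") if matches else None

print(last_number("So in total they biked 54 + 25 = 79 miles. The answer is 79."))  # -> "79"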
Thanks for the reply!
My tests are still running. I've sampled ~200 answers, which give ~5% on the 4-shot prompt; I seriously doubt it will reach 12% once it finishes. Note that for each question I randomly sample 4 exemplars to compose the 4-shot prompt. My max_new_tokens is set to 256.
Thanks for the reply!
- I will try that.
- Is that true for reasoning tasks?
- llama-7b is messy and outputs crazy stuff, so I'm using the GPT API just to make sure.
- Memory limits; I'm running on an RTX 3060. I'll try increasing to 6.
- Can you give an example of a prompt for both cases? For zero-shot CoT, do you just add "let's think in steps" after the question?
max_new_tokens of 512 may be better, even though math answers are usually short.
I agree that reasoning tasks fit a lower temperature, but in general settings (QA, creative writing) the temperature for llama/vicuna is 0.7.
A prompt for zero-shot CoT looks like: "How many bolts in total does it take? Let's think step by step."
A prompt for 8-shot CoT looks like: "USER: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now? ASSISTANT: Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9. The answer is 9. USER: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot? ASSISTANT: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5. USER: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny? ASSISTANT: Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8. The answer is 8. USER: There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room? ASSISTANT: There were originally 9 computers. For each of 4 days, 5 more computers were added. So 5 * 4 = 20 computers were added. 9 + 20 is 29. The answer is 29. USER: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total? ASSISTANT: Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. The answer is 39. USER: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday? ASSISTANT: Michael started with 58 golf balls. After losing 23 on tuesday, he had 58 - 23 = 35. After losing 2 more, he had 35 - 2 = 33 golf balls. The answer is 33. USER: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today? ASSISTANT: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. The answer is 6. USER: Olivia has $23. She bought five bagels for $3 each. How much money does she have left? ASSISTANT: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The answer is 8. USER: A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take? ASSISTANT:" (just ignore the USER/ASSISTANT tags and replace them with your own role setting, like Q and A, for example)
For point 3, llama-7b is prone to output repetition, and you can actually add repetition_penalty during generation.
Finally, I recommend using vLLM to accelerate your generation, although the accuracy may get a bit lower.
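For anyone assembling that few-shot prompt programmatically, here is a minimal sketch (only two of the eight exemplars are shown for brevity, and the Q/A role tags are illustrative):

EXEMPLARS = [
    ("Shawn has five toys. For Christmas, he got two toys each from his mom and dad. "
     "How many toys does he have now?",
     "Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is "
     "4 more toys. 5 + 4 = 9. The answer is 9."),
    ("If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in "
     "the parking lot?",
     "There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5."),
    # ... six more exemplars in the full 8-shot prompt
]

def build_prompt(question, exemplars=EXEMPLARS):
    # Concatenate the exemplars and end with the test question, leaving "A:" open
    # for the model to complete.
    shots = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in exemplars)
    return shots + f"Q: {question}\nA:"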
Based on your suggestions I'm running an experiment with temperature=0.1, repetition_penalty=1.2, do_sample=True and top_p=0.75 (num_beams is commented out). I'm also using 8-shot prompting with max_new_tokens = 512.
Well... it is prepending all 8 prompt examples before the main answer, and sometimes it adds stuff like "### ADVANCED..." after the answer. I can't see how one would extract the answer from this mess; even ChatGPT may suffer, since the prompts already contain "the answer is x". After the first 100 answers I ran the accuracy script and it got only 1 correct... so I don't really know what I'm doing wrong.
Any ideas? Commenting out num_beams seems to free some memory, since I can now use the 8-shot prompt (I'm not sure what this parameter does).
What about using zero_shot_cot and seeing the result?
I'm running some experiments. I found that using top_k=1 with top_p=0 gives a more stable model, so I fixed those and I'm varying max_new_tokens and n in n-shot prompting. So far the best performance is 5.2% with n=4 (I'm running n=8, not seeing too much improvement).
What is the tokenizer thing that is discussed here? My tokenizer setup is just:
tokenizer = LLaMATokenizer.from_pretrained("decapoda-research/llama-7b-hf")
I got 7.5% with:
temperature=0.1, top_p=0, top_k=1, max_new_tokens=1*512, repetition_penalty=1.2
How can I set up the special tokenizer mentioned here? What is DefaultToken?
Hi, if I understand correctly, do we have to take the last number in the generated text and assume that to be the answer? Or rather, do we check the last element of the list generated by pred = [s for s in re.findall(r'-?\d+\.?\d*', pred)] and strip a trailing period if it has one?
I still can't find the prompt anywhere; does anyone know what the prompt is?
@surya-narayanan You can do a simple regex search, as done in lm-evaluation-harness.
For the prompt, you can refer to the 8-shot prompt here: https://github.com/EleutherAI/lm-evaluation-harness/blob/ae79b1217aad7738b91e88a4017c86a5d5e45aa7/lm_eval/tasks/gsm8k/gsm8k-cot.yaml#L8
Hi, I am trying to reproduce the LLaMA results on the GSM8K dataset. I basically follow this repo: https://github.com/kojima-takeshi188/zero_shot_cot. However, the performance across LLaMA-7B/13B/30B is far from the paper's results; I can only get 7.13% for 8-shot with LLaMA-7B. May I know if anyone has reproduced the results, and what prompt are you using?