Closed: Cheungki closed this issue 9 months ago
Thanks for your interest!
Below are some questions to help me understand what is going on:
As for "data splits in cherry_data_v1 are not exactly 5%, 10%, or 15%": yes, that is correct. It is caused by our previous filtering mechanism. That is why in our paper we write "approximately 5% data" or similar, and report the exact sample counts in the paper.
Thx for your quick reply.
My training hyperparameters are: batch_size 128, learning_rate 2e-5, num_train_epochs 3, warmup_ratio 0.03, max_length 2048. Due to hardware limitations, I use fp16 rather than bf16 in my training.

For evaluation I run the following in lm-evaluation-harness:

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --main_process_port 29999 -m lm_eval --model hf --model_args pretrained={sft_model_path} --tasks mmlu,ai2_arc,hellaswag,truthfulqa --batch_size 1 --output_path {log_path}

Thanks for your reply~
ic, I'll try this later. Thx again.
Thanks for your reply~
- Since you are using a different codebase, there is a good chance that different prompts will lead to different performance in lm-evaluation-harness, as it does not support customized prompts. As mentioned, for llama2 models we used the vicuna prompt for training.
- I think the settings are the same as ours.
- N/A
- We are sorry for the misunderstanding. In this table, they are not using cherry_data_v1 (whose IFD scores are calculated based on llama1), but data whose IFD scores are calculated on llama2-7b or llama2-13b. So there will be gaps if you use the cherry_data_v1 data. The llama2-based IFD scores were also released recently; please check: Alpaca llama2 7b, Alpaca llama2 13b, WizardLM70k llama2 7b, WizardLM70k llama2 13b. You might need to sort the data by yourself.
- It seems that all of the testing scripts you are using are zero-shot; however, according to the open_llm_leaderboard site, most of the benchmarks use few-shot settings. I think this is the main reason.
To conclude, the main reason is that you are not using the few-shot settings specified by the open_llm_leaderboard. Besides, it is better to use the data whose IFD scores are calculated on llama2-7b or llama2-13b.
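For the few-shot point, something like the following should get closer to the leaderboard setup (a sketch, not verified against a specific harness version; the shot counts are the ones the Open LLM Leaderboard v1 documents, i.e. ARC 25-shot, HellaSwag 10-shot, MMLU 5-shot, TruthfulQA 0-shot, and {sft_model_path}/{log_path} are the same placeholders as in your command):

```shell
# lm-eval applies --num_fewshot globally, so run one task per invocation:
accelerate launch -m lm_eval --model hf --model_args pretrained={sft_model_path} --tasks ai2_arc    --num_fewshot 25 --batch_size 1 --output_path {log_path}
accelerate launch -m lm_eval --model hf --model_args pretrained={sft_model_path} --tasks hellaswag  --num_fewshot 10 --batch_size 1 --output_path {log_path}
accelerate launch -m lm_eval --model hf --model_args pretrained={sft_model_path} --tasks mmlu       --num_fewshot 5  --batch_size 1 --output_path {log_path}
accelerate launch -m lm_eval --model hf --model_args pretrained={sft_model_path} --tasks truthfulqa --num_fewshot 0  --batch_size 1 --output_path {log_path}
```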
hi, I notice that the new dataset's format doesn't seem to match the one in this repository.
The format in the repository is
The format of the new dataset is
{
  "loss": [3.750539541244507, 1.551876187324524, 2.1659064292907715, 1.1238280534744263],
  "ppl": [42.544029235839844, 4.720317840576172, 8.722504615783691, 3.0766091346740723]
}
How should I use the new dataset?
hi, may I ask if you have successfully reproduced a similar result?
Firstly, thanks for your interest in our work! Sorry for the inconvenience; the code for this part is in another repo, Reflection-Tuning. We expanded the previous loss-only IFD values to loss/ppl and IFD/rIFD values.
temp_data_i['ppl'] = [ppl_ins_alone, ppl_out_alone, ppl_ins_condition, ppl_out_condition]
temp_data_i['loss'] = [loss_ins_alone, loss_out_alone, loss_ins_condition, loss_out_condition]
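Given that field ordering, the IFD and reversed-IFD scores can presumably be recovered per sample with something like the following sketch (assuming IFD is the conditioned-to-standalone loss ratio as defined in the Cherry LLM paper, and rIFD is the analogous ratio on the instruction side; the function name is my own):

```python
def ifd_scores(sample):
    """Derive IFD and reversed-IFD from the released per-sample losses.

    Assumed field layout (from the repo snippet):
      loss = [ins_alone, out_alone, ins_condition, out_condition]
    IFD  = loss(output | instruction) / loss(output alone)
    rIFD = loss(instruction | output) / loss(instruction alone)
    """
    ins_alone, out_alone, ins_cond, out_cond = sample["loss"]
    return out_cond / out_alone, ins_cond / ins_alone

# Example with the values shown earlier in this thread:
sample = {
    "loss": [3.750539541244507, 1.551876187324524,
             2.1659064292907715, 1.1238280534744263],
}
ifd, rifd = ifd_scores(sample)

# To "sort the data by yourself" (the paper selects high-IFD samples,
# typically keeping only those with IFD < 1):
# ranked = sorted(dataset, key=lambda s: ifd_scores(s)[0], reverse=True)
```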
Nice work!
I'm trying to SFT Llama 2 on cherry_data_v1 under the same settings as this repo. Also, following the settings in your original paper, I use the lm-evaluation-harness framework to evaluate the fine-tuned models on four benchmarks: mmlu, ai2_arc, hellaswag, truthfulqa. But there is a slight performance gap between my results and what you reported in this repo. What might lead to this? Or could you share the scripts you used in lm-evaluation-harness for evaluation?
BTW, the data splits in cherry_data_v1 are not exactly 5%, 10%, or 15% of the original Stanford-Alpaca dataset.