tianyi-lab / Cherry_LLM

[NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other models

Evaluation reproducibility on benchmarks #18

Closed · Cheungki closed this issue 9 months ago

Cheungki commented 9 months ago

Nice work!

I'm trying to SFT Llama 2 on cherry_data_v1 under the same settings as this repo. Following the settings in your original paper, I use the lm-evaluation-harness framework to evaluate the fine-tuned models on four benchmarks: mmlu, ai2_arc, hellaswag, and truthfulqa. However, there is a slight performance gap between my results and what you report in this repo.

What might cause this? Or could you share the scripts you used in lm-evaluation-harness for evaluation?

BTW, the data splits in cherry_data_v1 are not exactly 5%, 10%, or 15% of the original Stanford-Alpaca data.

MingLiiii commented 9 months ago

Thanks for your interest!

Here are some questions to help me understand what is going on:

  1. Which codebase are you using for SFT?
  2. Can you share your exact training settings?
  3. How large exactly is the performance gap?
  4. I don't think we report performance for Llama 2 fine-tuned on cherry_data_v1, so which set of results are you comparing your models with?
  5. Previously, one user was indeed using incorrect scripts. Can you send me the scripts you used in lm-evaluation-harness for evaluation?

As for the data splits in cherry_data_v1 not being exactly 5%, 10%, or 15%: yes, indeed. This is caused by our previous filtering mechanism. That is why our paper says "approximately 5% data" and the like, and reports the exact data counts.

Cheungki commented 9 months ago

Thx for your quick reply.

  1. I used llama-factory for SFT.
  2. I trained the three models under the same settings as yours: batch_size 128, learning_rate 2e-5, num_train_epochs 3, warmup_ratio 0.03, max_length 2048. Due to hardware limitations, I used fp16 rather than bf16 for training (see the sketch after this list).
  3. As for the performance gap, please check the attached files: cherry_5percent.json, cherry_10percent.json, cherry_15percent.json.
  4. I'm comparing mine with the results you report here.
  5. I used `CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --main_process_port 29999 -m lm_eval --model hf --model_args pretrained={sft_model_path} --tasks mmlu,ai2_arc,hellaswag,truthfulqa --batch_size 1 --output_path {log_path}` in lm-evaluation-harness for evaluation.
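
For reference, here is a minimal sketch of how these hyperparameters map onto Hugging Face `TrainingArguments`. The per-device batch size and gradient-accumulation split are assumptions chosen to reach an effective batch size of 128 on 4 GPUs; only the effective values come from the settings above, and this is not the exact llama-factory configuration.

```python
from transformers import TrainingArguments

# Assumed decomposition: 4 GPUs x per-device batch 4 x grad accumulation 8 = effective batch size 128.
training_args = TrainingArguments(
    output_dir="./llama2-7b-cherry-sft",   # placeholder path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=3,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",            # assumption; Alpaca-style recipes commonly use cosine decay
    fp16=True,                             # bf16 in the original setting; fp16 here due to hardware limits
    logging_steps=10,
    save_strategy="epoch",
)
# Note: max_length 2048 is applied at tokenization time (e.g., a cutoff/model_max_length setting),
# not through TrainingArguments.
```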
MingLiiii commented 9 months ago

Thanks for your reply~

  1. Since you are using a different codebase, there is a good chance that a different training prompt leads to different performance under lm-evaluation-harness, as it does not support customized prompts. As mentioned, for llama2 models we used the Vicuna prompt for training.
  2. I think the settings are the same as ours.
  3. N/A
  4. We are sorry for the misunderstanding. The models in that table are not trained on cherry_data_v1 (whose IFD scores are calculated with llama1); their IFD scores are calculated with llama2-7b or llama2-13b, so there will be gaps if you use cherry_data_v1. The llama2-based IFD scores were also released recently; please check: Alpaca llama2 7b, Alpaca llama2 13b, WizardLM70k llama2 7b, WizardLM70k llama2 13b. You may need to sort the data yourself.
  5. It seems that all of your evaluation runs are zero-shot; however, according to the open_llm_leaderboard site, most of these benchmarks are evaluated few-shot. I think this is the main reason.

To conclude, the main reason is that you are not using the few-shot settings from the open_llm_leaderboard. Besides, it is better to use the data whose IFD scores are calculated on llama2-7b or llama2-13b. A sketch of a leaderboard-style few-shot evaluation is below.
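
For reference, a hedged sketch of an evaluation run matching leaderboard-style few-shot settings. The shot counts (25 for ARC, 10 for HellaSwag, 5 for MMLU, 0 for TruthfulQA) follow the Open LLM Leaderboard convention, and the task names assume lm-evaluation-harness v0.4-style naming; neither comes from this repo's scripts.

```bash
# Hypothetical sketch: one harness invocation per benchmark, each with its leaderboard shot count,
# since a single --num_fewshot value would otherwise apply to every task in a comma-separated list.
MODEL=path/to/sft_model   # placeholder for {sft_model_path}
OUT=path/to/logs          # placeholder for {log_path}

for pair in "arc_challenge 25" "hellaswag 10" "mmlu 5" "truthfulqa_mc2 0"; do
  set -- $pair
  CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --main_process_port 29999 -m lm_eval \
    --model hf \
    --model_args pretrained=${MODEL} \
    --tasks $1 \
    --num_fewshot $2 \
    --batch_size 1 \
    --output_path ${OUT}/$1
done
```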

Cheungki commented 9 months ago

ic, I'll try this later. Thx again.

daidaiershidi commented 6 days ago

Hi, I notice that the format of the new datasets doesn't seem to match the one in this repository.

The format in the repository is

https://github.com/tianyi-lab/Cherry_LLM/blob/c02cd9d0b7910fb6ad720a3770da50e5c501afdb/cherry_seletion/data_analysis.py#L154

The format of the new dataset is

{ "loss": [ 3.750539541244507, 1.551876187324524, 2.1659064292907715, 1.1238280534744263 ], "ppl": [ 42.544029235839844, 4.720317840576172, 8.722504615783691, 3.0766091346740723 ] }

How should I use the new dataset?

daidaiershidi commented 6 days ago

> ic, I'll try this later. Thx again.

Hi, may I ask whether you have successfully reproduced similar results?

MingLiiii commented 6 days ago

Firstly, thanks for your interest in our work! Sorry for the inconvenience; the code for this part is in another repo, Reflection-Tuning. We expanded the previous loss-only IFD values to loss/ppl and IFD/rIFD values:

temp_data_i['ppl'] = [ppl_ins_alone,ppl_out_alone,ppl_ins_condition,ppl_out_condition]
temp_data_i['loss'] = [loss_ins_alone,loss_out_alone,loss_ins_condition,loss_out_condition]
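
For illustration, here is a minimal sketch of how one might turn these four-element lists into IFD and reverse-IFD scores and then select the cherry data. The field order follows the two lines above; the loss-ratio form of the scores, the file name, and the 5% cutoff are assumptions, so please confirm the exact formula against the selection code in the repos.

```python
import json

# Placeholder file name; assumes a JSON list of per-sample dicts in the new format,
# aligned by index with the corresponding instruction-tuning data (e.g., Alpaca).
with open("alpaca_llama2_7b_analysis.json") as f:
    scores = json.load(f)

ranked = []
for idx, sample in enumerate(scores):
    # Order taken from the two lines above:
    # loss = [loss_ins_alone, loss_out_alone, loss_ins_condition, loss_out_condition]
    loss_ins_alone, loss_out_alone, loss_ins_condition, loss_out_condition = sample["loss"]

    # IFD: conditioned loss of the response divided by its standalone loss.
    ifd = loss_out_condition / loss_out_alone
    # Reverse IFD (rIFD): the same ratio computed on the instruction side.
    rifd = loss_ins_condition / loss_ins_alone
    ranked.append((idx, ifd, rifd))

# Higher IFD means a harder, more informative sample; keep roughly the top 5% of indices
# and use them to select the corresponding samples from the original instruction data.
ranked.sort(key=lambda t: t[1], reverse=True)
cherry_indices = [idx for idx, _, _ in ranked[: int(0.05 * len(ranked))]]
```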