tianyi-lab / Reflection_Tuning

[ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning

How to evaluate the reflection-tuning model #1

Closed sglucas closed 1 year ago

sglucas commented 1 year ago

Hi

Thank you very much for your great work! May I ask how to evaluate the reflection-tuning model on tasks such as MMLU?

Best Lucas

MingLiiii commented 1 year ago

Hi, thanks for your interest! We use the code from https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard for evaluation on benchmarks like MMLU.
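
If it helps, that leaderboard runs EleutherAI's lm-evaluation-harness under the hood, so a rough sketch of reproducing its numbers locally looks like the following (the exact harness commit, model_args, and batch sizes the leaderboard uses may differ, and the few-shot settings below are as I recall them):

git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

# one run per leaderboard benchmark; <your_model> is a placeholder for your checkpoint path
python main.py --model hf-causal --model_args pretrained=<your_model> --tasks arc_challenge --num_fewshot 25
python main.py --model hf-causal --model_args pretrained=<your_model> --tasks hellaswag --num_fewshot 10
python main.py --model hf-causal --model_args pretrained=<your_model> --tasks hendrycksTest-* --num_fewshot 5
python main.py --model hf-causal --model_args pretrained=<your_model> --tasks truthfulqa_mc --num_fewshot 0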

sglucas commented 1 year ago

Thanks for your reply. I will try it.

sglucas commented 1 year ago

Dear Ming

Thanks for your help again!

I don't have much experience with instruction tuning. I tried to fine-tune LLaMA on the Alpaca dataset and then evaluate it. The accuracy I get is about 39.6, which is lower than the roughly 41.6 reported for chavinlo/alpaca-native on the leaderboard and in your paper. I'm not sure whether I need to run multiple times with different seeds.

Also, I only use 2 GPUs for fine-tuning (vanilla Alpaca uses 4), which means I need more gradient accumulation steps. I'm not sure whether that could be the main reason.
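
As a sanity check, my understanding is that the effective batch size still matches the original recipe (assuming the Stanford Alpaca README's 4 GPUs with per-device batch 4 and accumulation 8):

# effective batch size = n_gpus * per_device_train_batch_size * gradient_accumulation_steps
echo "vanilla Alpaca, 4 GPUs: $((4 * 4 * 8))"   # 128
echo "my run,         2 GPUs: $((2 * 2 * 32))"  # 128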

Thank you very much in advance.

Best Lucas

MingLiiii commented 1 year ago

Hi, can you provide more details about your model? For example, which codebase do you use? What max token length do you set? What are your learning rate and number of epochs? What prompt template do you use?

And how do you run the MMLU evaluation code? Can you share your scripts?

If your hyperparameters are exactly the same as the original Alpaca settings, it is probably due to randomness, although I would not expect the gap to be as large as yours...

sglucas commented 1 year ago

Hi Ming,

Thanks for your reply!

The codebase is vanilla Alpaca: https://github.com/tatsu-lab/stanford_alpaca

I think the max token length is the default value of 512.

I run the vanilla Alpaca training command:

CUDA_VISIBLE_DEVICES=6,7 torchrun --nproc_per_node=2 --master_port=3045 train.py \
    --model_name_or_path /data/llama-7B-hf/ \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir baseline_2e5 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 32 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 200 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True

I use lm-evaluation-harness to evaluate it on MMLU:

python main.py \
    --model hf-causal \
    --model_args pretrained=/data/code/stanford_alpaca/baseline_2e5 \
    --tasks hendrycksTest-* \
    --device cuda:6 

Best Lucas

MingLiiii commented 1 year ago

Ah, I think it is because of your MMLU test script. Leave me an email address and I can send you the script.
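
In the meantime, one likely difference worth checking: if I remember correctly, the leaderboard evaluates MMLU (hendrycksTest) 5-shot, while the harness defaults to 0-shot when --num_fewshot is not passed. A closer-to-leaderboard version of your command would look roughly like this:

python main.py \
    --model hf-causal \
    --model_args pretrained=/data/code/stanford_alpaca/baseline_2e5 \
    --tasks hendrycksTest-* \
    --num_fewshot 5 \
    --device cuda:6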

sglucas commented 1 year ago

Thanks for your reply.

My email is: lucasliunju@gmail.com

Best Lucas