Hi, thanks for your interest! We use the code from https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard for evaluation on benchmarks like MMLU.
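For reference, a minimal sketch of a leaderboard-style run, assuming an older EleutherAI lm-evaluation-harness release (main.py entry point, hendrycksTest-* task names) and a placeholder model path <your_model_path>; the few-shot counts are the leaderboard's published settings (ARC-Challenge 25-shot, HellaSwag 10-shot, MMLU 5-shot, TruthfulQA-MC 0-shot):
python main.py --model hf-causal --model_args pretrained=<your_model_path> --tasks arc_challenge --num_fewshot 25 --device cuda:0
python main.py --model hf-causal --model_args pretrained=<your_model_path> --tasks hellaswag --num_fewshot 10 --device cuda:0
python main.py --model hf-causal --model_args pretrained=<your_model_path> --tasks "hendrycksTest-*" --num_fewshot 5 --device cuda:0
python main.py --model hf-causal --model_args pretrained=<your_model_path> --tasks truthfulqa_mc --num_fewshot 0 --device cuda:0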
Thanks for your reply and I will try it.
Dear Ming
Thanks for your help again!
I don't have much experience with instruction tuning. I tried to fine-tune LLaMA on the Alpaca dataset and evaluate it. The accuracy is about 39.6, which is lower than the roughly 41.6 reported for chavinlo/alpaca-native on the leaderboard and in your paper. I'm not sure whether I need to run it multiple times with different seeds.
Also, I only use 2 GPUs for fine-tuning (vanilla Alpaca uses 4), which means I need more gradient accumulation steps. I'm not sure whether that could be the main reason.
Thank you very much in advance.
Best Lucas
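On the gradient-accumulation point above, a quick sanity check, assuming the reference Alpaca run used 4 GPUs with per-device batch size 4 and 8 accumulation steps: the effective batch size is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs, i.e. 4 × 8 × 4 = 128 for the reference run and 2 × 2 × 32 = 128 for a 2-GPU run with per-device batch size 2 and 32 accumulation steps. If the effective batch size matches like this, the extra accumulation steps alone should not move the score beyond normal run-to-run noise.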
Hi, can you provide more details about your model? For example, what code base do you use? What max token length do you set? What are your learning rate and number of epochs? What's your prompt template?
And how do you run the MMLU evaluation code? Can you share your scripts?
If your hyperparameters are exactly the same as the original Alpaca settings, it is probably due to randomness, although I would not expect the gap to be as large as yours...
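One way to check the randomness hypothesis, assuming the stock stanford_alpaca train.py (which parses the Hugging Face TrainingArguments), is to pass an explicit seed and repeat the run with a couple of different values, e.g. adding to the torchrun command:
--seed 42 \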
Hi Ming,
Thanks for your reply!
The code base is vanilla alpaca: https://github.com/tatsu-lab/stanford_alpaca
I think the max token length is the default value of 512.
I run the vanilla Alpaca training command:
CUDA_VISIBLE_DEVICES=6,7 torchrun --nproc_per_node=2 --master_port=3045 train.py \
--model_name_or_path /data/llama-7B-hf/ \
--data_path ./alpaca_data.json \
--bf16 True \
--output_dir baseline_2e5 \
--num_train_epochs 3 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 32 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 200 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 True
I use lm-evaluation-harness to evaluate it on MMLU:
python main.py \
--model hf-causal \
--model_args pretrained=/data/code/stanford_alpaca/baseline_2e5 \
--tasks "hendrycksTest-*" \
--device cuda:6
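One possible source of the gap, assuming the leaderboard's published settings: the harness defaults to 0 few-shot examples when --num_fewshot is not passed, while the leaderboard's MMLU score is a 5-shot run, so adding --num_fewshot 5 (and optionally a --batch_size) to the command above should bring it closer to the leaderboard setup.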
Best Lucas
Ah, I think it is because of your MMLU test script. Leave me an email address and I can send you the script.
Thanks for your reply.
My email is: lucasliunju@gmail.com
Best Lucas
Hi
Thank you very much for your great work! May I ask how to evaluate the reflection tuning model on tasks such as MMLU?
Best Lucas