tianyi-lab / Cherry_LLM

[NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other models

What is the training bash script for FastChat? #24

Open daidaiershidi opened 6 days ago

daidaiershidi commented 6 days ago

Thank you very much for this work; it is very helpful for those of us with limited training resources. I am a newcomer to NLP and not very familiar with training frameworks (this is my first time training an LLM), and my final results are very poor, far worse than those of another reproducer. Here is the script I used:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
torchrun --nproc_per_node=8 --master_port=20001 FastChat/fastchat/train/train_mem.py \
    --model_name_or_path .cache/hub/models--meta-llama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9 \
    --data_path ${pre_data_path} \
    --bf16 True \
    --output_dir ${output_dir} \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True

When training the final model:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
torchrun --nproc_per_node=8 --master_port=20001 FastChat/fastchat/train/train_mem.py \
    --model_name_or_path .cache/hub/models--meta-llama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9 \
    --data_path ${cherry_data_path} \
    --bf16 True \
    --output_dir ${output_dir} \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True
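
For reference, my understanding of the effective global batch size implied by these flags (my own arithmetic, not something stated in the repo):

# effective batch size = nproc_per_node * per_device_train_batch_size * gradient_accumulation_steps
#                      = 8 * 1 * 16 = 128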

Is there a mistake in my training configuration?

MingLiiii commented 6 days ago

Thank you again for your interest in our work.

I think the script you are using does not have any major issues. However, since you are a beginner and this is your first time training an LLM, I recommend you first try training on the original Alpaca dataset, which is provided in our data folder, and check whether the performance is reasonable. That way you can verify that the training pipeline you built is correct.
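
For example, the sanity-check run can reuse your command above and only swap the data path; a minimal sketch (the filename below is a guess, substitute whatever the Alpaca file in the data folder is actually called):

# Sanity check: train on the full original Alpaca data with the same command as your final-model run.
# data/alpaca_data.json is a hypothetical path; use the Alpaca file from our data folder.
cherry_data_path=data/alpaca_data.json
# then rerun your torchrun command unchanged (--data_path ${cherry_data_path}, --num_train_epochs 3)

If that run reproduces reasonable numbers, the training pipeline itself is fine.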

daidaiershidi commented 6 days ago

> Thank you again for your interest in our work.
>
> I think the script you are using does not have any major issues. However, since you are a beginner and this is your first time training an LLM, I recommend you first try training on the original Alpaca dataset, which is provided in our data folder, and check whether the performance is reasonable. That way you can verify that the training pipeline you built is correct.

|   | Avg | ARC-challenge | HellaSwag_acc_norm | MMLU | truthfulqa_mc2 |
| -- | -- | -- | -- | -- | -- |
| Alpaca in paper | 50.21 | 42.65 | 76.91 | 41.73 | 39.55 |
| WizardLM-10% in paper | 57.57 | 54.86 | 80.46 | 45.74 | 49.20 |
| my_alpaca | 52.62 | 46.25 | 76.18 | 45.39 | 42.65 |
| my_WizardLM-10% | 52.08 | 46.67 | 74.70 | 40.00 | 46.94 |

It seems there is no problem when training on the full Alpaca dataset, but when I train llama2-7b on the 10% WizardLM data from Data and Model Weights V1, the performance is worse.