zanqi opened this issue 6 days ago
cc @SunMarc
Hey @zanqi, thanks for the report. I am unable to reproduce your results. Could you share a minimal reproducer? I get the following result in my case:
As you can see, we do have a warmup phase followed by the cosine decay. In the image you shared, the warmup also doesn't seem to be linear.
I've used the following script: https://github.com/SunMarc/minimal-trainer-zoo/blob/main/causal_language_modeling.py with these args:
training_args = TrainingArguments(
output_dir="results/causal_language_modeling", # Where weights are stored
learning_rate=1e-5, # The learning rate during training
per_device_train_batch_size=8, # Number of samples per batch during training
per_device_eval_batch_size=8, # Number of samples per batch during evaluation
num_train_epochs=10, # How many iterations through the dataloaders should be done
weight_decay=0, # Regularization penalization
evaluation_strategy="epoch", # How often metrics on the evaluation dataset should be computed
save_strategy="epoch", # When to try and save the best model (such as a step number or every iteration)
lr_scheduler_type="cosine",
report_to="wandb",
warmup_ratio= 0.03,
    logging_steps=1, # log every step; otherwise we log every 500 steps
)
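For context, those arguments translate into a linear warmup followed by a cosine decay, roughly like the sketch below (using transformers.get_scheduler directly; the step counts are placeholders, not the real dataloader length):

import torch
from transformers import get_scheduler

total_steps = 1000                       # placeholder for num_train_epochs * steps per epoch
warmup_steps = int(0.03 * total_steps)   # from warmup_ratio=0.03

optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=1e-5)
scheduler = get_scheduler(
    "cosine",
    optimizer=optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
)

lrs = []
for _ in range(total_steps):
    optimizer.step()
    scheduler.step()
    lrs.append(scheduler.get_last_lr()[0])
# lrs rises linearly over the first ~30 steps, then follows a cosine decay toward 0.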
I am using these steps:
sh -x VILA/scripts/v1_5/ft/train_xyxy.slurm
The script in step three requires a dataset in LLaVA format: https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md
I haven't pushed my dataset. This page has the steps to download one: https://wandb.ai/byyoung3/ml-news/reports/How-to-Fine-Tune-LLaVA-on-a-Custom-Dataset--Vmlldzo2NjUwNTc1
These three lines in train_xyxy.slurm should be changed to point to the dataset.
--data_path ../LLaVA/armbench/train/dataset_xyxy_sorted.json \
--validation_data_path ../LLaVA/armbench/validation/dataset_xyxy_sorted.json \
--image_folder ../LLaVA/armbench/images/ \
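For reference, each record in a LLaVA-format dataset JSON looks roughly like the sketch below (field names follow the Finetune_Custom_Data doc linked above; the id, image path, and conversation text are made-up placeholders):

import json

record = {
    "id": "sample-0001",
    "image": "sample-0001.jpg",  # relative to --image_folder
    "conversations": [
        {"from": "human", "value": "<image>\nWhere is the object in this picture?"},
        {"from": "gpt", "value": "The object is in the top-left corner of the tote."},
    ],
}

# The file passed to --data_path is a JSON list of such records.
with open("dataset_xyxy_sorted.json", "w") as f:
    json.dump([record], f, indent=2)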
I found the issue comes from the DeepSpeed zero3_offload.json file used by my command. It has these lines:
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
They override the scheduler type I set on the command line. Removing them seems to fix the problem. I don't know how DeepSpeed wraps around the Hugging Face Trainer; if you have some info on this, it would be helpful for future reference.
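If you want to keep the rest of zero3_offload.json but let the Trainer's lr_scheduler_type take effect, a minimal sketch (file names and paths are placeholders for wherever the config lives in your setup) is to strip the scheduler block before launching:

import json

with open("zero3_offload.json") as f:
    ds_config = json.load(f)

# Drop DeepSpeed's own scheduler so the HF Trainer builds the one from --lr_scheduler_type
ds_config.pop("scheduler", None)

with open("zero3_offload_no_sched.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# Then point --deepspeed at zero3_offload_no_sched.json in train_xyxy.slurm.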
System Info
NA
Who can help?
@muellerzr @SunMarc
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
I ran the following command to start a training job, but the learning rate does not decay as expected. Do I need to change any parameter to make the cosine schedule work?
Expected behavior
The LR should decay following a cosine curve.