huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4

Training Finishes Prematurely After Max Length Increases #36

Open ujjawalmadan opened 8 months ago

ujjawalmadan commented 8 months ago

Has anyone else experienced cases where training finishes early as the max sequence length increases?

I ran the script on a custom dataset with the config below, on an 8xA100 (40GB) cluster. There were no CUDA errors; training just moved to evaluation before it should have.

model_name_or_path: mistralai/Mistral-7B-v0.1
torch_dtype: auto
use_flash_attention_2: true

# LoRA arguments
use_peft: true
lora_r: 64
lora_alpha: 16
lora_dropout: 0.1
lora_target_modules:
- q_proj
- k_proj
- v_proj
- o_proj

# Data training arguments
preprocessing_num_workers: 12

# SFT trainer config
bf16: true
do_eval: true
evaluation_strategy: epoch
gradient_accumulation_steps: 128
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: custom-model
hub_strategy: every_save
learning_rate: 2.0e-05
log_level: info
logging_steps: 5  
logging_strategy: steps
lr_scheduler_type: cosine
max_seq_length: 8192
max_steps: -1
num_train_epochs: 1
output_dir: data/zephyr-7b-sft-lora
overwrite_output_dir: true
per_device_eval_batch_size: 8
per_device_train_batch_size: 4
push_to_hub: false
report_to:
- tensorboard
save_strategy: "no"
save_total_limit: null
seed: 42
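
As a sanity check, the batch size and step count the Trainer reports below follow directly from this config. A minimal sketch, assuming a world size of 8 (one process per A100 on the cluster mentioned above; the GPU count is not part of the YAML itself):

# Rough sanity check of the totals the Trainer logs below.
# world_size = 8 is an assumption based on the 8xA100 cluster
# described above; it does not appear in the YAML config itself.
num_train_examples = 199_500            # "Num examples" in the training log
per_device_train_batch_size = 4
gradient_accumulation_steps = 128
world_size = 8                          # assumed: one process per GPU

total_train_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * world_size
)   # 4 * 128 * 8 = 4,096 -- matches "Total train batch size"

total_optimization_steps = num_train_examples // total_train_batch_size
# 199_500 // 4_096 = 48 -- matches "Total optimization steps = 48"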

Got this result:

[INFO|trainer.py:1723] 2023-11-16 03:59:23,969 >> ***** Running training *****
[INFO|trainer.py:1724] 2023-11-16 03:59:23,969 >>   Num examples = 199,500
[INFO|trainer.py:1725] 2023-11-16 03:59:23,969 >>   Num Epochs = 1
[INFO|trainer.py:1726] 2023-11-16 03:59:23,969 >>   Instantaneous batch size per device = 4
[INFO|trainer.py:1729] 2023-11-16 03:59:23,969 >>   Total train batch size (w. parallel, distributed & accumulation) = 4,096
[INFO|trainer.py:1730] 2023-11-16 03:59:23,969 >>   Gradient Accumulation steps = 128
[INFO|trainer.py:1731] 2023-11-16 03:59:23,969 >>   Total optimization steps = 48
[INFO|trainer.py:1732] 2023-11-16 03:59:23,972 >>   Number of trainable parameters = 54,525,952
  0%|                                                                                               | 0/48 [00:00<?, ?it/s]
[WARNING|tokenization_utils_base.py:3831] 2023-11-16 03:59:25,691 >> Token indices sequence length is longer than the specified maximum sequence length for this model (2576 > 2048). Running this sequence through the model will result in indexing errors
[WARNING|logging.py:314] 2023-11-16 03:59:26,260 >> You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
{'loss': 1.004, 'learning_rate': 1.9978589232386036e-05, 'epoch': 0.02}
{'loss': 0.9924, 'learning_rate': 1.946930129495106e-05, 'epoch': 0.1}
 10%|████████▎                                                                       | 5/48 [1:42:30<14:42:30, 1231.40s/it]
[INFO|trainer.py:3158] 2023-11-16 05:48:55,043 >> ***** Running Evaluation *****
[INFO|trainer.py:3160] 2023-11-16 05:48:55,043 >>   Num examples = 10500
[INFO|trainer.py:3163] 2023-11-16 05:48:55,043 >>   Batch size = 8
{'eval_loss': 0.9804360270500183, 'eval_runtime': 113.7535, 'eval_samples_per_second': 92.305, 'eval_steps_per_second': 1.451, 'epoch': 0.1}
 10%|████████▎                                                                       | 5/48 [1:51:24<14:42:30, 1231.40s/it]
[INFO|trainer.py:1955] 2023-11-16 05:50:48,798 >>

Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 6684.8263, 'train_samples_per_second': 29.844, 'train_steps_per_second': 0.007, 'train_loss': 1.0607780635356903, 'epoch': 0.1}
 10%|████████▎                                                                       | 5/48 [1:51:24<15:58:09, 1336.96s/it]
***** train metrics *****
  epoch                    =        0.1
  train_loss               =     1.0608
  train_runtime            = 1:51:24.82
  train_samples            =     199500
  train_samples_per_second =     29.844
  train_steps_per_second   =      0.007
2023-11-16 05:50:48 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:3158] 2023-11-16 05:50:48,801 >> ***** Running Evaluation *****
[INFO|trainer.py:3160] 2023-11-16 05:50:48,801 >>   Num examples = 10500
[INFO|trainer.py:3163] 2023-11-16 05:50:48,801 >>   Batch size = 8
 12%|█████████▊                                                                           | 19/165 [01:44<13:21,  5.49s/it]
***** eval metrics *****
  epoch                   =        0.1
  eval_loss               =     0.9817
  eval_runtime            = 0:01:52.87
  eval_samples            =      10500
  eval_samples_per_second =     93.026
  eval_steps_per_second   =      1.462
2023-11-16 05:52:41 - INFO - __main__ - *** Save model ***

Even after lowering max_seq_length to 4096 tokens, training still ended early, but this time at 20%. When running on the default dataset, the same thing occurred, but at 33%.

Thoughts?

lewtun commented 8 months ago

Hi @ujjawalmadan, thanks for the detailed error report! I think this is related to a logging bug in TRL when `packing=True`: the displayed total number of steps is computed from the original dataset rather than the packed (chunked) one.

There's a PR here to fix that: https://github.com/huggingface/trl/pull/979

In general, I think your training runs are fine; it's just a logging issue.
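
To see why the numbers line up with this explanation, here is a rough sketch. Only max_seq_length, the raw example count, and the batch-size arithmetic come from this thread; the average example length is a hypothetical figure chosen purely for illustration:

# Why packing shrinks the effective dataset, and why the progress bar
# stops near 10%. avg_tokens_per_example is hypothetical; everything
# else is taken from the config and logs above.
num_raw_examples = 199_500
avg_tokens_per_example = 850            # assumed, for illustration only
max_seq_length = 8_192
total_train_batch_size = 4 * 128 * 8    # 4,096, from the config

total_tokens = num_raw_examples * avg_tokens_per_example
num_packed_examples = total_tokens // max_seq_length          # ~20,700 chunks

actual_steps = num_packed_examples // total_train_batch_size  # ~5
displayed_steps = num_raw_examples // total_train_batch_size  # 48 (the bug)

print(actual_steps, displayed_steps)    # 5 48 -> bar stops near 5/48, i.e. ~10%

Under the same assumption, halving max_seq_length to 4096 roughly doubles the number of packed chunks and therefore the number of real optimization steps, which is consistent with the run above stopping at about 20% instead of 10%.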

ujjawalmadan commented 8 months ago

Thank you! I realized this morning that this may have been the issue, and then saw your post. Thanks!