huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

error resuming from checkpoint if PP > 1 #221

Closed · moussaKam closed this 2 days ago

moussaKam commented 3 months ago

After running the toy example, I ran it again to resume training, and I get an error, but only when PP > 1.

Here's the config:

checkpoints:
  checkpoint_interval: 25
  checkpoints_path: checkpoints
  checkpoints_path_is_shared_file_system: false
  resume_checkpoint_path: checkpoints/
  # resume_checkpoint_path: null
  save_initial_state: false
data_stages:
- data:
    dataset:
      dataset_overwrite_cache: false
      dataset_processing_num_proc_per_process: 1
      hf_dataset_config_name: null
      hf_dataset_or_datasets: stas/openwebtext-10k
      hf_dataset_splits: train
      text_column_name: text
    num_loading_workers: 1
    seed: 42
  name: Stable Training Stage
  start_training_step: 1
- data:
    dataset:
      dataset_overwrite_cache: false
      dataset_processing_num_proc_per_process: 1
      hf_dataset_config_name: null
      hf_dataset_or_datasets: stas/openwebtext-10k
      hf_dataset_splits: train
      text_column_name: text
    num_loading_workers: 1
    seed: 42
  name: Annealing Phase
  start_training_step: 10
general:
  benchmark_csv_path: null
  consumed_train_samples: null
  ignore_sanity_checks: true
  project: debug
  run: tiny_llama_%date_%jobid
  seed: 42
  step: null
lighteval: null
logging:
  iteration_step_info_interval: 1
  log_level: info
  log_level_replica: info
model:
  ddp_bucket_cap_mb: 25
  dtype: bfloat16
  init_method:
    std: 0.025
  make_vocab_size_divisible_by: 1
  model_config:
    bos_token_id: 1
    eos_token_id: 2
    hidden_act: silu
    hidden_size: 16
    initializer_range: 0.02
    intermediate_size: 64
    is_llama_config: true
    max_position_embeddings: 256
    num_attention_heads: 4
    num_hidden_layers: 2
    num_key_value_heads: 4
    pad_token_id: null
    pretraining_tp: 1
    rms_norm_eps: 1.0e-05
    rope_scaling: null
    tie_word_embeddings: true
    use_cache: true
    vocab_size: 256
optimizer:
  accumulate_grad_in_fp32: true
  clip_grad: 1.0
  learning_rate_scheduler:
    learning_rate: 0.0003
    lr_decay_starting_step: null
    lr_decay_steps: 13
    lr_decay_style: cosine
    lr_warmup_steps: 2
    lr_warmup_style: linear
    min_decay_lr: 1.0e-05
  optimizer_factory:
    adam_beta1: 0.9
    adam_beta2: 0.95
    adam_eps: 1.0e-08
    name: adamW
    torch_adam_is_fused: true
  weight_decay: 0.01
  zero_stage: 0
parallelism:
  dp: 2
  expert_parallel_size: 1
  pp: 2
  pp_engine: 1f1b
  tp: 1
  tp_linear_async_communication: true
  tp_mode: REDUCE_SCATTER
profiler: null
tokenizer:
  tokenizer_max_length: null
  tokenizer_name_or_path: robot-test/dummy-tokenizer-wordlevel
  tokenizer_revision: null
tokens:
  batch_accumulation_per_replica: 1
  limit_test_batches: 0
  limit_val_batches: 0
  micro_batch_size: 2
  sequence_length: 256
  train_steps: 1000
  val_check_interval: -1

Here's the error:

[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/mkamaleddine/nanotron/run_train.py", line 233, in <module>
[rank3]:     trainer = DistributedTrainer(config_file)
[rank3]:   File "/home/mkamaleddine/nanotron/src/nanotron/trainer.py", line 207, in __init__
[rank3]:     load_lr_scheduler(
[rank3]:   File "/home/mkamaleddine/nanotron/src/nanotron/serialize/optimizer.py", line 321, in load_lr_scheduler
[rank3]:     lr_scheduler.load_state_dict(state_dict)
[rank3]:   File "/home/mkamaleddine/anaconda3/envs/nanotron/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 380, in load_state_dict
[rank3]:     self.lr_lambdas[idx].__dict__.update(fn)
[rank3]: IndexError: list index out of range
TJ-Solergibert commented 2 months ago

Hi @moussaKam,

I'll open a PR with a fix soon, but meanwhile you can try applying the following commit yourself:

https://github.com/swiss-ai/nanotron/commit/664c09aa48204b8a45756e15fac9cc6bf0b38ccf

Toni

moussaKam commented 2 months ago

Thanks @TJ-Solergibert

alexchen4ai commented 4 days ago

> I'll open a PR with a fix soon, but meanwhile you can try applying the following commit yourself: swiss-ai@664c09a

If I have already trained a model for a week and then run into this issue, is it still possible to resume? That way I wouldn't need to retrain the model for another week.

alexchen4ai commented 4 days ago

Just solved it with a hardcoded workaround. NVM.

TJ-Solergibert commented 4 days ago

Nice! I was going to suggest training with the fixed PR for a single iteration, storing one checkpoint after that iteration, and then copying the values from the SINGLE original .pt checkpoint into the new .pt files. You can't directly duplicate the original file, because the files will most likely have a different size/shape on each and every PP rank.
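
Something along these lines should work for the copy step (the paths and key names below are placeholders, not nanotron's actual checkpoint layout, so inspect the real .pt files first):

import torch

# Illustrative only: adjust paths and keys to whatever the real checkpoints contain.
old_state = torch.load("original_ckpt/lr_scheduler.pt", map_location="cpu")

for rank_file in ["new_ckpt/lr_scheduler_pp-rank-0.pt", "new_ckpt/lr_scheduler_pp-rank-1.pt"]:
    new_state = torch.load(rank_file, map_location="cpu")
    # Copy only the scalar progress fields; leave per-rank, shape-dependent entries
    # from the freshly saved checkpoint untouched.
    for key in ("last_epoch", "_step_count", "_last_lr"):
        if key in old_state and key in new_state:
            new_state[key] = old_state[key]
    torch.save(new_state, rank_file)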

alexchen4ai commented 4 days ago

Thanks. What I did was hardcode the lr scheduler loading:

# Patch inside DistributedTrainer.__init__ (src/nanotron/trainer.py), replacing the plain
# load_lr_scheduler call; load_lr_scheduler, load_meta and TrainingMetadata are assumed
# to already be imported there.
if self.init_checkpoint_path is not None:
    try:
        load_lr_scheduler(
            lr_scheduler=self.lr_scheduler,
            root_folder=self.init_checkpoint_path,
        )
    except (IndexError, RuntimeError) as e:
        logger.warning(f"Failed to load lr_scheduler state: {e}. Initializing new scheduler.")
        # Recover the training progress recorded in the checkpoint metadata
        checkpoint_metadata = load_meta(
            parallel_context=self.parallel_context,
            root_folder=self.init_checkpoint_path,
        )
        assert isinstance(checkpoint_metadata.metas, TrainingMetadata)
        current_step = checkpoint_metadata.metas.last_train_step

        # Fast-forward the freshly initialized scheduler to the checkpointed step
        for _ in range(current_step):
            self.lr_scheduler.step()

At least it solves the problem for now. I will use the new codebase with your PR for future training runs.
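
As a sanity check for that fast-forward, here is a standalone sketch using the schedule values from the config above (3e-4 peak, 2 linear warmup steps, cosine decay over 13 steps down to 1e-5); the lambda is a generic warmup-plus-cosine shape, not nanotron's exact implementation:

import math
import torch

peak_lr, min_lr, warmup_steps, decay_steps = 3e-4, 1e-5, 2, 13

def lr_lambda(step: int) -> float:
    # Generic linear-warmup + cosine-decay shape (an approximation of the config above)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return (min_lr + (peak_lr - min_lr) * cosine) / peak_lr

opt = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=peak_lr)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

for _ in range(25):  # fast-forward to the checkpointed step (checkpoint_interval is 25)
    sched.step()
print(sched.get_last_lr())  # decay ends by step 15, so this should sit at min_decay_lr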

NouamaneTazi commented 3 days ago

Thank you for opening the issue @moussaKam! The issue happens because LambdaLR creates as many lr_lambdas as there are param_groups. Whereas we previously had a single param_group containing all the params, we recently switched to a single param per param_group, which is what created this issue (every process now has a different number of params = param_groups = lr_lambdas). Nonetheless @alexchen4ai, fixing this is easy: you can just load the lr_lambdas state for a single param_group and duplicate it (using deepcopies), assuming of course that you want all your parameters to follow the same lr schedule, which is the default in nanotron!
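
For example, a minimal sketch of that duplication (the helper name is made up, and list-valued entries such as base_lrs may need the same resizing):

import copy

def expand_lr_lambdas(lr_scheduler, state_dict):
    # All param_groups follow the same schedule, so replicate the state of a single
    # saved lr_lambda for every lr_lambda of the current (one-param-per-group) scheduler.
    n_groups = len(lr_scheduler.lr_lambdas)
    saved = [fn for fn in state_dict["lr_lambdas"] if fn is not None]
    template = saved[0] if saved else None  # None means a plain function with no state
    state_dict["lr_lambdas"] = [copy.deepcopy(template) for _ in range(n_groups)]
    lr_scheduler.load_state_dict(state_dict)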