huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

[Bug] Fix Zero-0 Optimizer States Not in Sync When Merging from Different Topologies #37

Closed · xrsrke closed this issue 8 months ago

xrsrke commented 8 months ago

Describe the bug: If you run the optimizer merging from TP=4 to TP=2 about three times, it raises an error in two of those runs, like the one below. I tried sorting the names before merging, and it somehow just works: after rerunning around ten times, the error no longer occurs.
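For context, here is a minimal sketch of what I suspect is going on (hypothetical helper names, not nanotron's actual API): if the merge iterates over optimizer-state names in whatever order the per-topology checkpoints yield them, different DP ranks can reassemble exp_avg / exp_avg_sq in different orders, and sorting the names first pins the order down.

from typing import Dict, List

import torch


def merge_optimizer_state_shards(
    shards_by_name: Dict[str, List[torch.Tensor]],
) -> Dict[str, torch.Tensor]:
    """Hypothetical illustration of the workaround, not nanotron's real code.

    `shards_by_name` maps a parameter name to its optimizer-state shards
    (e.g. exp_avg_sq) loaded from a checkpoint saved under another TP size.
    Iterating over sorted(...) names instead of raw dict order is the
    "sort the names" fix: every rank then concatenates the same shards in
    the same order.
    """
    merged = {}
    for name in sorted(shards_by_name):  # deterministic order across ranks
        merged[name] = torch.cat(shards_by_name[name], dim=0)
    return merged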

Reproduce:

USE_FAST=1 CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=4 /fsx/phuc/projects/nanotron/run_train.py --config-file downloads/debug_optim/zero0/config_tiny_llama_dp_2_tp2_pp1_with_no_zero.yaml
./fsx/phuc/projects/nanotron/downloads/debug_optim/test_loading_optimizer.sh --zero_stage=0

The error:

Saving weights:   0%|          | 0/15 [00:00<?, ?it/s]{'checkpoints': {'checkpoint_interval': 10, 'checkpoints_path': '/fsx/phuc/checkpoints/nanotron-optim-loading/no_zero1_dp_2_tp2_pp1', 'checkpoints_path_is_shared_file_system': True, 'resume_checkpoint_path': '/fsx/phuc/checkpoints/nanotron-optim-loading/no_zero1_dp_2_tp4_pp1', 'save_initial_state': False}, 'data': {'dataset': {'dataset_overwrite_cache': False, 'dataset_processing_num_proc_per_process': 1, 'hf_dataset_config_name': None, 'hf_dataset_or_datasets': 'TIGER-Lab/MathInstruct', 'hf_dataset_splits': 'train', 'text_column_name': 'output'}, 'num_loading_workers': 1, 'seed': 42}, 'general': {'benchmark_csv_path': None, 'consumed_train_samples': 600, 'ignore_sanity_checks': False, 'project': 'debug', 'run': 'tiny_llama', 'seed': 42, 'step': 30}, 'logging': {'iteration_step_info_interval': 1, 'log_level': 'info', 'log_level_replica': 'info'}, 'model': {'ddp_bucket_cap_mb': 25, 'dtype': 'bfloat16', 'init_method': {'std': 0.025}, 'make_vocab_size_divisible_by': 1, 'model_config': {'bos_token_id': 1, 'eos_token_id': 2, 'hidden_act': 'silu', 'hidden_size': 16, 'initializer_range': 0.02, 'intermediate_size': 64, 'is_llama_config': True, 'max_position_embeddings': 32, 'num_attention_heads': 4, 'num_hidden_layers': 2, 'num_key_value_heads': 4, 'pad_token_id': None, 'pretraining_tp': 1, 'rms_norm_eps': 1e-05, 'rope_scaling': None, 'tie_word_embeddings': True, 'use_cache': True, 'vocab_size': 50272}}, 'optimizer': {'accumulate_grad_in_fp32': True, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_eps': 1e-08, 'clip_grad': 1.0, 'learning_rate_scheduler': {'learning_rate': 0.0003, 'lr_decay_steps': 8, 'lr_decay_style': 'cosine', 'lr_warmup_steps': 2, 'lr_warmup_style': 'linear', 'min_decay_lr': 1e-05}, 'torch_adam_is_fused': True, 'weight_decay': 0.01, 'zero_stage': 0}, 'parallelism': {'dp': 2, 'pp': 1, 'pp_engine': '1f1b', 'recompute_granularity': 'SELECTIVE', 'tp': 2, 'tp_linear_async_communication': True, 'tp_mode': 'REDUCE_SCATTER'}, 'profiler': None, 'tokenizer': {'tokenizer_max_length': None, 'tokenizer_name_or_path': 'gpt2', 'tokenizer_revision': None}, 'tokens': {'batch_accumulation_per_replica': 1, 'limit_test_batches': 0, 'limit_val_batches': 0, 'micro_batch_size': 10, 'sequence_length': 32, 'train_steps': 30, 'val_check_interval': -1}}
Saving weights: 100%|██████████| 15/15 [00:00<00:00, 456.48it/s]
Saving weights: 100%|██████████| 15/15 [00:00<00:00, 377.41it/s]
Traceback (most recent call last):
  File "/fsx/phuc/projects/nanotron/run_train.py", line 136, in <module>
    trainer.train(dataloader)
  File "/fsx/phuc/projects/nanotron/src/nanotron/trainer.py", line 272, in train
    self.save_checkpoint()
  File "/fsx/phuc/projects/nanotron/src/nanotron/trainer.py", line 715, in save_checkpoint
    save(
  File "/fsx/phuc/projects/nanotron/src/nanotron/serialize/main.py", line 142, in save
    assert_tensor_synced_across_pg(
  File "/fsx/phuc/projects/nanotron/src/nanotron/sanity_checks.py", line 36, in assert_tensor_synced_across_pg
    torch.testing.assert_close(tensor, reference_tensor, msg=msg)
  File "/admin/home/phuc_nguyen/miniconda3/envs/nanotron-dev/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1520, in assert_close
    raise error_metas[0].to_error(msg)
AssertionError: exp_avg_sq are not synced across DP
Tensor-likes are not close!

Mismatched elements: 4268 / 402176 (1.1%)
Greatest absolute difference: 0.006173309404402971 at index (405, 3) (up to 1e-05 allowed)
Greatest relative difference: 3.53262996673584 at index (6233, 15) (up to 1.3e-06 allowed)
[2024-01-22 12:18:36,673] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1400849 closing signal SIGTERM
[2024-01-22 12:18:36,674] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1400850 closing signal SIGTERM
[2024-01-22 12:18:36,674] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1400851 closing signal SIGTERM
[2024-01-22 12:18:36,737] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 3 (pid: 1400852) of binary: /admin/home/phuc_nguyen/miniconda3/envs/nanotron-dev/bin/python
Traceback (most recent call last):
  File "/admin/home/phuc_nguyen/miniconda3/envs/nanotron-dev/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.2', 'console_scripts', 'torchrun')())
  File "/admin/home/phuc_nguyen/miniconda3/envs/nanotron-dev/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/admin/home/phuc_nguyen/miniconda3/envs/nanotron-dev/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/admin/home/phuc_nguyen/miniconda3/envs/nanotron-dev/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/admin/home/phuc_nguyen/miniconda3/envs/nanotron-dev/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/admin/home/phuc_nguyen/miniconda3/envs/nanotron-dev/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/phuc/projects/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-22_12:18:36
  host      : ip-26-0-167-51.ec2.internal
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 1400852)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
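For reference, the failing check (`assert_tensor_synced_across_pg` in `src/nanotron/sanity_checks.py`) boils down to comparing each DP rank's copy of the optimizer state against a reference copy from the group. A rough sketch of that kind of check (not the exact nanotron implementation) looks like this:

import torch
import torch.distributed as dist


def assert_state_synced_across_dp(tensor: torch.Tensor, dp_pg: dist.ProcessGroup, name: str) -> None:
    # Sketch only: take DP-rank 0's copy as the reference and compare the
    # local copy against it with torch's tolerance-based check, which is
    # what produces the "Tensor-likes are not close!" report above.
    reference = tensor.clone()
    dist.broadcast(reference, src=dist.get_global_rank(dp_pg, 0), group=dp_pg)
    torch.testing.assert_close(tensor, reference, msg=f"{name} are not synced across DP")

With zero_stage=0 the optimizer states are not sharded, so every DP rank is expected to hold identical exp_avg / exp_avg_sq tensors at checkpoint time; any rank-dependent merge order breaks that invariant.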