microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

BEIT3: How can I shard one model across multiple GPUs? #1051

Closed lwgkzl closed 1 year ago

lwgkzl commented 1 year ago

Due to GPU memory limitations, I cannot load the largest model onto a single GPU, so I want to shard the model across multiple GPUs. I noticed that a 'zero_stage=1' option appears to be supported in utils.py:

        # utils.py: only ZeRO stage 1 is wired up; any higher stage is rejected
        if args.zero_stage == 1:
            ds_config.update({"zero_optimization": {"stage": args.zero_stage, "reduce_bucket_size": 5e8}})
        elif args.zero_stage > 1:
            raise NotImplementedError()
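
For context, ZeRO stage 1 only partitions the optimizer states across ranks; the parameters themselves are still replicated on every GPU, so stage 1 alone would not let a too-large model fit. Sharding the weights themselves is ZeRO stage 3, which the elif branch above rejects. Purely as a hypothetical sketch (this is the shape a stage-3 update would take in DeepSpeed terms; the script does not support it, and it is untested here):

    # Hypothetical sketch only: run_beit3_finetuning.py raises
    # NotImplementedError for any zero_stage > 1, so this is NOT working
    # code for this repo, just what a stage-3 config would look like.
    ds_config.update({
        "zero_optimization": {
            "stage": 3,  # partition optimizer states, gradients, AND parameters
            "reduce_bucket_size": 5e8,
        }
    })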

However, when I add '--zero_stage 1' to the run script, it fails with this error message:

Mon Apr  3 16:26:32 2023[1,1]<stderr>:Traceback (most recent call last):
Mon Apr  3 16:26:32 2023[1,1]<stderr>:  File "run_beit3_finetuning.py", line 460, in <module>
Mon Apr  3 16:26:32 2023[1,1]<stderr>:    main(opts, ds_init)
Mon Apr  3 16:26:32 2023[1,1]<stderr>:  File "run_beit3_finetuning.py", line 398, in main
Mon Apr  3 16:26:32 2023[1,1]<stderr>:    train_stats = train_one_epoch(
Mon Apr  3 16:26:32 2023[1,1]<stderr>:  File "/root/env_run/beit3/engine_for_finetuning.py", line 733, in train_one_epoch
Mon Apr  3 16:26:32 2023[1,1]<stderr>:    model.backward(loss)
Mon Apr  3 16:26:32 2023[1,1]<stderr>:  File "/root/env_run/py38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1201, in backward
Mon Apr  3 16:26:32 2023[1,1]<stderr>:    self.allreduce_gradients()
Mon Apr  3 16:26:32 2023[1,1]<stderr>:  File "/root/env_run/py38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1123, in allreduce_gradients
Mon Apr  3 16:26:32 2023[1,1]<stderr>:    self.optimizer.reduce_gradients(
Mon Apr  3 16:26:32 2023[1,1]<stderr>:  File "/root/env_run/py38/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 489, in reduce_gradients
Mon Apr  3 16:26:32 2023[1,1]<stderr>:    self.reduce_ready_partitions_and_remove_grads(param, i)
Mon Apr  3 16:26:32 2023[1,1]<stderr>:  File "/root/env_run/py38/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 1104, in reduce_ready_partitions_and_remove_grads
Mon Apr  3 16:26:32 2023[1,1]<stderr>:    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
Mon Apr  3 16:26:32 2023[1,1]<stderr>:  File "/root/env_run/py38/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 733, in reduce_independent_p_g_buckets_and_remove_grads
Mon Apr  3 16:26:32 2023[1,1]<stderr>:    assert param.grad is not None, f"rank {dist.get_rank()} - Invalid to reduce Param {param_id} with None gradient"
Mon Apr  3 16:26:32 2023[1,1]<stderr>:AssertionError: rank 15 - Invalid to reduce Param 1 with None gradient
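
For what it's worth, that assertion fires when some registered parameter never received a gradient during backward, which is exactly what the stage2.py reducer checks for. A minimal PyTorch sketch (not BEiT-3 code) of how a parameter ends up with grad=None:

    import torch
    import torch.nn as nn

    # A module with a branch that is never used in forward(); after
    # backward(), its parameters still have grad=None, which is the
    # situation the DeepSpeed assertion above complains about.
    class TwoBranch(nn.Module):
        def __init__(self):
            super().__init__()
            self.used = nn.Linear(4, 4)
            self.unused = nn.Linear(4, 4)  # registered but never called

        def forward(self, x):
            return self.used(x)

    model = TwoBranch()
    model(torch.randn(2, 4)).sum().backward()
    print(model.used.weight.grad is not None)   # True
    print(model.unused.weight.grad is None)     # True -> would trip the assert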

My config, as printed in the output, is:

Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,721] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed Final Optimizer = adam
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,721] [INFO] [engine.py:509:_configure_lr_scheduler] DeepSpeed using client LR scheduler
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,722] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,722] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0005, 0.0005], mom=[[0.9, 0.999], [0.9, 0.999]]
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,722] [INFO] [config.py:900:print] DeepSpeedEngine configuration:
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,722] [INFO] [config.py:904:print]   activation_checkpointing_config  {
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "partition_activations": false, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "contiguous_memory_optimization": false, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "cpu_checkpointing": false, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "number_checkpoints": null, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "synchronize_checkpoint_boundary": false, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "profile": false
Mon Apr  3 16:25:34 2023[1,0]<stdout>:}
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,722] [INFO] [config.py:904:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,722] [INFO] [config.py:904:print]   allreduce_always_fp32 ........ False
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   amp_enabled .................. False
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   amp_params ................... False
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   checkpoint_tag_validation_enabled  True
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   checkpoint_tag_validation_fail  False
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   disable_allgather ............ False
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   dump_state ................... False
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   eigenvalue_enabled ........... False
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   eigenvalue_gas_boundary_resolution  1
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   eigenvalue_layer_name ........ bert.encoder.layer
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   eigenvalue_layer_num ......... 0
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   eigenvalue_max_iter .......... 100
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   eigenvalue_stability ......... 1e-06
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   eigenvalue_tol ............... 0.01
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   eigenvalue_verbose ........... False
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   elasticity_enabled ........... False
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   flops_profiler_config ........ {
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "enabled": false, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "profile_step": 1, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "module_depth": -1, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "top_modules": 1, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "detailed": true, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "output_file": null
Mon Apr  3 16:25:34 2023[1,0]<stdout>:}
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   fp16_enabled ................. True
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   fp16_mixed_quantize .......... False
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   global_rank .................. 0
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   gradient_accumulation_steps .. 1
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   gradient_clipping ............ 0.0
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   gradient_predivide_factor .... 1.0
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   initial_dynamic_scale ........ 65536
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   loss_scale ................... 0
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   memory_breakdown ............. False
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   optimizer_legacy_fusion ...... False
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   optimizer_name ............... adam
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,723] [INFO] [config.py:904:print]   optimizer_params ............. {'lr': 0.0005, 'weight_decay': 0.05, 'bias_correction': True, 'betas': [0.9, 0.999], 'eps': 1e-08}
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   pld_enabled .................. False
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   pld_params ................... False
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   prescale_gradients ........... False
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   quantize_change_rate ......... 0.001
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   quantize_groups .............. 1
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   quantize_offset .............. 1000
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   quantize_period .............. 1000
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   quantize_rounding ............ 0
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   quantize_start_bits .......... 16
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   quantize_target_bits ......... 8
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   quantize_training_enabled .... False
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   quantize_type ................ 0
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   quantize_verbose ............. False
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   scheduler_name ............... None
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   scheduler_params ............. None
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   sparse_attention ............. None
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   sparse_gradients_enabled ..... False
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   steps_per_print .............. 1000
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   tensorboard_enabled .......... False
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   tensorboard_job_name ......... DeepSpeedJobName
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   tensorboard_output_path ...... 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   train_batch_size ............. 16
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   train_micro_batch_size_per_gpu  1
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   use_quantizer_kernel ......... False
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   wall_clock_breakdown ......... False
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   world_size ................... 16
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   zero_allow_untested_optimizer  False
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   zero_config .................. {
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "stage": 1, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "contiguous_gradients": false, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "reduce_scatter": true, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "reduce_bucket_size": 5.000000e+08, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "allgather_partitions": true, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "allgather_bucket_size": 5.000000e+08, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "overlap_comm": false, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "load_from_fp32_weights": true, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "elastic_checkpoint": true, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "offload_param": null, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "offload_optimizer": null, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "sub_group_size": 1.000000e+12, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "prefetch_bucket_size": 5.000000e+07, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "param_persistence_threshold": 1.000000e+05, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "max_live_parameters": 1.000000e+09, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "max_reuse_distance": 1.000000e+09, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "gather_fp16_weights_on_model_save": false, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "ignore_unused_parameters": true, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "legacy_stage1": false
Mon Apr  3 16:25:34 2023[1,0]<stdout>:}
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,724] [INFO] [config.py:904:print]   zero_enabled ................. True
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,725] [INFO] [config.py:904:print]   zero_optimization_stage ...... 1
Mon Apr  3 16:25:34 2023[1,0]<stdout>:[2023-04-03 16:25:34,725] [INFO] [config.py:906:print]   json = {
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "train_batch_size": 16, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "train_micro_batch_size_per_gpu": 1, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "steps_per_print": 1000, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "optimizer": {
Mon Apr  3 16:25:34 2023[1,0]<stdout>:        "type": "Adam", 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:        "adam_w_mode": true, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:        "params": {
Mon Apr  3 16:25:34 2023[1,0]<stdout>:            "lr": 0.0005, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:            "weight_decay": 0.05, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:            "bias_correction": true, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:            "betas": [0.9, 0.999], 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:            "eps": 1e-08
Mon Apr  3 16:25:34 2023[1,0]<stdout>:        }
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    }, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "fp16": {
Mon Apr  3 16:25:34 2023[1,0]<stdout>:        "enabled": true, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:        "auto_cast": false, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:        "loss_scale": 0, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:        "initial_scale_power": 16, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:        "loss_scale_window": 1000, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:        "hysteresis": 2, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:        "min_loss_scale": 1
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    }, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    "zero_optimization": {
Mon Apr  3 16:25:34 2023[1,0]<stdout>:        "stage": 1, 
Mon Apr  3 16:25:34 2023[1,0]<stdout>:        "reduce_bucket_size": 5.000000e+08
Mon Apr  3 16:25:34 2023[1,0]<stdout>:    }
Mon Apr  3 16:25:34 2023[1,0]<stdout>:}
Mon Apr  3 16:25:34 2023[1,0]<stdout>:Using /root/.cache/torch_extensions as PyTorch extensions root...
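
One sanity check from this dump: DeepSpeed requires that train_batch_size equal train_micro_batch_size_per_gpu × gradient_accumulation_steps × world_size, and that holds here (16 = 1 × 1 × 16), so the batch configuration itself looks consistent:

    # The consistency rule DeepSpeed enforces on the values printed above.
    train_batch_size = 16
    train_micro_batch_size_per_gpu = 1
    gradient_accumulation_steps = 1
    world_size = 16
    assert train_batch_size == (
        train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
    )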

Thank you for any help : )

menghuaa commented 1 year ago

> [quotes lwgkzl's original report above in full, including the same traceback and DeepSpeed config dump]

May I add your contact information so that we can discuss BEiT? My WeChat is menghuaaa123 and my QQ is 2049314151.

lwgkzl commented 1 year ago

Many thanks :)