Due to GPU memory limitations, I cannot load the largest model onto a single GPU, so I want to shard the model across multiple GPUs. I noticed that the `zero_stage=1` option appears to be supported in utils.py:
```python
if args.zero_stage == 1:
    ds_config.update({"zero_optimization": {"stage": args.zero_stage, "reduce_bucket_size": 5e8}})
elif args.zero_stage > 1:
    raise NotImplementedError()
```
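For reference, here is a small self-contained sketch of what that branch produces when `--zero_stage 1` is passed; the argument definition and the base `ds_config` dict are my own illustrative assumptions, and only the `if/elif` mirrors the snippet above:

```python
import argparse

# Illustrative flag definition; the real one lives in the repo's argument parser.
parser = argparse.ArgumentParser()
parser.add_argument("--zero_stage", type=int, default=0)
args = parser.parse_args(["--zero_stage", "1"])

ds_config = {"train_micro_batch_size_per_gpu": 1}  # assumed base config for this sketch

# Mirrors the utils.py branch: only ZeRO stage 1 is wired up, higher stages raise.
if args.zero_stage == 1:
    ds_config.update({"zero_optimization": {"stage": args.zero_stage,
                                            "reduce_bucket_size": 5e8}})
elif args.zero_stage > 1:
    raise NotImplementedError()

print(ds_config)
# {'train_micro_batch_size_per_gpu': 1,
#  'zero_optimization': {'stage': 1, 'reduce_bucket_size': 500000000.0}}
```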
However, when I add `--zero_stage 1` to the run script, training fails with the following error message:
```
Traceback (most recent call last):
  File "run_beit3_finetuning.py", line 460, in <module>
    main(opts, ds_init)
  File "run_beit3_finetuning.py", line 398, in main
    train_stats = train_one_epoch(
  File "/root/env_run/beit3/engine_for_finetuning.py", line 733, in train_one_epoch
    model.backward(loss)
  File "/root/env_run/py38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1201, in backward
    self.allreduce_gradients()
  File "/root/env_run/py38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1123, in allreduce_gradients
    self.optimizer.reduce_gradients(
  File "/root/env_run/py38/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 489, in reduce_gradients
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/root/env_run/py38/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 1104, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/root/env_run/py38/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 733, in reduce_independent_p_g_buckets_and_remove_grads
    assert param.grad is not None, f"rank {dist.get_rank()} - Invalid to reduce Param {param_id} with None gradient"
AssertionError: rank 15 - Invalid to reduce Param 1 with None gradient
```
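For context on what the assertion checks: the reducer in stage2.py asserts that every trainable parameter has a gradient after backward. Below is a minimal PyTorch sketch, unrelated to BEiT-3 and purely illustrative, showing how a parameter that is skipped in `forward()` ends up with `grad is None`:

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)   # trainable, but never called in forward()

    def forward(self, x):
        return self.used(x)

model = Toy()
model(torch.randn(2, 4)).sum().backward()

for name, p in model.named_parameters():
    print(name, "grad is None" if p.grad is None else "has grad")
# used.weight / used.bias     -> has grad
# unused.weight / unused.bias -> grad is None (what the assertion above trips on)
```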
The DeepSpeed configuration printed in the output is:
```
[Rank 0] DeepSpeed Final Optimizer = adam
DeepSpeed using client LR scheduler
[Rank 0] DeepSpeed LR Scheduler = None
[Rank 0] step=0, skipped=0, lr=[0.0005, 0.0005], mom=[[0.9, 0.999], [0.9, 0.999]]
DeepSpeedEngine configuration:
  activation_checkpointing_config  {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
  }
  aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
  allreduce_always_fp32 ........ False
  amp_enabled .................. False
  amp_params ................... False
  checkpoint_tag_validation_enabled  True
  checkpoint_tag_validation_fail  False
  disable_allgather ............ False
  dump_state ................... False
  dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
  eigenvalue_enabled ........... False
  eigenvalue_gas_boundary_resolution  1
  eigenvalue_layer_name ........ bert.encoder.layer
  eigenvalue_layer_num ......... 0
  eigenvalue_max_iter .......... 100
  eigenvalue_stability ......... 1e-06
  eigenvalue_tol ............... 0.01
  eigenvalue_verbose ........... False
  elasticity_enabled ........... False
  flops_profiler_config ........ {
    "enabled": false,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
  }
  fp16_enabled ................. True
  fp16_mixed_quantize .......... False
  global_rank .................. 0
  gradient_accumulation_steps .. 1
  gradient_clipping ............ 0.0
  gradient_predivide_factor .... 1.0
  initial_dynamic_scale ........ 65536
  loss_scale ................... 0
  memory_breakdown ............. False
  optimizer_legacy_fusion ...... False
  optimizer_name ............... adam
  optimizer_params ............. {'lr': 0.0005, 'weight_decay': 0.05, 'bias_correction': True, 'betas': [0.9, 0.999], 'eps': 1e-08}
  pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
  pld_enabled .................. False
  pld_params ................... False
  prescale_gradients ........... False
  quantize_change_rate ......... 0.001
  quantize_groups .............. 1
  quantize_offset .............. 1000
  quantize_period .............. 1000
  quantize_rounding ............ 0
  quantize_start_bits .......... 16
  quantize_target_bits ......... 8
  quantize_training_enabled .... False
  quantize_type ................ 0
  quantize_verbose ............. False
  scheduler_name ............... None
  scheduler_params ............. None
  sparse_attention ............. None
  sparse_gradients_enabled ..... False
  steps_per_print .............. 1000
  tensorboard_enabled .......... False
  tensorboard_job_name ......... DeepSpeedJobName
  tensorboard_output_path ......
  train_batch_size ............. 16
  train_micro_batch_size_per_gpu  1
  use_quantizer_kernel ......... False
  wall_clock_breakdown ......... False
  world_size ................... 16
  zero_allow_untested_optimizer  False
  zero_config .................. {
    "stage": 1,
    "contiguous_gradients": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 5.000000e+08,
    "allgather_partitions": true,
    "allgather_bucket_size": 5.000000e+08,
    "overlap_comm": false,
    "load_from_fp32_weights": true,
    "elastic_checkpoint": true,
    "offload_param": null,
    "offload_optimizer": null,
    "sub_group_size": 1.000000e+12,
    "prefetch_bucket_size": 5.000000e+07,
    "param_persistence_threshold": 1.000000e+05,
    "max_live_parameters": 1.000000e+09,
    "max_reuse_distance": 1.000000e+09,
    "gather_fp16_weights_on_model_save": false,
    "ignore_unused_parameters": true,
    "legacy_stage1": false
  }
  zero_enabled ................. True
  zero_optimization_stage ...... 1
  json = {
    "train_batch_size": 16,
    "train_micro_batch_size_per_gpu": 1,
    "steps_per_print": 1000,
    "optimizer": {
      "type": "Adam",
      "adam_w_mode": true,
      "params": {
        "lr": 0.0005,
        "weight_decay": 0.05,
        "bias_correction": true,
        "betas": [0.9, 0.999],
        "eps": 1e-08
      }
    },
    "fp16": {
      "enabled": true,
      "auto_cast": false,
      "loss_scale": 0,
      "initial_scale_power": 16,
      "loss_scale_window": 1000,
      "hysteresis": 2,
      "min_loss_scale": 1
    },
    "zero_optimization": {
      "stage": 1,
      "reduce_bucket_size": 5.000000e+08
    }
  }
Using /root/.cache/torch_extensions as PyTorch extensions root...
```
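In case it helps with triage, here is a hypothetical debugging helper (not from the BEiT-3 repo; `model`, `batch`, and `loss_fn` are placeholders for whatever the finetuning script actually builds) that runs one plain forward/backward without DeepSpeed and lists the trainable parameters that never received a gradient, i.e. the ones the assertion complains about:

```python
import torch

def report_unused_params(model, batch, loss_fn):
    """List trainable parameters with no gradient after one plain backward pass.

    Standalone diagnostic run outside DeepSpeed; arguments are placeholders
    for the actual model, data batch, and loss computation.
    """
    model.zero_grad(set_to_none=True)
    loss = loss_fn(model, batch)
    loss.backward()
    unused = [name for name, p in model.named_parameters()
              if p.requires_grad and p.grad is None]
    print(f"{len(unused)} trainable parameters received no gradient")
    for name in unused:
        print("  ", name)
    return unused
```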
Thank you for any help : )
Could I add your contact information so that we can discuss BEiT? My WeChat is menghuaaa123 and my QQ is 2049314151.
Many thanks :)