mintuos closed this issue 9 months ago.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
cc @pacman100
System Info
and the DeepSpeed JSON config is:
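(I did not paste the file itself here; reconstructed from the print_user_config block at the end of the log below, it should read roughly as follows. The `"steps_per_print": inf` entry from that printout is omitted, since `inf` is not a valid JSON literal and is just how the engine prints it.)

```json
{
  "train_batch_size": 112,
  "train_micro_batch_size_per_gpu": 16,
  "gradient_accumulation_steps": 1,
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "nvme_path": null },
    "offload_param": { "device": "cpu", "nvme_path": null },
    "stage3_gather_16bit_weights_on_model_save": false
  },
  "gradient_clipping": 0.8,
  "bf16": { "enabled": true },
  "fp16": { "enabled": false },
  "zero_allow_untested_optimizer": true
}
```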
I don't use Adam in my code, but the log output shows this:

Loading extension module cpu_adam...
Time to load cpu_adam op: 2.8985447883605957 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.80269718170166 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.916116952896118 seconds
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.999000), weight_decay=0.010000, adam_w=1
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.999000), weight_decay=0.010000, adam_w=1
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.999000), weight_decay=0.010000, adam_w=1
Adam Optimizer #0 is created with AVX512 arithmetic capability.
[2023-11-19 15:27:06,396] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-11-19 15:27:06,396] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-11-19 15:27:06,669] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedCPUAdam
[2023-11-19 15:27:06,669] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-11-19 15:27:06,669] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2023-11-19 15:27:06,669] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-06], mom=[(0.9, 0.99)]
[2023-11-19 15:27:06,669] [INFO] [config.py:972:print] DeepSpeedEngine configuration:
[2023-11-19 15:27:06,669] [INFO] [config.py:976:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false }
[2023-11-19 15:27:06,669] [INFO] [config.py:976:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-11-19 15:27:06,669] [INFO] [config.py:976:print] amp_enabled .................. False
[2023-11-19 15:27:06,669] [INFO] [config.py:976:print] amp_params ................... False
[2023-11-19 15:27:06,669] [INFO] [config.py:976:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 }
[2023-11-19 15:27:06,669] [INFO] [config.py:976:print] bfloat16_enabled ............. True
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] checkpoint_parallel_write_pipeline False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] checkpoint_tag_validation_enabled True
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] checkpoint_tag_validation_fail False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f407c008880>
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] communication_data_type ...... None
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] curriculum_enabled_legacy .... False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] curriculum_params_legacy ..... False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] data_efficiency_enabled ...... False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] dataloader_drop_last ......... False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] disable_allgather ............ False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] dump_state ................... False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] dynamic_loss_scale_args ...... None
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] eigenvalue_enabled ........... False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] eigenvalue_gas_boundary_resolution 1
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] eigenvalue_layer_num ......... 0
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] eigenvalue_max_iter .......... 100
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] eigenvalue_stability ......... 1e-06
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] eigenvalue_tol ............... 0.01
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] eigenvalue_verbose ........... False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] elasticity_enabled ........... False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null }
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] fp16_auto_cast ............... None
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] fp16_enabled ................. False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] fp16_master_weights_and_gradients False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] global_rank .................. 0
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] grad_accum_dtype ............. None
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] gradient_accumulation_steps .. 1
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] gradient_clipping ............ 0.8
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] gradient_predivide_factor .... 1.0
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] initial_dynamic_scale ........ 1
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] load_universal_checkpoint .... False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] loss_scale ................... 1.0
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] memory_breakdown ............. False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] mics_hierarchial_params_gather False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] mics_shard_size .............. -1
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null }
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] optimizer_legacy_fusion ...... False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] optimizer_name ............... None
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] optimizer_params ............. None
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] pld_enabled .................. False
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] pld_params ................... False
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] prescale_gradients ........... False
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] scheduler_name ............... None
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] scheduler_params ............. None
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] seq_parallel_communication_data_type torch.float32
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] sparse_attention ............. None
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] sparse_gradients_enabled ..... False
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] steps_per_print .............. inf
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] train_batch_size ............. 112
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] train_micro_batch_size_per_gpu 16
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] use_node_local_storage ....... False
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] wall_clock_breakdown ......... False
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] weight_quantization_config ... None
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] world_size ................... 7
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] zero_allow_untested_optimizer True
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] zero_enabled ................. True
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] zero_force_ds_cpu_optimizer .. True
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] zero_optimization_stage ...... 2
[2023-11-19 15:27:06,671] [INFO] [config.py:962:print_user_config] json = { "train_batch_size": 112, "train_micro_batch_size_per_gpu": 16, "gradient_accumulation_steps": 1, "zero_optimization": { "stage": 2, "offload_optimizer": { "device": "cpu", "nvme_path": null }, "offload_param": { "device": "cpu", "nvme_path": null }, "stage3_gather_16bit_weights_on_model_save": false }, "gradient_clipping": 0.8, "steps_per_print": inf, "bf16": { "enabled": true }, "fp16": { "enabled": false }, "zero_allow_untested_optimizer": true }
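As a sanity check on the numbers in that printout, the effective batch size follows the usual DeepSpeed relation train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × world_size, which is consistent here (16 × 1 × 7 = 112):

```python
# Values copied from the config printout above; just a consistency check.
train_micro_batch_size_per_gpu = 16
gradient_accumulation_steps = 1
world_size = 7
assert train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size == 112
```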
I don't know if this is correct. Why does the log still show Adam even though I don't use it anywhere in my code?
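For reference, the optimizer the log keeps mentioning is DeepSpeed's `DeepSpeedCPUAdam` (note the `zero_force_ds_cpu_optimizer .. True` and `offload_optimizer=...device='cpu'` lines in the config dump above, which is presumably why a CPU Adam implementation gets built even though the JSON names no optimizer). A minimal sketch of what those logged hyperparameters correspond to, using a toy model rather than my real training code:

```python
# Illustrative sketch only (toy model, not the actual training script from this issue).
# It maps the "Adam Optimizer #0 ... alpha=0.000005, betas=(0.900000, 0.999000),
# weight_decay=0.010000, adam_w=1" log lines onto the optimizer class named in the log.
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

model = torch.nn.Linear(512, 512)  # placeholder model

optimizer = DeepSpeedCPUAdam(
    model.parameters(),
    lr=5e-6,              # "alpha=0.000005"
    betas=(0.9, 0.999),   # "betas=(0.900000, 0.999000)"
    weight_decay=0.01,    # "weight_decay=0.010000"
    adamw_mode=True,      # "adam_w=1": decoupled (AdamW-style) weight decay
)
```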