huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Why does Accelerate with DeepSpeed always use "DeepSpeed Final Optimizer = DeepSpeedCPUAdam"? #2169

Closed mintuos closed 9 months ago

mintuos commented 11 months ago

System Info

I use Accelerate with DeepSpeed, and I have read this page: https://huggingface.co/docs/accelerate/usage_guides/deepspeed

I built a config file and pass it with `--config_file`, like this:

```bash
accelerate launch train/main_ds.py --config_file ./accelerate_config.yaml
```
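For reference, the usage shown in the Accelerate launch docs places `--config_file` before the training script (arguments after the script name are forwarded to the script itself). A sketch using the same paths as above:

```bash
# documented ordering: launcher flags first, then the training script
accelerate launch --config_file ./accelerate_config.yaml train/main_ds.py
```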

Here is the config:

```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: ./stage2.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main_ds
mixed_precision: bf16
num_machines: 1
num_processes: 7
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

And the DeepSpeed JSON (`stage2.json`) is:

```json
{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": "auto",
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
```
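Since several values in this JSON are `"auto"`, it can be useful to print the configuration that Accelerate actually hands to DeepSpeed at runtime. A minimal sketch, assuming the `deepspeed_plugin.deepspeed_config` attribute exposed by Accelerate's DeepSpeed integration (the attribute path is an assumption, not something shown in this issue):

```python
# Minimal sketch: print the merged DeepSpeed config on the main process only.
# `accelerator.state.deepspeed_plugin.deepspeed_config` is assumed to hold the
# dict that will be passed to deepspeed.initialize().
import json

from accelerate import Accelerator

accelerator = Accelerator()
plugin = accelerator.state.deepspeed_plugin  # assumed attribute path
if accelerator.is_main_process:
    print(json.dumps(plugin.deepspeed_config, indent=2, default=str))
```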

I don't use Adam in my code, but the log shows this:

```text
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.8985447883605957 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.80269718170166 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.916116952896118 seconds
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.999000), weight_decay=0.010000, adam_w=1
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.999000), weight_decay=0.010000, adam_w=1
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.999000), weight_decay=0.010000, adam_w=1
Adam Optimizer #0 is created with AVX512 arithmetic capability.
```

```text
[2023-11-19 15:27:06,396] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-11-19 15:27:06,396] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-11-19 15:27:06,669] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedCPUAdam
[2023-11-19 15:27:06,669] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-11-19 15:27:06,669] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2023-11-19 15:27:06,669] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-06], mom=[(0.9, 0.99)]
[2023-11-19 15:27:06,669] [INFO] [config.py:972:print] DeepSpeedEngine configuration:
[2023-11-19 15:27:06,669] [INFO] [config.py:976:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false }
[2023-11-19 15:27:06,669] [INFO] [config.py:976:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-11-19 15:27:06,669] [INFO] [config.py:976:print] amp_enabled .................. False
[2023-11-19 15:27:06,669] [INFO] [config.py:976:print] amp_params ................... False
[2023-11-19 15:27:06,669] [INFO] [config.py:976:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 }
[2023-11-19 15:27:06,669] [INFO] [config.py:976:print] bfloat16_enabled ............. True
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] checkpoint_parallel_write_pipeline False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] checkpoint_tag_validation_enabled True
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] checkpoint_tag_validation_fail False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f407c008880>
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] communication_data_type ...... None
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] curriculum_enabled_legacy .... False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] curriculum_params_legacy ..... False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] data_efficiency_enabled ...... False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] dataloader_drop_last ......... False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] disable_allgather ............ False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] dump_state ................... False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] dynamic_loss_scale_args ...... None
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] eigenvalue_enabled ........... False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] eigenvalue_gas_boundary_resolution 1
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] eigenvalue_layer_num ......... 0
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] eigenvalue_max_iter .......... 100
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] eigenvalue_stability ......... 1e-06
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] eigenvalue_tol ............... 0.01
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] eigenvalue_verbose ........... False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] elasticity_enabled ........... False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null }
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] fp16_auto_cast ............... None
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] fp16_enabled ................. False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] fp16_master_weights_and_gradients False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] global_rank .................. 0
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] grad_accum_dtype ............. None
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] gradient_accumulation_steps .. 1
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] gradient_clipping ............ 0.8
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] gradient_predivide_factor .... 1.0
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] initial_dynamic_scale ........ 1
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] load_universal_checkpoint .... False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] loss_scale ................... 1.0
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] memory_breakdown ............. False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] mics_hierarchial_params_gather False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] mics_shard_size .............. -1
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null }
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] optimizer_legacy_fusion ...... False
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] optimizer_name ............... None
[2023-11-19 15:27:06,670] [INFO] [config.py:976:print] optimizer_params ............. None
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] pld_enabled .................. False
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] pld_params ................... False
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] prescale_gradients ........... False
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] scheduler_name ............... None
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] scheduler_params ............. None
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] seq_parallel_communication_data_type torch.float32
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] sparse_attention ............. None
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] sparse_gradients_enabled ..... False
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] steps_per_print .............. inf
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] train_batch_size ............. 112
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] train_micro_batch_size_per_gpu 16
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] use_node_local_storage ....... False
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] wall_clock_breakdown ......... False
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] weight_quantization_config ... None
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] world_size ................... 7
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] zero_allow_untested_optimizer True
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] zero_enabled ................. True
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] zero_force_ds_cpu_optimizer .. True
[2023-11-19 15:27:06,671] [INFO] [config.py:976:print] zero_optimization_stage ...... 2
[2023-11-19 15:27:06,671] [INFO] [config.py:962:print_user_config] json = { "train_batch_size": 112, "train_micro_batch_size_per_gpu": 16, "gradient_accumulation_steps": 1, "zero_optimization": { "stage": 2, "offload_optimizer": { "device": "cpu", "nvme_path": null }, "offload_param": { "device": "cpu", "nvme_path": null }, "stage3_gather_16bit_weights_on_model_save": false }, "gradient_clipping": 0.8, "steps_per_print": inf, "bf16": { "enabled": true }, "fp16": { "enabled": false }, "zero_allow_untested_optimizer": true }
```
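For readability, the `zero_optimization` block from the effective config printed at the end of this log is (content unchanged, only re-indented):

```json
"zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "nvme_path": null },
    "offload_param": { "device": "cpu", "nvme_path": null },
    "stage3_gather_16bit_weights_on_model_save": false
}
```

Note that `offload_optimizer.device` is `cpu` here and the log also prints `zero_force_ds_cpu_optimizer .. True`; as far as I understand DeepSpeed's ZeRO-Offload, offloading the optimizer to CPU is the case where DeepSpeed builds a `DeepSpeedCPUAdam`, which may be why it appears even though the script never requests Adam.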

I don't know if this is correct. Why does the log still show Adam even though I don't use it at all?



### Information

- [ ] The official example scripts
- [X] My own modified scripts

### Tasks

- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)

### Reproduction

```python
import torch  # added so the snippet is self-contained; may also come from train.datasets below
import torch.nn as nn
import torch.utils.data as Data  # assumed: `Data` may instead be provided by train.datasets
from sklearn.model_selection import train_test_split
from torch.utils.tensorboard import SummaryWriter
from accelerate import Accelerator
from lion_pytorch import Lion

from train.datasets import *  # provides MyDataSet and related data utilities
from train.model import Transformer
from train.evaluate import evaluate
from train.model_save import kmodel, epoch_log, read_epoch
import setproctitle

accelerator = Accelerator()
device = accelerator.device

# -------------------------data----------------------------------------
enc_inputs, dec_inputs = torch.load('data/encodingData.pth')
enc_inputs_train, enc_inputs_test, dec_inputs_train, dec_inputs_test = train_test_split(
    enc_inputs, dec_inputs, test_size=0.2, random_state=42)
train_data = MyDataSet(enc_inputs_train, dec_inputs_train)
test_data = MyDataSet(enc_inputs_test, dec_inputs_test)
train_loader = Data.DataLoader(train_data, batch_size=16, shuffle=True)
test_loader = Data.DataLoader(test_data, batch_size=16, shuffle=True)
# ----------------------------------------------------------------------
# ------------------------------model----------------------------------------
model = Transformer()
optimizer = Lion(model.parameters(), lr=5e-6, weight_decay=1e-2)  # custom optimizer, not Adam
criterion = nn.CrossEntropyLoss(ignore_index=0)
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_loader, test_loader)
for epoch in range(oldepoch, 500):  # `oldepoch` is defined elsewhere in the full script
    ...  # training loop truncated in the original report
```
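To see which optimizer object actually comes back from `prepare` under DeepSpeed, one can print its type right after the `accelerator.prepare(...)` call above. A small sketch continuing that script; the inner `.optimizer` attribute of Accelerate's DeepSpeed wrapper is an assumption, hence the `getattr` fallback:

```python
# Appended directly after the accelerator.prepare(...) call above.
accelerator.print("outer optimizer type:", type(optimizer).__name__)
inner = getattr(optimizer, "optimizer", None)  # assumed attribute of the wrapper
accelerator.print("inner optimizer type:", type(inner).__name__)
```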

### Expected behavior

I read this in the docs:

> Important code changes when using DeepSpeed Config File
>
> DeepSpeed Optimizers and Schedulers. For more information on these, see the [DeepSpeed Optimizers](https://deepspeed.readthedocs.io/en/latest/optimizers.html) and [DeepSpeed Schedulers](https://deepspeed.readthedocs.io/en/latest/schedulers.html) documentation. We will look at the changes needed in the code when using these.

So I want to use this case:

> b. Custom Optim + Custom Scheduler: The case when both optimizer and scheduler keys are absent in the DeepSpeed config file. In this situation, no code changes are needed from the user and this is the case when using integration via DeepSpeed Plugin. In the above example we can see that the code remains unchanged if the optimizer and scheduler keys are absent in the DeepSpeed config file.
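As a sketch of how the two documented cases differ in code (case (a) is shown only for contrast; `DummyOptim`/`DummyScheduler` come from `accelerate.utils`, while the `ds_config_defines_optimizer` flag and the reuse of `model` from the reproduction script above are placeholders, not something in this issue):

```python
from accelerate.utils import DummyOptim, DummyScheduler
from lion_pytorch import Lion

ds_config_defines_optimizer = False  # hypothetical flag: True for case (a), False for case (b)

if ds_config_defines_optimizer:
    # (a) "optimizer"/"scheduler" keys present in the DeepSpeed JSON:
    # pass dummies so DeepSpeed builds the real optimizer/scheduler from the config file.
    optimizer = DummyOptim(model.parameters(), lr=5e-6)
    lr_scheduler = DummyScheduler(optimizer)
else:
    # (b) both keys absent (the case in this issue): keep the custom optimizer unchanged.
    optimizer = Lion(model.parameters(), lr=5e-6, weight_decay=1e-2)
```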

So I deleted the optimizer section from the `deepspeed_config_file`.

But I want to know whether the Adam shown here is correct. Thank you!
github-actions[bot] commented 10 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

muellerzr commented 10 months ago

cc @pacman100

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.