[2023-11-23 17:00:35,408] [INFO] [utils.py:785:see_memory_usage] After Building Model
[2023-11-23 17:00:35,409] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB Max_MA 0.46 GB CA 0.76 GB Max_CA 1 GB
[2023-11-23 17:00:35,409] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 140.73 GB, percent = 14.0%
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 6972248064
> learning rate decay style: cosine
DeepSpeed is enabled.
[2023-11-23 17:00:35,412] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.0, git-hash=unknown, git-branch=unknown
[2023-11-23 17:00:35,421] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: True
[2023-11-23 17:00:35,422] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-11-23 17:00:35,422] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
Traceback (most recent call last):
File "/Megatron-DeepSpeed-master-A100/pretrain_gpt.py", line 338, in <module>
pretrain(train_valid_test_datasets_provider,
File "/Megatron-DeepSpeed-master-A100/megatron/training.py", line 135, in pretrain
model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
File "/Megatron-DeepSpeed-master-A100/megatron/training.py", line 579, in setup_model_and_optimizer
model, optimizer, _, opt_param_scheduler = deepspeed.initialize(
File "/opt/conda/lib/python3.10/site-packages/deepspeed/__init__.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 310, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1196, in _configure_optimizer
raise ZeRORuntimeException(msg)
deepspeed.runtime.zero.utils.ZeRORuntimeException: You are using ZeRO-Offload with a client provided optimizer (<class 'apex.optimizers.fused_adam.FusedAdam'>) which in most cases will yield poor performance. Please either use deepspeed.ops.adam.DeepSpeedCPUAdam or set an optimizer in your ds-config (https://www.deepspeed.ai/docs/config-json/#optimizer-parameters). If you really want to use a custom optimizer w. ZeRO-Offload and understand the performance impacts you can also set <"zero_force_ds_cpu_optimizer": false> in your configuration file.
[2023-11-23 17:00:39,667] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 255) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.2.0.dev20230912+cu118', 'console_scripts', 'torchrun')())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
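The failure itself is the ZeRORuntimeException above: ZeRO-Offload is enabled while the training script hands DeepSpeed an apex FusedAdam instance. The exception message names the accepted workarounds: build the optimizer with deepspeed.ops.adam.DeepSpeedCPUAdam, declare an optimizer in the DeepSpeed config instead of passing one from the script, or set "zero_force_ds_cpu_optimizer": false and accept the performance impact. Below is a minimal sketch of a ZeRO stage 3, bf16, CPU-offload config dict illustrating the last two options; the batch size and Adam hyperparameters are assumptions, not values from this run.

```python
import json

# Sketch of a ZeRO stage-3 + CPU-offload config addressing the
# ZeRORuntimeException. Hyperparameter values below are illustrative
# assumptions, not taken from the failing run.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,          # assumed
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
    # Workaround A: keep the client-provided FusedAdam and silence the check,
    # accepting the performance impact the exception warns about.
    "zero_force_ds_cpu_optimizer": False,
    # Workaround B (alternative to A): remove the key above, stop passing a
    # client optimizer from the script, and let DeepSpeed construct one; with
    # CPU offload it uses DeepSpeedCPUAdam internally.
    # "optimizer": {
    #     "type": "Adam",
    #     "params": {"lr": 1e-4, "betas": [0.9, 0.95], "weight_decay": 0.1},
    # },
}

# Write the config to the JSON file passed via --deepspeed_config
# ("ds_config_zero3_offload.json" is a hypothetical name).
with open("ds_config_zero3_offload.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

Workaround A is the smaller change when the training script keeps building FusedAdam itself; either way, the point of the exception is that CPU offload pairs poorly with a GPU-side fused Adam.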
Training with bf16 and ZeRO stage 3 causes this error. The script:
And the log info: