microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] DeepSpeed ZeRO++ features aren't working #4886

Open pacman100 opened 11 months ago

pacman100 commented 11 months ago

**Describe the bug**
DeepSpeed ZeRO++ features aren't working:

  1. On a single node, passing `zero_hpz_partition_size`, `zero_quantized_gradients`, and `zero_quantized_weights` (see the config sketch after this list) leads to a forward-pass error with BF16. The exact issue is reported in https://github.com/microsoft/DeepSpeed/issues/4852.
  2. On a single node, passing `zero_hpz_partition_size` and `zero_quantized_gradients` works with BF16, but I don't notice any speedup at all.
  3. On a single node, passing `zero_hpz_partition_size`, `zero_quantized_gradients`, and `zero_quantized_weights` works with FP16, but I don't notice any speedup at all and only a 4% reduction in memory.
  4. On multi-node (2 nodes), passing `zero_hpz_partition_size`, `zero_quantized_gradients`, and `zero_quantized_weights` fails with FP16: the loss suddenly goes to inf and the loss scale keeps shrinking until it reaches 1, after which an error is raised.
  5. On multi-node (2 nodes), passing `zero_hpz_partition_size` and `zero_quantized_gradients` fails with BF16: the loss shoots up to 2409 at the start and then goes to inf.
  6. On multi-node (4 nodes), with and without Hybrid Sharding (`zero_hpz_partition_size: 8`):
     a. No speedup with Hybrid Sharding.
     b. Training loss curves are similar in both cases, unlike the issue https://github.com/microsoft/DeepSpeed/issues/4851.
     c. Eval loss is very high in spite of using the same seed; the only difference is `zero_hpz_partition_size: 8`. (Screenshot: eval loss comparison, 2023-12-21.)
     d. When redoing the above experiments with the Llama-70B model, Hybrid Sharding gave the error below in spite of decreasing the per-GPU batch size from 4 to 2 to 1:
    torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.5
    ncclUnhandledCudaError: Call to CUDA function failed.
    Last error:
    Cuda failure 'out of memory'
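
For reference, the three ZeRO++ knobs toggled above (`zero_hpz_partition_size` for hierarchical/hybrid weight sharding, `zero_quantized_weights` for quantized weight communication, `zero_quantized_gradients` for quantized gradient communication) live in the `zero_optimization` section of the DeepSpeed JSON config. A minimal sketch, assuming ZeRO stage 3 as in the linked config and with illustrative values only:

    {
      "zero_optimization": {
        "stage": 3,
        "zero_hpz_partition_size": 8,
        "zero_quantized_weights": true,
        "zero_quantized_gradients": true
      }
    }

`zero_hpz_partition_size` is typically set to the number of GPUs per node (8 here) so the secondary weight partition stays within a node; the two quantization flags default to false and were toggled independently for the individual experiments above.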

**To Reproduce**
Steps to reproduce the behavior:

  1. DeepSpeed config: https://github.com/pacman100/DHS-LLM-Workshop/blob/main/chat_assistant/training/configs/ds_config_z3.json. Add the flags `zero_hpz_partition_size`, `zero_quantized_gradients`, and `zero_quantized_weights` (as sketched above) as per the experiment being run.
  2. Accelerate config: https://github.com/pacman100/DHS-LLM-Workshop/blob/main/chat_assistant/training/configs/deepspeed_zeropp_config.yaml
  3. Launch command: https://github.com/pacman100/DHS-LLM-Workshop/blob/main/chat_assistant/training/run_deepspeed_zeropp.sh.
  4. Launch command on a multi-node setup:
    
    #!/bin/bash
    #SBATCH --job-name=ift_llama
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=1                 # Crucial - only 1 task per dist per node!
    ##SBATCH --mem-per-cpu=11G                  # Uncomment to enable "mix" use of GPUs across cluster users
    #SBATCH --requeue
    #SBATCH --gres=gpu:8
    #SBATCH --partition=cluster_name
    #SBATCH --output=/path/to/temp/logs/%x-%j.out
    #SBATCH --err=/path/to/temp/logs/%x-%j.err

    set -x -e

    # CHANGE HERE THE CONDA ENV AND ANY STARTUP SCRIPTS
    source ~/.bashrc
    cd /path/to/DHS-LLM-Workshop/chat_assistant/training
    git pull

    export NCCL_ASYNC_ERROR_HANDLING=1
    export WANDB_PROJECT=deepspeed_zeropp
    echo "START TIME: $(date)"

    # CHANGE TO CUMULATIVELY LOG OUTPUTS
    LOG_PATH="main_log.txt"

    GPUS_PER_NODE=8
    NNODES=$SLURM_NNODES
    NUM_PROCESSES=$(expr $NNODES \* $GPUS_PER_NODE)

    # so processes know who to talk to
    MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
    MASTER_PORT=6000

    # OTHER LAUNCHERS CAN BE USED HERE
    export LAUNCHER="accelerate launch \
        --config_file configs/deepspeed_zeropp_config.yaml \
        --main_process_ip $MASTER_ADDR \
        --main_process_port $MASTER_PORT \
        --machine_rank \$SLURM_PROCID \
        --num_processes $NUM_PROCESSES \
        --num_machines $NNODES \
        "

    # Note: it is important to escape $SLURM_PROCID since we want the srun on each node to evaluate this variable
    export PROGRAM="\
        train.py \
        --seed 100 --model_name "meta-llama/Llama-2-70b-hf" --dataset_name "HuggingFaceH4/ultrachat_200k" \
        --chat_template_format "chatml" --add_special_tokens False --append_concat_token False \
        --splits "train_sft,test_sft" --max_seq_len 2048 --num_train_epochs 1 \
        --logging_steps 5 --log_level "info" --logging_strategy "steps" \
        --evaluation_strategy "epoch" --save_strategy "epoch" \
        --push_to_hub --hub_private_repo True --hub_strategy "every_save" \
        --bf16 True --packing True --learning_rate 2e-5 --lr_scheduler_type "cosine" \
        --weight_decay 0.0 --warmup_ratio 0.1 --max_grad_norm 1.0 \
        --output_dir "llama-sft-ds-multinode-zpp" \
        --per_device_train_batch_size 1 --gradient_accumulation_steps 1 \
        --dataset_text_field "content" --use_flash_attn True \
        --gradient_checkpointing True --use_reentrant False \
        "

    export CMD="$LAUNCHER $PROGRAM"
    srun --jobid $SLURM_JOBID bash -c "$CMD" 2>&1 | tee -a $LOG_PATH
    echo "END TIME: $(date)"


**Expected behavior**
1. Hybrid Sharding (`zero_hpz_partition_size`) should result in a speedup on a multi-node setup (4 nodes in the experiments above).
2. Hybrid Sharding (`zero_hpz_partition_size`) should not result in OOM with the 70B model on a multi-node setup (4 nodes) where each node has 8x 80GB GPUs.
3. Hybrid Sharding (`zero_hpz_partition_size`) should result in the same eval loss as plain ZeRO-3 fine-tuning on a multi-node setup (4 nodes in the experiments above).
4. `zero_hpz_partition_size`, `zero_quantized_gradients`, and `zero_quantized_weights` should work with mixed-precision training in `BF16` (see the precision sketch after this list).
5. On multi-node, `zero_hpz_partition_size`, `zero_quantized_gradients`, and `zero_quantized_weights` should work with FP16/BF16 without the loss jumping to inf.
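
For context on items 4 and 5, the precision mode in these runs is selected through the standard `bf16`/`fp16` blocks of the DeepSpeed config. A minimal sketch with illustrative values (the linked config may instead fill these fields via the Hugging Face `auto` mechanism):

    {
      "bf16": {
        "enabled": true
      },
      "fp16": {
        "enabled": false,
        "loss_scale": 0,
        "initial_scale_power": 16,
        "loss_scale_window": 1000,
        "min_loss_scale": 1
      }
    }

With FP16 enabled, `loss_scale: 0` selects dynamic loss scaling, which is the mechanism behind the "Attempted loss scale ..., reducing to ..." messages quoted later in this thread: on repeated overflows the scale keeps shrinking until it reaches `min_loss_scale`, at which point DeepSpeed aborts the run.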

**ds_report output**

    DeepSpeed C++/CUDA extension op report

    NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

    JIT compiled ops requires ninja
    ninja .................. [OKAY]

    op name ................ installed .. compatible

    [WARNING] async_io requires the dev libaio .so object and headers but these were not found.
    [WARNING] async_io: please install the libaio-dev package with apt
    [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
    async_io ............... [NO] ....... [NO]
    fused_adam ............. [NO] ....... [OKAY]
    cpu_adam ............... [NO] ....... [OKAY]
    cpu_adagrad ............ [NO] ....... [OKAY]
    cpu_lion ............... [NO] ....... [OKAY]
    [WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
    evoformer_attn ......... [NO] ....... [NO]
    fused_lamb ............. [NO] ....... [OKAY]
    fused_lion ............. [NO] ....... [OKAY]
    inference_core_ops ..... [NO] ....... [OKAY]
    cutlass_ops ............ [NO] ....... [OKAY]
    quantizer .............. [NO] ....... [OKAY]
    ragged_device_ops ...... [NO] ....... [OKAY]
    ragged_ops ............. [NO] ....... [OKAY]
    random_ltd ............. [NO] ....... [OKAY]
    [WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
    [WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
    sparse_attn ............ [NO] ....... [NO]
    spatial_inference ...... [NO] ....... [OKAY]
    transformer ............ [NO] ....... [OKAY]
    stochastic_transformer . [NO] ....... [OKAY]
    transformer_inference .. [NO] ....... [OKAY]

    DeepSpeed general environment info:
    torch install path ............... ['/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch']
    torch version .................... 2.1.2+cu121
    deepspeed install path ........... ['/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/deepspeed']
    deepspeed info ................... 0.12.5, unknown, unknown
    torch cuda version ............... 12.1
    torch hip version ................ None
    nvcc version ..................... 12.1
    deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
    shared memory (/dev/shm) size .... 999.99 GB



**System info (please complete the following information):**
 - OS: Ubuntu 20.04.6 LTS
 - GPU count and types: 8x H100s per machine
 - Python version: 3.10.13

**Launcher context**
Accelerate launcher which internally uses the DeepSpeed launcher.

noanti commented 10 months ago

Same issue: 3 nodes, 8 V100s per node, FP16, with `zero_hpz_partition_size=8` and without setting `zero_quantized_gradients` and `zero_quantized_weights`. It keeps reporting:

OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2, reducing to 1

and finally

Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.