Open · pacman100 opened this issue 11 months ago
Same issue. 3 nodes, 8 V100s per node, FP16, with `zero_hpz_partition_size=8` and without setting `zero_quantized_gradients` or `zero_quantized_weights`. It keeps reporting:

```
OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2, reducing to 1
```

and finally:

```
Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
```
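For reference, both messages come from DeepSpeed's dynamic loss scaler, which is configured in the `fp16` section of the DeepSpeed config. A minimal sketch using the documented default values (illustrative only, not taken from this thread):

```json
{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  }
}
```

With dynamic scaling (`"loss_scale": 0`), repeated overflows keep halving the scale; once it reaches `min_loss_scale`, the run exits with the "cannot decrease scale anymore" error quoted above.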
**Describe the bug**

DeepSpeed ZeRO++ features aren't working:

1. `zero_hpz_partition_size`, `zero_quantized_gradients`, `zero_quantized_weights` leads to a forward pass error with BF16. Exact issue reported in https://github.com/microsoft/DeepSpeed/issues/4852.
2. `zero_hpz_partition_size`, `zero_quantized_gradients` works with BF16 but I don't notice any speedup at all.
3. `zero_hpz_partition_size`, `zero_quantized_gradients`, `zero_quantized_weights` fails with FP16 as the loss suddenly goes to inf and the scaling factor keeps reducing until it reaches 1, after which the error is raised.
4. `zero_hpz_partition_size`, `zero_quantized_gradients` fails with BF16 as the loss suddenly shoots to 2409 at the start and then goes to inf.

**To Reproduce**

Steps to reproduce the behavior:

Enable/disable `zero_hpz_partition_size`, `zero_quantized_gradients` and `zero_quantized_weights` in the DeepSpeed config (`configs/deepspeed_zeropp_config.yaml` referenced by the launch script below) as per the experiment being done; see the sketch after this paragraph.
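For context, a minimal sketch of what these ZeRO++ toggles look like in a DeepSpeed JSON config (the actual `configs/deepspeed_zeropp_config.yaml` is not reproduced in this issue; the values below are illustrative, with `zero_hpz_partition_size` set to the number of GPUs per node):

```json
{
  "zero_optimization": {
    "stage": 3,
    "zero_hpz_partition_size": 8,
    "zero_quantized_weights": true,
    "zero_quantized_gradients": true
  },
  "bf16": {
    "enabled": true
  }
}
```

Each experiment above corresponds to flipping a subset of these three keys (and switching between the BF16 and FP16 sections) while keeping the rest of the setup fixed.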
Then launch training via Slurm with the following script:

```bash
set -x -e

# CHANGE HERE THE CONDA ENV AND ANY STARTUP SCRIPTS
source ~/.bashrc
cd /path/to/DHS-LLM-Workshop/chat_assistant/training
git pull

export NCCL_ASYNC_ERROR_HANDLING=1
export WANDB_PROJECT=deepspeed_zeropp
echo "START TIME: $(date)"

# CHANGE TO CUMULATIVELY LOG OUTPUTS
LOG_PATH="main_log.txt"

GPUS_PER_NODE=8
NNODES=$SLURM_NNODES
NUM_PROCESSES=$(expr $NNODES \* $GPUS_PER_NODE)

# so processes know who to talk to
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000

# OTHER LAUNCHERS CAN BE USED HERE
export LAUNCHER="accelerate launch \
    --config_file configs/deepspeed_zeropp_config.yaml \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --machine_rank \$SLURM_PROCID \
    --num_processes $NUM_PROCESSES \
    --num_machines $NNODES \
    "
# Note: it is important to escape $SLURM_PROCID since we want the
# srun on each node to evaluate this variable

export PROGRAM="\
train.py \
--seed 100 \
--model_name "meta-llama/Llama-2-70b-hf" \
--dataset_name "HuggingFaceH4/ultrachat_200k" \
--chat_template_format "chatml" \
--add_special_tokens False \
--append_concat_token False \
--splits "train_sft,test_sft" \
--max_seq_len 2048 \
--num_train_epochs 1 \
--logging_steps 5 \
--log_level "info" \
--logging_strategy "steps" \
--evaluation_strategy "epoch" \
--save_strategy "epoch" \
--push_to_hub \
--hub_private_repo True \
--hub_strategy "every_save" \
--bf16 True \
--packing True \
--learning_rate 2e-5 \
--lr_scheduler_type "cosine" \
--weight_decay 0.0 \
--warmup_ratio 0.1 \
--max_grad_norm 1.0 \
--output_dir "llama-sft-ds-multinode-zpp" \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 \
--dataset_text_field "content" \
--use_flash_attn True \
--gradient_checkpointing True \
--use_reentrant False "

export CMD="$LAUNCHER $PROGRAM"

srun --jobid $SLURM_JOBID bash -c "$CMD" 2>&1 | tee -a $LOG_PATH

echo "END TIME: $(date)"
```
```
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
```
```
DeepSpeed general environment info:
torch install path ............... ['/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch']
torch version .................... 2.1.2+cu121
deepspeed install path ........... ['/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.12.5, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 999.99 GB
```