microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Some issues in the Twin-Flow feature provided by ZeRO-Offload++ #4775

Closed BingxuZhu closed 10 months ago

BingxuZhu commented 10 months ago

Hello, and thank you for your work on Twin-Flow offload. I tried running ds_pretrain_gpt_2.7B.sh from Megatron-DeepSpeed with the new "ratio" parameter under "offload_optimizer", sweeping its value from 0.0 to 1.0. During training, the CPU virtual memory usage was identical for every ratio from 0.0 to 0.4, and likewise identical for every ratio from 0.5 to 1.0. Below are the scripts and arguments I used, along with the observed CPU usage.

#ds_config_gpt_TEMPLATE.json

{
  "train_batch_size" : CONFIG_BATCH_SIZE,
  "train_micro_batch_size_per_gpu": CONFIG_MBSIZE,
  "steps_per_print": LOG_INTERVAL,

  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true,
      "ratio": 0.1
    }
  },

  "gradient_clipping": 1.0,
  "prescale_gradients":false,

  "fp16": {
    "enabled": CONFIG_FP16_ENABLED,
    "loss_scale": 0,
    "loss_scale_window": 500,
    "hysteresis": 2,
    "min_loss_scale": 1,
    "initial_scale_power": 11
  },

  "bf16": {
    "enabled": CONFIG_BF16_ENABLED
  },

  "wall_clock_breakdown" : false
}

If I set the ratio parameter to 0.0, 0.1, 0.2, 0.3, or 0.4, the CPU Virtual Memory in the output log is about 51 GB and the percent is about 27%. It also seems that CPU memory decreases after initializing the optimizer states. Why?

[2023-12-05 20:47:44,452] [INFO] [utils.py:802:see_memory_usage] Before creating fp16 partitions
[2023-12-05 20:47:44,453] [INFO] [utils.py:803:see_memory_usage] MA 0.63 GB         Max_MA 0.63 GB         CA 0.72 GB         Max_CA 1 GB 
[2023-12-05 20:47:44,453] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 34.79 GB, percent = 18.6%
[2023-12-05 20:47:45,070] [INFO] [utils.py:802:see_memory_usage] After creating fp16 partitions: 2
[2023-12-05 20:47:45,071] [INFO] [utils.py:803:see_memory_usage] MA 0.63 GB         Max_MA 0.63 GB         CA 0.63 GB         Max_CA 1 GB 
[2023-12-05 20:47:45,072] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 39.22 GB, percent = 20.9%
[2023-12-05 20:47:45,151] [INFO] [utils.py:802:see_memory_usage] Before creating fp32 partitions
[2023-12-05 20:47:45,151] [INFO] [utils.py:803:see_memory_usage] MA 0.63 GB         Max_MA 0.63 GB         CA 0.63 GB         Max_CA 1 GB 
[2023-12-05 20:47:45,152] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 34.84 GB, percent = 18.6%
[2023-12-05 20:47:45,222] [INFO] [utils.py:802:see_memory_usage] After creating fp32 partitions
[2023-12-05 20:47:45,223] [INFO] [utils.py:803:see_memory_usage] MA 1.88 GB         Max_MA 2.51 GB         CA 2.51 GB         Max_CA 3 GB 
[2023-12-05 20:47:45,224] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 34.84 GB, percent = 18.6%
[2023-12-05 20:47:45,595] [INFO] [utils.py:802:see_memory_usage] Before initializing optimizer states
[2023-12-05 20:47:45,596] [INFO] [utils.py:803:see_memory_usage] MA 1.88 GB         Max_MA 1.88 GB         CA 2.51 GB         Max_CA 3 GB 
[2023-12-05 20:47:45,597] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 52.16 GB, percent = 27.8%
[2023-12-05 20:47:47,492] [INFO] [utils.py:802:see_memory_usage] After initializing optimizer states
[2023-12-05 20:47:47,493] [INFO] [utils.py:803:see_memory_usage] MA 5.65 GB         Max_MA 5.65 GB         CA 6.27 GB         Max_CA 6 GB 
[2023-12-05 20:47:47,493] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 51.08 GB, percent = 27.3%
[2023-12-05 20:47:47,493] [INFO] [stage3.py:479:_setup_for_real_optimizer] optimizer state initialized
[2023-12-05 20:47:47,988] [INFO] [utils.py:802:see_memory_usage] After initializing ZeRO optimizer
[2023-12-05 20:47:47,989] [INFO] [utils.py:803:see_memory_usage] MA 6.58 GB         Max_MA 6.64 GB         CA 7.21 GB         Max_CA 7 GB 
[2023-12-05 20:47:47,989] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 51.1 GB, percent = 27.3%

Similarly, when I set the ratio parameter to 0.5, 0.6, 0.7, 0.8, 0.9, or 1.0, the CPU Virtual Memory in the output log is about 89 GB and the percent is about 47%.

[2023-12-05 20:24:23,817] [INFO] [utils.py:802:see_memory_usage] After creating fp32 partitions
[2023-12-05 20:24:23,819] [INFO] [utils.py:803:see_memory_usage] MA 0.63 GB         Max_MA 0.63 GB         CA 0.65 GB         Max_CA 1 GB 
[2023-12-05 20:24:23,819] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 44.74 GB, percent = 23.9%
[2023-12-05 20:24:24,362] [INFO] [utils.py:802:see_memory_usage] Before initializing optimizer states
[2023-12-05 20:24:24,363] [INFO] [utils.py:803:see_memory_usage] MA 0.63 GB         Max_MA 0.63 GB         CA 0.65 GB         Max_CA 1 GB 
[2023-12-05 20:24:24,364] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 63.86 GB, percent = 34.1%
[2023-12-05 20:24:27,146] [INFO] [utils.py:802:see_memory_usage] After initializing optimizer states
[2023-12-05 20:24:27,148] [INFO] [utils.py:803:see_memory_usage] MA 0.64 GB         Max_MA 0.64 GB         CA 0.65 GB         Max_CA 1 GB 
[2023-12-05 20:24:27,148] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 81.11 GB, percent = 43.3%
[2023-12-05 20:24:27,162] [INFO] [stage3.py:479:_setup_for_real_optimizer] optimizer state initialized
[2023-12-05 20:24:29,339] [INFO] [utils.py:802:see_memory_usage] After initializing ZeRO optimizer
[2023-12-05 20:24:29,340] [INFO] [utils.py:803:see_memory_usage] MA 1.57 GB         Max_MA 1.63 GB         CA 1.64 GB         Max_CA 2 GB 
[2023-12-05 20:24:29,340] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 89.22 GB, percent = 47.6%
#ds_pretrain_gpt_2.7B.sh

#!/bin/bash
DIR=`pwd`
SEQ_LEN=2048

MODEL_SIZE=2.7
NUM_LAYERS=32
HIDDEN_SIZE=2560
NUM_ATTN_HEADS=32
GLOBAL_BATCH_SIZE=512
LR=1.6e-4
MIN_LR=1.6e-5

TRAIN_TOKENS=300000000000

TRAIN_SAMPLES=$(( ${TRAIN_TOKENS} * 3 / ${SEQ_LEN} ))

EXIT_DURATION=30000000

WARMUP_TOKENS=375000000
LR_DECAY_TOKENS=260000000000

BATCH_SIZE=2

MP_SIZE=8

PP_SIZE=1
NUM_GPUS=8

EP_SIZE=1

# ......... default config (omitted) .........

TENSORBOARD_DIR="${OUTPUT_BASEPATH}/tensorboard/${NAME}_${host}_${current_time}"
mkdir -p ${TENSORBOARD_DIR} 

CHECKPOINT_PATH="${OUTPUT_BASEPATH}/checkpoint/${NAME}"

VOCAB_PATH=/home/wangzhigangcs/zbx/Megatron-DeepSpeed-2348eed9ab8f851fd366f869b62f4f643eb50b41/dataset/data/gpt2-vocab.json
MERGE_PATH=/home/wangzhigangcs/zbx/Megatron-DeepSpeed-2348eed9ab8f851fd366f869b62f4f643eb50b41/dataset/data/gpt2-merges.txt
# The public Pile dataset can be downloaded at https://mystic.the-eye.eu/public/AI/pile_neox/
DATA_BLEND=/home/wangzhigangcs/zbx/Megatron-DeepSpeed-2348eed9ab8f851fd366f869b62f4f643eb50b41/dataset/BookCorpusDataset_text_document/BookCorpusDataset_text_document

###############################################################################
data_options=" \
         --vocab-file ${VOCAB_PATH} \
         --merge-file ${MERGE_PATH} \
         --data-path ${DATA_BLEND} \
         --data-impl mmap"

megatron_options=" \
        --override-opt_param-scheduler \
        --adam-beta1 0.9 \
        --adam-beta2 0.95 \
        --tensor-model-parallel-size ${MP_SIZE} \
        --moe-expert-parallel-size ${EP_PARALLEL_SIZE} \
        --num-experts ${EP_SIZE} \
        --moe-loss-coeff ${MLC} \
        --moe-train-capacity-factor ${MOE_TRAIN_CAP_FACTOR} \
        --moe-eval-capacity-factor ${MOE_EVAL_CAP_FACTOR} \
        --moe-min-capacity ${MOE_MIN_CAP} \
        --init-method-std ${INIT_STD} \
        --lr-decay-tokens ${LR_DECAY_TOKENS} \
        --lr-warmup-tokens ${WARMUP_TOKENS} \
        --micro-batch-size ${BATCH_SIZE} \
        --exit-duration-in-mins ${EXIT_DURATION} \
        --rampup-batch-size 32 32 1953125 \
        --global-batch-size ${GLOBAL_BATCH_SIZE} \
        --num-layers ${NUM_LAYERS} \
        --hidden-size ${HIDDEN_SIZE} \
        --num-attention-heads ${NUM_ATTN_HEADS} \
        --seq-length ${SEQ_LEN} \
        --max-position-embeddings ${SEQ_LEN} \
        --train-tokens ${TRAIN_TOKENS} \
        --train-samples ${TRAIN_SAMPLES} \
        --lr ${LR} \
        --min-lr ${MIN_LR} \
        --lr-decay-style cosine \
        --split 98,2,0 \
        --log-interval ${LOG_INTERVAL} \
        --eval-interval ${EVAL_INTERVAL} \
        --eval-iters ${EVAL_ITERS} \
        --save-interval ${SAVE_INTERVAL} \
        --weight-decay 0.1 \
        --clip-grad 1.0 \
        --hysteresis 2 \
        --num-workers 0 \
        --fp16 \
        --load ${CHECKPOINT_PATH} \
        --save ${CHECKPOINT_PATH} \
        --tensorboard-queue-size 1 \
        --log-timers-to-tensorboard \
        --timing-log-level 1 \
        --no-pipeline-parallel \
        --cpu-optimizer \
        --distributed-timeout-minutes 60 \
        --log-batch-size-to-tensorboard \
        --log-validation-ppl-to-tensorboard \
        --tensorboard-dir ${TENSORBOARD_DIR}"

if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
megatron_options="${megatron_options} \
        --checkpoint-activations"
fi

if [[ $EP_SIZE -gt 1 ]]; then
megatron_options="${megatron_options} \
        --create-moe-param-group"
fi

if [ "${MOE_DROP_TOKEN}" = "false" ]; then
megatron_options="${megatron_options} \
        --disable-moe-token-dropping"
fi

template_json="ds_config_gpt_TEMPLATE.json"
config_json="ds_config_gpt_${NAME}.json"
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
    | sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
    | sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
    | sed "s/ZERO_STAGE/3/" \
    | sed "s/PRESCALE_GRAD/true/" \
    | sed "s/CONFIG_FP16_ENABLED/false/" \
    | sed "s/CONFIG_BF16_ENABLED/true/" \
    | sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
    | sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
    | sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
    | sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
      > ${config_json}

deepspeed_options=" \
            --deepspeed \
            --deepspeed_config ${config_json} \
            --pipeline-model-parallel-size ${PP_SIZE}"

# Currently MoE is not compatible with pipeline parallel
if [[ $EP_SIZE -gt 1 ]]; then
deepspeed_options="${deepspeed_options} \
        --no-pipeline-parallel"
fi

if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
        --deepspeed-activation-checkpointing"
fi

run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${OUTPUT_BASEPATH}/log/${NAME}_${host}_${current_time}.log"
echo ${run_cmd}
eval ${run_cmd}
set +x

Regarding ds_pretrain_gpt_2.7B.sh: compared with the 350M script from the ZeRO-Offload++ tutorial in the offload_pp directory, I only changed the model size and the necessary dataset configuration. I don't understand why this happens. I am eager to use the Twin-Flow partial-offload feature, so I hope you can help. Thank you.

This is my lab environment: 8x Tesla V100-SXM2-16GB, Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz, 187 GB total CPU memory, DeepSpeed 0.12.4.

GuanhuaWang commented 10 months ago

Hi @BingxuZhu ,

Thanks for your analysis of the output log. This is actually not a bug in our code, but a consequence of the coarse way CPU virtual memory usage is measured.

Basically, we report CPU virtual memory usage with the psutil Python package, as in the code line here.

psutil.virtual_memory() monitors the global, node-wide CPU virtual memory usage, not just the memory of the single DeepSpeed process, so the measurement is too coarse for tracking the DeepSpeed process's own CPU virtual memory usage. That is also why, in your log above, there is already around 20% CPU virtual memory usage before the optimizer states are initialized.
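
For reference, here is a minimal sketch of the difference between the node-wide number that psutil reports and a per-process measurement (it assumes only that psutil is installed; the helper function is hypothetical and not part of DeepSpeed):

import psutil

def report_cpu_memory():
    # Node-wide view, comparable to what the see_memory_usage() log lines above show:
    # it counts memory used by every process on the machine.
    vm = psutil.virtual_memory()
    print(f"node used = {vm.used / 2**30:.2f} GB, percent = {vm.percent}%")

    # Per-process view: resident set size of the current process only.
    # This is the finer-grained number to watch when sweeping the offload ratio.
    rss = psutil.Process().memory_info().rss
    print(f"this process RSS = {rss / 2**30:.2f} GB")

report_cpu_memory()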

Hope it answers your questions.

BingxuZhu commented 10 months ago

I really appreciate your reply, @GuanhuaWang. Thank you so much!

So is there a better solution, or a way to monitor the DeepSpeed processes at a finer granularity?

BingxuZhu commented 10 months ago

Hi @GuanhuaWang,

I found another problem while testing the ratio parameter with the most basic example provided by Hugging Face Transformers (which provides the DeepSpeed integration, link here).

When the ratio parameter is set anywhere from 0.0 to 0.9, each GPU runs at full memory and training takes about 2 minutes. When the ratio parameter is set to 1.0, each GPU uses only about 60% of its memory and training takes about 12 minutes. Here is the script I use to run it.

train.sh

#!/bin/bash
deepspeed --hostfile=hostfile --num_nodes=1 --num_gpus 8   examples/pytorch/translation/run_translation.py \
--deepspeed tests/deepspeed/ds_config_zero3.json \
--model_name_or_path t5-3b --per_device_train_batch_size 1 \
--output_dir output_dir1 --overwrite_output_dir --fp16 \
--do_train --max_train_samples 300 --num_train_epochs 1 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: " \
--source_lang en --target_lang ro \
--learning_rate 5e-7

ds_config_zero3.json (the only field I change between runs is "ratio" under "offload_optimizer")

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 5e-7,
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true,
            "ratio": 0.0      ##only change it
        }

    },
    "prescale_gradients": false,

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

This seems like unreasonable behavior for the offload ratio parameter. Even granting that, as you mentioned above, the psutil package only gives a coarse, node-wide measure of CPU virtual memory, the difference in training time is still contradictory. Why does the offload ratio parameter not take effect? The expected behavior is that different ratios correspond to different CPU virtual memory usage and different GPU memory footprints.
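
As a rough back-of-envelope of that expectation (purely illustrative: it assumes Adam keeps about 12 bytes of fp32 state per parameter and that the ratio is the fraction of that state placed in CPU memory; both are assumptions on my part, not confirmed in this thread):

# Hypothetical estimate of the CPU memory taken by offloaded optimizer state.
params = 2.7e9                # the 2.7B-parameter GPT model from the script above
state_bytes_per_param = 12    # fp32 master weights + Adam momentum + variance (assumption)

for ratio in (0.0, 0.25, 0.5, 0.75, 1.0):
    cpu_gb = ratio * params * state_bytes_per_param / 2**30
    print(f"ratio={ratio:4}: expect ~{cpu_gb:5.1f} GB of optimizer state on the CPU side")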

I'm guessing that the Hugging Face DeepSpeed integration doesn't work well with the latest version of the DeepSpeed library. Can Twin-Flow only be used with Megatron-DeepSpeed?