microsoft / LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs
https://aka.ms/GeneralAI
MIT License

[MiniLLM] sft training loss of llama-7b did not decrease with multi nodes. #91

Closed cailinhang closed 1 year ago

cailinhang commented 1 year ago

I tried to use a similar dataset, alpaca-zh, to SFT llama-7b on 16 x 32G V100 GPUs (gpu_per_node=8, node_num=2).
The script I use is scripts/llama/sft/sft_7B.sh, but the training loss did not decrease when I used --deepspeed_config ${BASE_PATH}/configs/deepspeed/ds_config_zero2.json. Even if I change the learning rate and weight_decay, there is no difference: the train loss does not decrease, and the val RougeL score decreases with training.

So I switched to using only 8 GPUs (one node) to SFT llama-7b. I had to change the deepspeed config to train llama-7b on one node (8 GPUs), because it runs out of memory with the config above. The new config I use to reduce memory is as follows:

{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu"
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },
    "zero_force_ds_cpu_optimizer": false,
    "zero_allow_untested_optimizer": true,
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "initial_scale_power": 11,
        "loss_scale_window": 5000,
        "hysteresis": 4
    },
    "wall_clock_breakdown": false
}

When I use only a single node (8 V100 GPUs) to run this script, the training loss of llama-7b decreases normally.

Besides, the SFT of gpt-base / gpt-xl / opt-1.5b (trained on 8 GPUs) is normal, but the SFT of opt-13b (trained on 16 GPUs) has the same problem as llama-7b (trained on 16 GPUs).

So I guess this has something to do with the multi-node training.

donglixp commented 1 year ago

@cailinhang Is the issue proposed for MiniLLM?

donglixp commented 1 year ago

@t1101675

cailinhang commented 1 year ago

@cailinhang Is the issue proposed for MiniLLM?

Yes, it is for MiniLLM.

t1101675 commented 1 year ago

Thanks for reporting. I will check this.

t1101675 commented 1 year ago

@cailinhang Can you post the script to run the multi-node training?

cailinhang commented 1 year ago

run the multi-node training?

Sure. I copied scripts/llama/sft/sft_7B.sh to a new script, scripts/llama/sft/sft_7B_node1.sh, and made some changes so that each node is assigned a different NODE_RANK to enable multi-node training.

Node 0 runs sft_7B.sh and node 1 runs sft_7B_node1.sh.

The following script is scripts/llama/sft/sft_7B_node1.sh. The only difference between the two scripts is NODE_RANK=0 for node 0 and NODE_RANK=1 for node 1. The model I use is the sft/llama-7B model provided in the project.

#! /bin/bash

#MASTER_ADDR=localhost
MASTER_ADDR=xxxxx # master ip of node 0
MASTER_PORT=${2-2012}
#NNODES=1
NNODES=2
NODE_RANK=1 # set the node rank 0 or 1
GPUS_PER_NODE=${3-8} 

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE \
                  --nnodes $NNODES \
                  --node_rank $NODE_RANK \
                  --master_addr $MASTER_ADDR \
                  --master_port $MASTER_PORT"

# model
BASE_PATH=${1-"/home/MiniLLM"}

CKPT_NAME="llama-7B"
CKPT="${BASE_PATH}/checkpoints/${CKPT_NAME}/"

ckpt_base_path=/mnt/llama/train/

ckpt="sft/llama-7B"

ckpt=$ckpt_base_path"/"$ckpt
echo "ckpt=="$ckpt
CKPT=$ckpt

# data
#DATA_DIR="${BASE_PATH}/processed_data/dolly/full/llama/"
DATA_DIR="${BASE_PATH}/processed_data/alpaca_zh/full/llama/" # chinese alpaca training data
# hp
BATCH_SIZE=1
LR=0.00001

GRAD_ACC=2
EVAL_BATCH_SIZE=8
# length
MAX_LENGTH=512
# runtime
SAVE_PATH="${BASE_PATH}/results/llama/train/sft"
# seed
SEED=10
SEED_ORDER=10

OPTS=""
# model
OPTS+=" --base-path ${BASE_PATH}"
OPTS+=" --model-path ${CKPT}"
OPTS+=" --ckpt-name ${CKPT_NAME}"
OPTS+=" --n-gpu ${GPUS_PER_NODE}"
OPTS+=" --model-type llama"
OPTS+=" --gradient-checkpointing"
# data
OPTS+=" --data-dir ${DATA_DIR}"
OPTS+=" --num-workers 0"
OPTS+=" --dev-num 1000"
# hp
OPTS+=" --lr ${LR}"
OPTS+=" --batch-size ${BATCH_SIZE}"
OPTS+=" --eval-batch-size ${EVAL_BATCH_SIZE}"
OPTS+=" --gradient-accumulation-steps ${GRAD_ACC}"
OPTS+=" --warmup-iters 0"
OPTS+=" --lr-decay-style cosine"
#OPTS+=" --weight-decay 1e-2"
OPTS+=" --weight-decay 5e-2" # clh

OPTS+=" --clip-grad 1.0"
OPTS+=" --epochs 10"
# length
OPTS+=" --max-length ${MAX_LENGTH}"
OPTS+=" --max-prompt-length 256"
# runtime
OPTS+=" --do-train"
OPTS+=" --do-valid"
OPTS+=" --eval-gen"
OPTS+=" --save-interval -1"
OPTS+=" --eval-interval -1"
OPTS+=" --log-interval 4"
OPTS+=" --mid-log-num 1"
OPTS+=" --save ${SAVE_PATH}"
# seed
OPTS+=" --seed ${SEED}"
OPTS+=" --seed-order ${SEED_ORDER}"
# deepspeed
OPTS+=" --deepspeed"
OPTS+=" --deepspeed_config ${BASE_PATH}/configs/deepspeed/ds_config_zero2.json" 

# type
OPTS+=" --type lm"
# gen
OPTS+=" --do-sample"
OPTS+=" --top-k 0"
OPTS+=" --top-p 1.0"
OPTS+=" --temperature 1.0"

export NCCL_DEBUG=""
export NCCL_IB_GID_INDEX=3
export WANDB_DISABLED=True
export TF_CPP_MIN_LOG_LEVEL=3
export PYTHONPATH=${BASE_PATH}
CMD="torchrun ${DISTRIBUTED_ARGS} ${BASE_PATH}/finetune.py ${OPTS} $@"

echo ${CMD}
echo "PYTHONPATH=${PYTHONPATH}"
mkdir -p ${SAVE_PATH}
CODE_BASE=HF ${CMD}

As mentioned above, for single-node (8 V100 GPU) training of llama-7b, I had to add

"offload_optimizer": {
            "device": "cpu"
        }

in the deepspeed config to reduce memory. The training loss decreases normally from 1.8 to 1.1~1.3 in epoch 0 and then decreases further, and the eval RougeL is normal. But when I use 2 nodes (16 V100 GPUs), the training loss does not seem to decrease at all. The training and eval data I use are from https://huggingface.co/datasets/c-s-ale/alpaca-gpt4-data-zh; I randomly sample 15k training examples and 500 validation examples from it (a rough sampling sketch is shown further below).

LMOps/minillm/data/alpaca_zh]# ll
raw.jsonl 
valid.jsonl

I renamed the .jsonl files to .json because GitHub does not support uploading .jsonl attachments.

Besides, to read or write JSON files containing Chinese text, I had to use encoding='utf-8-sig' when reading the JSON and ensure_ascii=False when calling json.dump, to avoid UTF-8 errors:

with open(os.path.join(eval_dir, "answers.jsonl"), "w", encoding='utf-8-sig') as f:
    for resp in responses:
        f.write(json.dumps({"text": resp}, ensure_ascii=False) + "\n")

The alpaca-zh data I used: raw.json, valid.json (attached).
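
Roughly, the split was produced along these lines (a sketch only; file names are placeholders, using the same encoding workarounds as above):

import json
import random

# Load the full alpaca-gpt4-data-zh file (placeholder path).
with open("alpaca_gpt4_data_zh.json", encoding="utf-8-sig") as f:
    data = json.load(f)

# Randomly sample 15k training and 500 validation examples.
random.seed(10)
random.shuffle(data)
train, valid = data[:15000], data[15000:15500]

# Write jsonl, keeping Chinese characters readable with ensure_ascii=False.
with open("raw.jsonl", "w", encoding="utf-8") as f:
    for x in train:
        f.write(json.dumps(x, ensure_ascii=False) + "\n")
with open("valid.jsonl", "w", encoding="utf-8") as f:
    for x in valid:
        f.write(json.dumps(x, ensure_ascii=False) + "\n")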

cailinhang commented 1 year ago

And to process the alpaca-zh data, there is some code you need.

The processing of alpaca-zh is similar to that of dolly: process_data_alpaca_zh.py and process_data_alpaca_zh.sh.

process_alpaca_zh.zip

t1101675 commented 1 year ago

I see. The problem may be caused by the way the multi-node training is launched. Using torchrun alone cannot start valid multi-node training here; generally, we start the training with deepspeed. We have uploaded an example file to run multi-node training, which is verified on our machine with 2 x 16 32G V100. More instructions can be found in the README. You can try this script, and please let me know if you encounter any problems.
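
For reference, launching with the deepspeed launcher usually means running a single command on one node together with a hostfile listing all nodes. A minimal sketch (the hostfile path and hostnames are placeholders; the actual sft_7B_mn.sh may differ in detail):

# hostfile, one line per node (placeholder hostnames):
#   node-0 slots=8
#   node-1 slots=8
deepspeed --hostfile ${BASE_PATH}/configs/hostfile \
          --master_port ${MASTER_PORT} \
          ${BASE_PATH}/finetune.py ${OPTS}

Unlike the torchrun setup above, this is launched from a single node, and DeepSpeed starts the workers on the other hosts over ssh.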

cailinhang commented 1 year ago

I tried the new script scripts/llama/sft/sft_7B_mn.sh to train llama-7b on 2 x 8 V100 GPUs; however, the training loss still does not decrease. It still fluctuates between 1.5 and 2.0, as before.
Besides, the single-node training of llama-7b finished, and the training loss decreased to 0.6 over 10 epochs of training.

On my machines, I have to add export NCCL_IB_GID_INDEX=3 to sft_7B_mn.sh to avoid NCCL connection errors between nodes. One phenomenon that may not matter: when I use the new script, there is a KeyError because CODE_BASE is not in os.environ in finetune.py, so I commented out this line to avoid the error.


if __name__ == "__main__":
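    # CODE_BASE is not set in os.environ when training is launched via the deepspeed script, so this line is commented out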
    #print(os.environ["CODE_BASE"])
    main()

Here are the log files of my training on alpaca-zh.
log_llama_7b_sft_16gpu_loss_no_decrease.txt log_llama7b_sft_8gpu.txt

cailinhang commented 1 year ago

When I used scripts/llama/sft/sft_7B_mn.sh with the deepspeed config

"offload_optimizer": {
            "device": "cpu"
        }

the multi-node training of llama-7b seems normal, and so does that of llama-13b. The Rouge score on the eval set gradually increases.

The following image shows the training loss of llama-13b finetuning on alpaca-zh. The sharp drops in the training loss mark the end of each training epoch.

(image: training loss curve of llama-13b SFT on alpaca-zh)

Still, when I use CPU offload, the training is much slower. For SFT on llama-13b, it takes about 10 hours per epoch with 2 x 8 GPUs (the finetuning data is only about 15k examples, which is not huge). I guess the MiniLLM phase would be much slower.

Maybe LoRA could be used to accelerate finetuning, but I find it not easy to apply PEFT directly to this project. Could you please give us some advice or an example of how to add LoRA to MiniLLM?

t1101675 commented 1 year ago

We have added an example of applying LoRA for training. More contributions are welcome!
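
For readers who land here, a generic PEFT-style LoRA setup for a LLaMA causal LM looks roughly like the sketch below (the checkpoint path, hyperparameters, and target modules are illustrative assumptions, not necessarily what the repo's example uses):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Load the base model (placeholder path).
model = AutoModelForCausalLM.from_pretrained("checkpoints/llama-7B")

# Wrap it with LoRA adapters; only the adapter weights are trained.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                  # adapter rank (assumed)
    lora_alpha=32,                         # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # LLaMA attention projections (assumed)
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# The wrapped model can then be passed to the existing finetuning / DeepSpeed setup.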