Mr-lonely0 opened 1 month ago
Since the final EE layer is located in Stage 2, subsequent pipeline stages do not contain an EE layer, hence there are no parameters to optimize in those later stages. You only need to set tune_exit_pipeline_parallel_size to 2 to address this issue.
Additionally, bear in mind that after fine-tuning with the aforementioned approach, your final output files will only contain parameters from the first two pipeline stages. You will need to manually merge the parameter folders of the last two pipeline stages from the original checkpoint path with the folders of the first two stages generated by the fine-tuning process to obtain a complete checkpoint.
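With TP=1 and PP=4, a Megatron-style checkpoint typically stores one mp_rank_00_00X folder per pipeline stage, so the merge amounts to copying the last two stage folders across. A minimal sketch, assuming that folder layout (the exact folder names depend on your checkpoint, so verify them before copying):

```shell
# merge_stages ORIG TUNED: copy the untouched later pipeline stages (002, 003)
# from the original checkpoint folder into the fine-tuned one, yielding a
# complete 4-stage checkpoint. The mp_rank_00_00X folder names are an
# assumption based on the usual Megatron layout with TP=1, PP=4.
merge_stages() {
    local orig="$1" tuned="$2"
    for stage in 002 003; do
        cp -r "${orig}/mp_rank_00_${stage}" "${tuned}/"
    done
}
```

It would be called with the iteration folders of your own checkpoints, e.g. `merge_stages $LOAD_PATH/iter_0000000 $CHECKPOINT_PATH/iter_0020000`.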
I modified the script llama2_7B_1_exit_mlp_pt.sh as you said (set tune_exit_pipeline_parallel_size to 2), but I'm still encountering the same error. Could you provide more details?
Looking for your early response :)
#!/bin/bash
PROJECT_NAME=EE-TUNE
GROUP_NAME=llama-2-7B-chat-1-EXIT-pt
CURRENT_TIME=$(date "+%m%d-%H%M")
MASTER_NAME=${CURRENT_TIME}
export CUDA_DEVICE_MAX_CONNECTIONS=1
export OMP_NUM_THREADS=4
# Checkpoint configuration
LOAD_PATH=/data3/lk/EE-LLM/model/ee_llm_format/llama-2-7b-chat # your checkpoint path
TOKENIZER_PATH=/data3/lk/llm/model/Llama-2-7b-chat-hf/tokenizer.model # your tokenizer path
CHECKPOINT_PATH=/data3/lk/EE-LLM/model/checkpoints # checkpoint save path
# Data configuration
DATA_HOME=
DATASET_ARXIV=${DATA_HOME}/redpajama-arxiv/all
DATASET_BOOKS=${DATA_HOME}/redpajama-book/all
DATASET_C4=${DATA_HOME}/redpajama-c4/all
DATASET_CC=${DATA_HOME}/redpajama-cc/all
DATASET_STACKEXCHANGE=${DATA_HOME}/redpajama-pile-stackexchange/all
DATASET_CODE=${DATA_HOME}/redpajama-stack-code/all
DATASET_WIKIPEDIA=${DATA_HOME}/redpajama-wiki/all
DATASET_PILE_EUROPARL=${DATA_HOME}/the-pile-europarl/all
DATASET_PILE_FREELAW=${DATA_HOME}/the-pile-freelaw/all
DATASET_PILE_HACKERNEWS=${DATA_HOME}/the-pile-hackernews/all
DATASET_PILE_NIH=${DATA_HOME}/the-pile-nih/all
DATASET_PILE_PHILPAPER=${DATA_HOME}/the-pile-philpaper/all
DATASET_PILE_PMA=${DATA_HOME}/the-pile-pubmed-abstract/all
DATASET_PILE_PMC=${DATA_HOME}/the-pile-pubmed-central/all
DATASET_PILE_USPTO=${DATA_HOME}/the-pile-uspto/all
DATA_PATH="\
0.0362 ${DATASET_ARXIV} \
0.0657 ${DATASET_BOOKS} \
0.2264 ${DATASET_C4} \
0.4491 ${DATASET_CC} \
0.0246 ${DATASET_STACKEXCHANGE} \
0.0810 ${DATASET_CODE} \
0.0548 ${DATASET_WIKIPEDIA} \
0.0010 ${DATASET_PILE_EUROPARL} \
0.0162 ${DATASET_PILE_FREELAW} \
0.0006 ${DATASET_PILE_HACKERNEWS} \
0.0005 ${DATASET_PILE_NIH} \
0.0006 ${DATASET_PILE_PHILPAPER} \
0.0065 ${DATASET_PILE_PMA} \
0.0318 ${DATASET_PILE_PMC} \
0.0050 ${DATASET_PILE_USPTO} \
"
NLAYERS=32
HIDDEN=4096
HEADS=32
SEQ=2048
FFN_SIZE=11008
TP=1
PP=4 # pipeline model parallel size (4 stages)
MICRO_BATCH=4 # micro batch size per GPU
GLOBAL_BATCH=16
MASTER_ADDR=127.0.0.1
MASTER_PORT=5901
WORLD_SIZE=1
RANK=0
NPROC_PER_NODE=4 # number of processes (GPUs) per node
TRAIN_ITER=40000
EVAL_INTERVAL=50000
SAVE_INTERVAL=20000
DIST_ARGS="
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT \
--nproc_per_node $NPROC_PER_NODE \
--nnodes $WORLD_SIZE \
--node_rank $RANK \
"
GPT_ARGS="
--tensor-model-parallel-size $TP \
--pipeline-model-parallel-size $PP \
--query-key-layer-scaling \
--num-layers $NLAYERS \
--hidden-size $HIDDEN \
--num-attention-heads $HEADS \
--seq-length $SEQ \
--max-position-embeddings $SEQ \
--micro-batch-size $MICRO_BATCH \
--global-batch-size $GLOBAL_BATCH \
--lr 0.0001 \
--train-iters $TRAIN_ITER \
--min-lr 1.0e-5 \
--lr-warmup-fraction .01 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--adam-eps 1e-5 \
--clip-grad 1.0 \
--bf16 \
--disable-bias-linear \
--use-flash-attn \
--normalization RMSNorm \
--position-embedding-type rope \
--swiglu \
--untie-embeddings-and-output-weights \
--padded-vocab-size 32000 \
--ffn-hidden-size $FFN_SIZE \
--finetune \
--tune-exit \
--untie-exit-output-weights \
--use-exit-norm \
--use-exit-mlp \
--tune-exit-pipeline-parallel-size 2 \
--exit-layer-nums 10 \
"
DATA_ARGS="
--data-path $DATA_PATH \
--tokenizer-type Llama2Tokenizer \
--tokenizer-model $TOKENIZER_PATH \
--split 990,9,1 \
"
# OUTPUT_ARGS_BAK="
# --log-interval 10 \
# --log-timers-to-tracker \
# --save-interval $SAVE_INTERVAL \
# --eval-interval $EVAL_INTERVAL \
# --eval-iters 1 \
# --wandb-project $PROJECT_NAME \
# --wandb-group $GROUP_NAME \
# --wandb-exp-name $MASTER_NAME \
# "
OUTPUT_ARGS="
--log-interval 10 \
--log-timers-to-tracker \
--save-interval $SAVE_INTERVAL \
--eval-interval $EVAL_INTERVAL \
--eval-iters 1 \
"
CUR_DIR=$(cd $(dirname "$0") && pwd)
MEGATRON_ROOT_PATH=$(cd "$CUR_DIR/../../.." && pwd)
cd $MEGATRON_ROOT_PATH
torchrun $DIST_ARGS \
pretrain_early_exit_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--load $LOAD_PATH \
--save $CHECKPOINT_PATH
After investigation, this indeed is a bug, and we will address it in future updates. The bug arises when using --exit-layer-nums 10 because only Stage 2 contains optimizable EE parameters, while all other pipeline stages do not. This error occurs even when --tune-exit-pipeline-parallel-size 2 is added, because Stage 1 still lacks optimizable EE parameters. There are two possible solutions:
1. Keep --tune-exit-pipeline-parallel-size 2 and add an exit layer to Stage 1 as well (via --exit-layer-nums), so that both tuned stages contain EE parameters.
2. Move the exit layer into Stage 1 and set --tune-exit-pipeline-parallel-size 1, ensuring all tuned pipeline stages have optimizable EE layer parameters.

Thanks a lot!! I'll try tuning all EE layers instead of one, and I'm looking forward to your future updates. Your work is truly commendable!
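As a sanity check on the stage arithmetic: with NLAYERS=32 and PP=4 from the script above, each pipeline stage holds 8 transformer layers, so exit layer 10 lands in Stage 2 while Stage 1 contains no EE parameters. A sketch assuming an even layer split and 1-based stage numbering, matching the discussion:

```shell
NLAYERS=32; PP=4; EXIT_LAYER=10
LAYERS_PER_STAGE=$((NLAYERS / PP))                     # 32 / 4 = 8
STAGE=$(( (EXIT_LAYER - 1) / LAYERS_PER_STAGE + 1 ))   # (10 - 1) / 8 + 1 = 2
echo "exit layer ${EXIT_LAYER} is in pipeline stage ${STAGE} of ${PP}"
```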
Describe the bug
I use Llama-2 7B, and the bug occurs when I start stage 2 of EE-Tuning.

To Reproduce
Here is the llama2_7B_1_exit_mlp_pt.sh I modified (shown above).