[x] Loading a checkpoint with a different tensor parallel size for ZeRO-0 (no need to support a different data parallel size, since the first model replica's optimizer states are simply replicated across data parallel ranks, regardless of the data parallel size)
[x] Loading a checkpoint with different tensor parallel and data parallel sizes for ZeRO-1 (see the resharding sketch below)
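To make the difference between the two items concrete, here is a minimal, self-contained sketch of the resharding idea (plain PyTorch, illustrative assumptions only; this is not nanotron's actual loading code): under ZeRO-0 every data parallel rank holds the full optimizer state, so only the tensor parallel shards need to be merged and re-split, while ZeRO-1 additionally partitions the optimizer state across data parallel ranks, so a change in dp also requires merging and re-partitioning those slices.

```python
import torch

# Illustrative optimizer state ("exp_avg") for one parameter, sharded over
# tp=4 ranks along the parameter's partitioned dimension. Shapes and names
# are made up for the example.
old_tp, new_tp = 4, 2
full_state = torch.arange(16, dtype=torch.float32)       # pretend full exp_avg
old_tp_shards = list(torch.chunk(full_state, old_tp))     # what a tp=4 checkpoint holds

# Changing the tensor parallel size: merge the old tp shards, then re-split
# for the new tp size (needed for both ZeRO-0 and ZeRO-1).
merged = torch.cat(old_tp_shards)
new_tp_shards = list(torch.chunk(merged, new_tp))
assert torch.equal(torch.cat(new_tp_shards), full_state)

# ZeRO-0: every dp rank keeps the full (tp-sharded) optimizer state, so the
# first replica's shards can be loaded as-is for any dp size.

# ZeRO-1: each dp rank only owns a 1/dp slice of the flattened optimizer
# state, so a different dp size means merging the old slices and
# re-partitioning them.
old_dp, new_dp = 2, 4
old_dp_slices = list(torch.chunk(merged, old_dp))          # what a dp=2, ZeRO-1 checkpoint holds
new_dp_slices = list(torch.chunk(torch.cat(old_dp_slices), new_dp))
assert torch.equal(torch.cat(new_dp_slices), full_state)
```

In the real checkpoints the per-parameter states (e.g. Adam's exp_avg / exp_avg_sq) are split along each parameter's tensor-parallel partition dimension rather than being a flat 1-D tensor, but the merge-then-resplit idea is the same.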
I started training a model with tp=4, dp=2, pp=1 for 1000 steps (call this the first config), then resumed training from the first config's checkpoint at iteration 20 (by changing the checkpoint's latest.txt to 20) with a new config tp=2, dp=2, pp=1 (call this the second config) and continued to 1000 steps. For both ZeRO-0 and ZeRO-1, the resumed run reproduced the first run's training losses.
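For reference, the two configs are intended to differ only in the tensor parallel degree. A sketch of the relevant block is below; the key names follow nanotron's YAML config layout as I understand it, and the authoritative files are the ones referenced in the reproduce script.

```yaml
# first config (checkpoint generation)
parallelism:
  dp: 2
  tp: 4
  pp: 1

# second config (resume from iteration 20)
parallelism:
  dp: 2
  tp: 2
  pp: 1
```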
#!/bin/bash
# Reproduce script: generate a checkpoint with one parallelism config,
# then resume from an earlier iteration with a smaller tensor parallel size.
NANOTRON_DIR="/fsx/phuc/projects/nanotron"

# Parse --zero_stage=<0|1> from the first argument
ZERO_STAGE=${1#*=}
if [[ $ZERO_STAGE != 0 && $ZERO_STAGE != 1 ]]; then
    echo "Invalid or no --zero_stage argument provided. Please use --zero_stage=0 or --zero_stage=1."
    exit 1
fi

if [ "$ZERO_STAGE" -eq 0 ]; then
    CKP_SAVED_PATH="/fsx/phuc/checkpoints/nanotron-optim-loading/no_zero1_dp_2_tp4_pp1"
    CONTINUE_AT_ITERATION=20
    CKP_CONFIG_PATH="$NANOTRON_DIR/downloads/debug_optim/zero0/config_tiny_llama_dp_2_tp4_pp1_with_no_zero.yaml"
    CONTINUE_CONFIG_PATH="$NANOTRON_DIR/downloads/debug_optim/zero0/config_tiny_llama_dp_2_tp2_pp1_with_no_zero.yaml"
else
    CKP_SAVED_PATH="/fsx/phuc/checkpoints/nanotron-optim-loading/zero1/zero1_dp2_tp4_pp1"
    CONTINUE_AT_ITERATION=20
    CKP_CONFIG_PATH="$NANOTRON_DIR/downloads/debug_optim/zero1/config_llama_dp2_tp4_pp1_with_zero1.yaml"
    CONTINUE_CONFIG_PATH="$NANOTRON_DIR/downloads/debug_optim/zero1/config_llama_dp2_tp2_pp1_with_zero1.yaml"
fi

# Define the output file
OUTPUT_FILE="training_output_zero_stage_$ZERO_STAGE.log"

# First command - generate a checkpoint
echo "Running checkpoint generation with dp=2, tp=4, pp=1" | tee -a "$OUTPUT_FILE"
USE_FAST=1 CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 "$NANOTRON_DIR/run_train.py" --config-file "$CKP_CONFIG_PATH" 2>&1 | tee -a "$OUTPUT_FILE"

# Check whether the training command (not tee) was successful
if [ "${PIPESTATUS[0]}" -eq 0 ]; then
    echo "Checkpoint generation successful, proceeding to continue training." | tee -a "$OUTPUT_FILE"
else
    echo "Checkpoint generation failed, aborting script." | tee -a "$OUTPUT_FILE"
    exit 1
fi

# Now we modify the checkpoint's latest.txt to $CONTINUE_AT_ITERATION
# so that we can continue training from that iteration and compare
# the training losses between the two runs
echo "Modifying the checkpoint's latest.txt to $CONTINUE_AT_ITERATION" | tee -a "$OUTPUT_FILE"
echo "$CONTINUE_AT_ITERATION" > "$CKP_SAVED_PATH/latest.txt"

# Second command - continue training from the checkpoint
echo "Continuing training from the checkpoint with dp=2 tp=2 pp=1" | tee -a "$OUTPUT_FILE"
USE_FAST=1 CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=4 "$NANOTRON_DIR/run_train.py" --config-file "$CONTINUE_CONFIG_PATH" 2>&1 | tee -a "$OUTPUT_FILE"

# Check whether the training command (not tee) was successful
if [ "${PIPESTATUS[0]}" -eq 0 ]; then
    echo "Training continuation successful." | tee -a "$OUTPUT_FILE"
else
    echo "Training continuation failed." | tee -a "$OUTPUT_FILE"
fi
Reproduce script
/fsx/phuc/projects/nanotron/downloads/debug_optim/test_loading_optimizer.sh --zero_stage=0
/fsx/phuc/projects/nanotron/downloads/debug_optim/test_loading_optimizer.sh --zero_stage=1
Training logs
ZeRO-0: dp=2, tp=4, pp=1
ZeRO-0: dp=2, tp=2, pp=1
ZeRO-1: dp=2, tp=4, pp=1
ZeRO-1: dp=2, tp=2, pp=1
Log files
training_output_zero_stage_0.log
training_output_zero_stage_1.log
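A small helper like the sketch below can be used to diff the per-iteration losses of the two runs. It is hypothetical and not part of the repro script, and the regexes are assumptions about the log format (lines containing something like `iteration: <n> / ...` and `lm_loss: <value>`); adjust them if nanotron's actual output differs.

```python
import re
import sys

# Hypothetical helper (not part of the repro script) to compare per-iteration
# training losses between two runs. Assumed log format per training line:
#   "iteration: 25 / 1000 | ... | lm_loss: 10.3 | ..."
# Adjust ITER_RE / LOSS_RE if the actual nanotron output differs.
ITER_RE = re.compile(r"iteration:\s*(\d+)")
LOSS_RE = re.compile(r"lm_loss:\s*([0-9.eE+-]+)")


def losses(path: str) -> dict[int, float]:
    """Map iteration number -> loss for every line where both fields appear."""
    out: dict[int, float] = {}
    with open(path) as f:
        for line in f:
            it, loss = ITER_RE.search(line), LOSS_RE.search(line)
            if it and loss:
                out[int(it.group(1))] = float(loss.group(1))
    return out


def compare(first_log: str, resumed_log: str, atol: float = 1e-4) -> None:
    """Report how many common iterations match between the two logs."""
    a, b = losses(first_log), losses(resumed_log)
    common = sorted(set(a) & set(b))
    mismatches = [(i, a[i], b[i]) for i in common if abs(a[i] - b[i]) > atol]
    print(f"compared {len(common)} common iterations, {len(mismatches)} mismatches")
    for i, x, y in mismatches[:10]:
        print(f"  iteration {i}: {x} vs {y}")


if __name__ == "__main__":
    compare(sys.argv[1], sys.argv[2])
```

Since both runs of a given ZeRO stage append to the same training_output_zero_stage_*.log, it is easiest to point this at the per-run logs listed under "Training logs" above.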