tongxuluo / prts

https://llm-stacking.github.io/

Prts config examples #2

Open spliew opened 3 days ago

spliew commented 3 days ago

Thanks for the great work! Could you provide more details on how to set the configs to run experiments on different growth operators as described in the paper (other than G_stack)?

tongxuluo commented 2 days ago

Thank you for your interest!

An example G_learn operator config:

{
    "src_config_name": "6L2048H",
    "trg_config_name": "24L2048H"
}
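
(In these config names, 6L2048H denotes a 6-layer model with hidden size 2048, so this config asks G_learn to map a 6-layer source onto a 24-layer target of the same width.)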

An example script:

srun python pretrain/run_pretrain.py \
    --num_nodes=2 \
    --model_name=24L2048H \
    --name=24L2048H \
    --method=ligo \
    --config_path=/path/to/your/config/ligo_6L_24L.json \
    --out_dir=/path/to/your/out_dir \
    --train_data_dir=/path/to/your/training_data \
    --src_init_path=/path/to/your/source_model.pth \
    --devices=8 \
    --global_batch_size=1024 \
    --learning_rate=7e-5 \
    --min_lr=7e-6 \
    --micro_batch_size=8 \
    --max_step=100 \
    --warmup_steps=50 \
    --log_step_interval=1 \
    --eval_iters=10000 \
    --save_step_interval=100 \
    --eval_step_interval=100000 \
    --weight_decay=1e-1 \
    --beta1=0.9 \
    --beta2=0.95 \
    --grad_clip=1.0 \
    --decay_lr=True

G_learn trains a hypernetwork that transfers the source model's weights to the target model. After G_learn finishes, you must continually pretrain the grown model; note that this second stage uses --method=scratch and initializes from the G_learn output via --src_init_path:

srun python pretrain/run_pretrain.py \
    --num_nodes=16 \
    --model_name=24L2048H \
    --name=24L2048H \
    --method=scratch \
    --out_dir=/path/to/your/out_dir \
    --train_data_dir=/path/to/your/training_data \
    --src_init_path=/path/to/your/model_after_G_learn.pth \
    --devices=8 \
    --global_batch_size=1024 \
    --learning_rate=3e-4 \
    --min_lr=3e-5 \
    --micro_batch_size=8 \
    --max_step=300000 \
    --warmup_steps=3000 \
    --log_step_interval=1 \
    --eval_iters=10000 \
    --save_step_interval=5000 \
    --eval_step_interval=5000 \
    --weight_decay=1e-1 \
    --beta1=0.9 \
    --beta2=0.95 \
    --grad_clip=1.0 \
    --decay_lr=True

Moreover, here is an example config for widthwise G_zero:

{
    "src_config_name": "24L1024H",
    "trg_config_name": "24L2048H",
    "src_path": "/path/to/your/source_model.pth"
}
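
(Widthwise G_zero grows the hidden size from 1024 to 2048 while keeping the depth at 24 layers; as the name suggests, the newly added width is presumably zero-initialized around the source checkpoint given in src_path.)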

Then run the following script to start your pretraining:

srun python pretrain/run_pretrain.py \
    --num_nodes=4 \
    --model_name=24L2048H \
    --name=24L2048H \
    --method=zero \
    --config_path=/path/to/your/config/zero_1024H_2048H.json \
    --out_dir=/path/to/your/out_dir \
    --train_data_dir=/path/to/your/training_data \
    --src_init_path=/path/to/your/source_model.pth \
    --devices=8 \
    --global_batch_size=1024 \
    --learning_rate=3e-4 \
    --min_lr=3e-5 \
    --micro_batch_size=8 \
    --max_step=300000 \
    --warmup_steps=3000 \
    --log_step_interval=1 \
    --eval_iters=10000 \
    --save_step_interval=5000 \
    --eval_step_interval=5000 \
    --weight_decay=1e-1 \
    --beta1=0.9 \
    --beta2=0.95 \
    --grad_clip=1.0 \
    --decay_lr=True

We apologize for not having integrated depthwise G_zero yet. Currently, depthwise G_zero is still a standalone Python script: you can run G_zero_depthwise.py with lemon=True to apply depthwise G_zero to a Hugging Face model. After obtaining the weights, you can use scripts/convert_hf_checkpoint.py to convert them into the lit-gpt format for continual pretraining; a sketch of this two-step workflow follows.
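
A minimal sketch of that workflow, assuming illustrative flag names (check the argparse definitions in G_zero_depthwise.py and scripts/convert_hf_checkpoint.py for the exact arguments):

# Step 1 (hypothetical flags): apply depthwise G_zero to a Hugging Face
# checkpoint, running with lemon=True as mentioned above.
python G_zero_depthwise.py \
    --src_path=/path/to/your/hf_source_model \
    --out_dir=/path/to/your/hf_grown_model \
    --lemon=True

# Step 2: convert the grown Hugging Face weights into the lit-gpt format
# so they can be passed to run_pretrain.py via --src_init_path.
python scripts/convert_hf_checkpoint.py \
    --checkpoint_dir=/path/to/your/hf_grown_model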

For G_random, an example config is:

{
    "src_config_name": "6L2048H",
    "trg_config_name": "24L2048H",
    "src_path": "/path/to/your/source_model.pth",
    "grow_step": 5000
}
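
(Here, grow_step presumably sets the pretraining step at which the 6-layer source is grown into the 24-layer target; treat 5000 as an illustrative value.)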

You can then run your pretraining with the following script:

srun python pretrain/run_pretrain.py \
    --num_nodes=16 \
    --model_name=24L2048H \
    --name=24L2048H \
    --method=msg \
    --config_path=/path/to/your/config/msg_6L_24L.json \
    --out_dir=/path/to/your/out_dir \
    --train_data_dir=/path/to/your/training_data \
    --devices=8 \
    --global_batch_size=1024 \
    --learning_rate=3e-4 \
    --min_lr=3e-5 \
    --micro_batch_size=8 \
    --max_step=300000 \
    --warmup_steps=3000 \
    --log_step_interval=1 \
    --eval_iters=10000 \
    --save_step_interval=5000 \
    --eval_step_interval=5000 \
    --weight_decay=1e-1 \
    --beta1=0.9 \
    --beta2=0.95 \
    --grad_clip=1.0 \
    --decay_lr=True