jzhang38 / EasyContext

Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware.
Apache License 2.0

Model stopped updating after 300-400 steps. #21

Closed Bostoncake closed 1 week ago

Bostoncake commented 2 months ago

I am training with the following script:

export PYTORCH_CUDA_ALLOC_CONF='max_split_size_mb:1024'

accelerate launch \
--config_file  accelerate_configs/single_node.yaml \
train.py \
--batch-size 1 \
--gradient-accumulate-every 4 \
--output-dir ./output/7B_32K_bs_1M_rope_1M_step_1000_lr_2e-5 \
--wandb EasyContext \
--max-train-steps 2500  \
--learning-rate 2e-5  \
--dataset yaofu/slimpajama-per-source-length-upsample \
--model meta-llama/Llama-2-7b-hf  \
--seq-length 32768 \
--rope-theta 500000 \
--parallel_mode data_parallel

The difference between the above command and train_scripts/EasyContext-1M-Llama-2-7B.sh is that I changed --max-train-steps and --rope-theta. Additionally, I modified the if block at lines 161-165 of train.py to save the model every 100 steps (I set --save_interval=100):

if accelerator.sync_gradients:
    progress_bar.update(1)
    if loss_log is not None:
        progress_bar.set_postfix(loss_log)
    completed_steps += 1

    if completed_steps % args.save_interval == 0:
        ckpt_save_dir = f"{args.output_dir}/step{completed_steps}"
        os.makedirs(ckpt_save_dir, exist_ok=True)
        accelerator.wait_for_everyone()

        state_dict = accelerator.get_state_dict(model)

        accelerator.unwrap_model(model).save_pretrained(
            f"{ckpt_save_dir}",
            is_main_process=accelerator.is_main_process,
            save_function=accelerator.save,
            state_dict=state_dict,
        )

        accelerator.print(f"Saved model to {ckpt_save_dir}")

        accelerator.wait_for_everyone()

All saved models are evaluated with the latest version of lm_eval. I found that all models saved from step 400 onward (step 400 included) are identical. That is, cmp <model_at_step_400-0000X-of-0000X.safetensors> <model_at_step_T(T>400)-0000X-of-0000X.safetensors> reports no differences. Moreover, when evaluated with lm_eval, these models give identical results on every dataset tested (including MMLU, TQA, Hellaswag, Winogrande, etc.).

The models are all trained on 8 A800 (80GB) GPUs, and the issue can be reproduced with a different model configuration (YaRN, i.e. LLaMA-2 with a different positional embedding). I wonder if you have any insights into this issue. Thanks!
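For reference, here is a minimal way to cross-check the "identical checkpoints" finding beyond raw cmp, comparing the saved shards tensor-by-tensor with safetensors (a sketch; the paths assume the output directory and step{N} subdirectories created by the saving code above):

import glob
import torch
from safetensors.torch import load_file

def load_all_shards(ckpt_dir):
    # Merge every *.safetensors shard in a step{N} checkpoint directory.
    tensors = {}
    for shard in sorted(glob.glob(f"{ckpt_dir}/*.safetensors")):
        tensors.update(load_file(shard))
    return tensors

a = load_all_shards("./output/7B_32K_bs_1M_rope_1M_step_1000_lr_2e-5/step400")
b = load_all_shards("./output/7B_32K_bs_1M_rope_1M_step_1000_lr_2e-5/step500")
assert a.keys() == b.keys()
print("identical:", all(torch.equal(a[k], b[k]) for k in a))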

Bostoncake commented 2 months ago

An update: I tried --max-train-steps 1000, and it seems that the model stops updating somewhere between step 100 and step 200. Checkpoints saved from step 200 onward (step 200 included) are identical when checked with cmp.

Bostoncake commented 2 months ago

An update: it seems to have something to do with the scheduler in accelerate_configs/zero3_offload.json. I set --max-train-steps 100 and printed the learning rate in every iteration; the result is here: rank7_input.log

As you can see, the learning rate decays to 0 from step 13 onward, so nearly 90% of the training process is wasted.
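For completeness, the per-iteration logging can be as simple as the following fragment inside the training loop (a sketch; optimizer here is whatever accelerator.prepare returns, and depending on the DeepSpeed wrappers the current LR may instead need to be read from the prepared scheduler):

if accelerator.sync_gradients:
    # Standard PyTorch location of the current learning rate; the
    # Accelerate/DeepSpeed optimizer wrapper normally proxies param_groups.
    current_lr = optimizer.param_groups[0]["lr"]
    accelerator.print(f"step {completed_steps}: lr = {current_lr}")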

Bostoncake commented 2 months ago

I suggest using a constant schedule, following arXiv:2402.10171. That is, set accelerate_configs/zero3_offload.json as follows (with warmup_min_lr equal to warmup_max_lr and zero warmup steps, WarmupLR keeps the learning rate constant):

{
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 2e-5,
      "warmup_max_lr": 2e-5,
      "warmup_num_steps": 0,
      "warmup_type": "linear"
    }
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": [0.9, 0.95],
      "eps": 1e-8,
      "weight_decay": 0.1
    }
  }
}

jzhang38 commented 2 months ago

Thanks for spotting this!! This is really weird...

I can verify on my end that the lr indeed becomes zero after several steps ...

I mainly followed YaRN's training recipe and its DeepSpeed config and did not expect this to happen.

I thought arXiv:2402.10171 also used a WarmupDecayLR? https://github.com/FranxYao/Long-Context-Data-Engineering/issues/8#issuecomment-2003917697

jzhang38 commented 2 months ago

@Bostoncake I've updated the config based on your suggestions. Do you have any idea why this could happen?

Bostoncake commented 2 months ago

> @Bostoncake I've updated the config based on your suggestions. Do you have any idea why this could happen?

I think it might have something to do with "total_num_steps": "auto". Maybe you need to explicitly specify the total number of steps when initializing the scheduler. I haven't tried this, though.
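For illustration only (untested, and the concrete numbers are placeholders): a scheduler block with total_num_steps spelled out instead of "auto" could look like this, e.g. matching --max-train-steps 2500 from the original script:

{
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 2e-5,
      "warmup_num_steps": 0,
      "warmup_type": "linear",
      "total_num_steps": 2500
    }
  }
}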

HaoshengZou commented 1 month ago

@jzhang38 @Bostoncake It is because scheduler.step() is called num_gpu times for each training step. So with --max-train-steps 1000, the lr decays to 0 at step 125 with 8 GPUs.

For now, the best solution IMHO is indeed to specify min_lr in zero3_offload.json instead of using "auto".
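To make the arithmetic concrete, here is a tiny self-contained toy (not the project's code) that mimics a linear-decay schedule being stepped num_gpu times per optimizer update; with 1000 intended steps and 8 GPUs the learning rate hits 0 after 125 updates, matching the observation above:

import torch

max_train_steps = 1000        # intended number of optimizer updates
num_gpus = 8                  # scheduler.step() fires once per GPU per update
total_num_steps = max_train_steps  # what the decay schedule is configured for

opt = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=2e-5)
# Stand-in for WarmupDecayLR: linear decay to 0 over total_num_steps scheduler calls.
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda s: max(0.0, 1 - s / total_num_steps)
)

for update in range(1, max_train_steps + 1):
    opt.step()
    for _ in range(num_gpus):  # the bug: one scheduler step per GPU, not per update
        sched.step()
    if opt.param_groups[0]["lr"] == 0.0:
        print(f"lr reached 0 at update {update}")  # -> 125
        break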

LzhinFdu commented 1 month ago

> @jzhang38 @Bostoncake It is because scheduler.step() is called num_gpu times for each training step. So with --max-train-steps 1000, the lr decays to 0 at step 125 with 8 GPUs.
>
> For now, the best solution IMHO is indeed to specify min_lr in zero3_offload.json instead of using "auto".

+1. With 2×A100, the model parameters are no longer updated starting from max_train_steps/2.