linkedin / Liger-Kernel

Efficient Triton Kernels for LLM Training
https://arxiv.org/pdf/2410.10989
BSD 2-Clause "Simplified" License

Loss does not drop when using Liger Kernel at Qwen2.5 #257

Open Se-Hun opened 1 month ago

Se-Hun commented 1 month ago

🐛 Describe the bug

I am trying to instruction-tune Qwen2.5-14B-Instruct with Liger Kernel.

I know that Liger Kernel is supported in the dev version of Hugging Face transformers. However, when training the Qwen2.5 model with Liger Kernel, the loss value does not drop. Is Qwen2.5 not supported yet?

Reproduce

Python Code Example :

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer

model_name = "Qwen/Qwen2.5-14B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

...

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()

Run Example :

deepspeed --include localhost:0,1 --master_port 61000 train.py \
    --learning_rate=1e-5 \
    --lr_scheduler_type=cosine \
    --max_length=8192 \
    --per_device_train_batch_size=4 \
    --gradient_accumulation_steps=1 \
    --evaluation_strategy=no \
    --num_train_epochs=3 \
    --save_strategy=epoch \
    --logging_strategy=steps \
    --logging_steps=1 \
    --save_total_limit=1 \
    --remove_unused_columns=False \
    --dataloader_num_workers=16 \
    --warmup_ratio=0.03 \
    --gradient_checkpointing=True \
    --torch_compile=True \
    --optim=adafactor \
    --bf16 \
    --deepspeed=./config/zero3.json \
    --use_liger_kernel=True

Versions

Environment Report:

Operating System: Linux-5.15.0-1047-oracle-x86_64-with-glibc2.35
Python version: 3.10.14
PyTorch version: 2.4.0+cu121
CUDA version: 12.1
Triton version: 3.0.0
Transformers version: 4.45.0.dev0

tyler-romero commented 1 month ago

Can you update your code example with how you're applying LigerKernel?

tyler-romero commented 1 month ago

Fwiw, Qwen2.5 uses the same model architecture as Qwen2 so Liger should still work correctly: https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/config.json#L3
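As a rough illustration of the check above, here is a minimal sketch that reads the architecture field from a config. The JSON excerpt is an assumption meant to mirror the relevant fields of the linked config.json, and `uses_qwen2_architecture` is a hypothetical helper, not part of transformers or Liger.

```python
import json

# Excerpt of the fields that matter here; assumed to mirror the linked config.json.
config_text = """
{
  "architectures": ["Qwen2ForCausalLM"],
  "model_type": "qwen2"
}
"""


def uses_qwen2_architecture(config: dict) -> bool:
    """Hypothetical helper: True if the config declares the Qwen2 causal LM class."""
    return "Qwen2ForCausalLM" in config.get("architectures", [])


config = json.loads(config_text)
print(uses_qwen2_architecture(config))
```

If this prints True for a checkpoint, any patch that targets the Qwen2 model classes should apply to it as well.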

Se-Hun commented 1 month ago

@tyler-romero Thank you for the quick response. I simply enabled Liger through the --use_liger_kernel=True option of the Hugging Face Trainer. While it is true that Qwen2.5 uses the same architecture as Qwen2, the loss did not decrease for Qwen2.5 with Liger enabled. When training Qwen2.5 without Liger, the loss decreased as expected.

chiwanpark commented 1 month ago

@Se-Hun cc @tyler-romero This is the same issue as #268: the monkey-patch methods, when applied to an already instantiated model, do not copy the weights of the original model. The HF Trainer and TRL SFTTrainer rely on these methods, while axolotl does not. You may use axolotl until the issue is fixed.
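To make the failure mode concrete, here is a toy sketch in plain Python. Every class and function in it (`Norm`, `FastNorm`, `Model`, `load_pretrained`, `patch_without_copy`, `patch_with_copy`) is hypothetical and only stands in for the real modules; it is not Liger's or transformers' actual code. It shows that swapping a layer on an already loaded model for a freshly constructed replacement discards the loaded weights unless they are copied over.

```python
class Norm:
    """Stand-in for a layer of the original model (hypothetical)."""

    def __init__(self, size, weight=None):
        # A freshly built layer starts from its initial weight (all ones here).
        self.weight = list(weight) if weight is not None else [1.0] * size


class FastNorm(Norm):
    """Stand-in for an optimized drop-in replacement layer (hypothetical)."""


class Model:
    def __init__(self, size):
        self.norm = Norm(size)


def load_pretrained(size=3):
    """Pretend to load a checkpoint by overwriting the initial weights."""
    model = Model(size)
    model.norm.weight = [0.5, 2.0, -1.0][:size]
    return model


def patch_without_copy(model):
    # Buggy pattern: the replacement layer is rebuilt from scratch,
    # so the checkpoint values are silently lost.
    model.norm = FastNorm(len(model.norm.weight))


def patch_with_copy(model):
    # Safer pattern: carry the existing weights into the replacement layer.
    model.norm = FastNorm(len(model.norm.weight), weight=model.norm.weight)


broken = load_pretrained()
patch_without_copy(broken)
print(broken.norm.weight)  # [1.0, 1.0, 1.0]

fixed = load_pretrained()
patch_with_copy(fixed)
print(fixed.norm.weight)  # [0.5, 2.0, -1.0]
```

In this toy, patching the model class before the model is instantiated would avoid the problem entirely, because the checkpoint is then loaded directly into the replacement layers.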

Arcmoon-Hu commented 2 weeks ago

In my case, when training Qwen2.5-14B-Instruct, the grad norm quickly increases to NaN.

ByronHsu commented 2 weeks ago

@Arcmoon-Hu which version of liger-kernel are you on, and do you not see the issue without applying the kernel?

fzyzcjy commented 2 weeks ago

Hi, are there any updates? Thanks!

Arcmoon-Hu commented 2 weeks ago

> @Arcmoon-Hu which version of liger-kernel are you on, and do you not see the issue without applying the kernel?

Thanks for the quick reply. The version of liger-kernel is 0.3.1. Actually, I use LLaMA-Factory to train my model, and everything is fine without applying the kernel. The only change I made was to add one parameter line to the training config:

enable_liger_kernel: true

If you need other information, I can provide it.

ByronHsu commented 2 weeks ago

@Arcmoon-Hu could you provide a minimal reproducible script for the issue? thanks!

Arcmoon-Hu commented 1 week ago

> @Arcmoon-Hu could you provide a minimal reproducible script for the issue? thanks!

The problem is solved; I just pulled the latest code and rebuilt it. It's really awesome! I tested the Qwen2.5-14B-Instruct model on one 8*A800 machine: the per-device batch size doubled (2 ➡️ 4), and with the total batch size kept equal, the training time went from 14 hours ➡️ 10.5 hours.

Here is the loss curve w/o liger-kernel using transformers training (image); the red line is transformers with liger-kernel.

By the way, I have changed the code according to #322.
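For reference, the reported training times work out to roughly a 1.33x speedup, or 25% less wall-clock time. A quick check of that arithmetic, using only the two figures quoted above (the variable names are illustrative):

```python
baseline_hours = 14.0  # reported training time without liger-kernel
liger_hours = 10.5     # reported training time with liger-kernel

speedup = baseline_hours / liger_hours
time_saved_fraction = (baseline_hours - liger_hours) / baseline_hours

print(f"speedup: {speedup:.2f}x")                # speedup: 1.33x
print(f"time saved: {time_saved_fraction:.0%}")  # time saved: 25%
```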

ByronHsu commented 1 week ago

@Arcmoon-Hu Good to know. I am aware of the transformers issue and will fix it ASAP.