Se-Hun opened this issue 1 month ago
Can you update your code example with how you're applying LigerKernel?
Fwiw, Qwen2.5 uses the same model architecture as Qwen2 so Liger should still work correctly: https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/config.json#L3
@tyler-romero
Thank you for the quick response.
I simply used Liger through the `--use_liger_kernel=True` option in the Hugging Face trainer.
While it is true that Qwen2.5 uses the same architecture as Qwen2, applying Liger did not result in a decrease in the loss value for Qwen2.5.
When training Qwen2.5 without Liger, the loss value decreased as expected.
@Se-Hun cc @tyler-romero This is the same issue as #268; the monkey-patch methods applied to an already instantiated model do not copy the weights of the original model. The HF Trainer and TRL SFTTrainer rely on these methods, while axolotl does not. You may use axolotl until the issue is fixed.
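To illustrate the ordering point above, here is a minimal sketch (not an official fix) of patching before the model is built; the model name and dtype are placeholders, and Qwen2.5 reuses the Qwen2 architecture, so the qwen2 patch function is the relevant one:

```python
# Sketch only: apply the Liger patch *before* instantiating the model, so the
# patched Qwen2 modules are in place when the weights are loaded.
import torch
from liger_kernel.transformers import apply_liger_kernel_to_qwen2
from transformers import AutoModelForCausalLM, AutoTokenizer

apply_liger_kernel_to_qwen2()  # must run before from_pretrained()

model_name = "Qwen/Qwen2.5-14B-Instruct"  # placeholder; any Qwen2/Qwen2.5 checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# ...then hand `model` to the HF Trainer / TRL SFTTrainer as usual.
```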
In my case, training Qwen2.5-14B-Instruct, the grad norm quickly increases to NaN.
@Arcmoon-Hu which version of liger-kernel are you on, and did you not see the issue without applying the kernel?
Hi, are there any updates? Thanks!
> @Arcmoon-Hu which version of liger-kernel are you on, and did you not see the issue without applying the kernel?
Thanks for the quick reply. The version of liger-kernel is 0.3.1. Actually, I use LLaMA-Factory to train my model, and everything is fine without applying the kernel. The only change I made was to add one line to the training config:
enable_liger_kernel: true
If you need other information, I can supply it.
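For context, a rough sketch of where that line sits in a LLaMA-Factory SFT YAML config; apart from `enable_liger_kernel: true`, the keys and values below are illustrative placeholders, not the exact config used here:

```yaml
# Illustrative LLaMA-Factory SFT config; only enable_liger_kernel comes from this thread.
model_name_or_path: Qwen/Qwen2.5-14B-Instruct
stage: sft
do_train: true
finetuning_type: full
dataset: my_sft_dataset          # hypothetical dataset name
template: qwen
output_dir: saves/qwen2.5-14b-sft
per_device_train_batch_size: 2
bf16: true
enable_liger_kernel: true        # the one-line change mentioned above
```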
@Arcmoon-Hu could you provide a minimal reproducible script for the issue? thanks!
> @Arcmoon-Hu could you provide a minimal reproducible script for the issue? thanks!
The question is solved. I just pulled the latest code and rebuilt it, and it's really awesome! I tested the Qwen2.5-14B-Instruct model on one 8×A800 machine: the per-device batch size doubled (2 ➡️ 4), and keeping the total batch size equal, the training time went from 14 hours ➡️ 10.5 hours. Here is the loss curve with and without liger-kernel using Transformers training; the red line is Transformers with liger-kernel. By the way, I have changed the code according to #322.
@Arcmoon-Hu Good to know. I am aware of the transformers issue and will fix it ASAP.
🐛 Describe the bug
I am trying to instruction-tune Qwen2.5-14B-Instruct with Liger Kernel.
I know that Liger Kernel is supported in the dev version of Hugging Face Transformers. However, when training the Qwen2.5 model with Liger Kernel, the loss value does not drop. Is Qwen2.5 not supported yet?
Reproduce
Python Code Example :
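(The original script was not captured in this report; below is a minimal sketch of the setup described in the thread: the Hugging Face Trainer with `use_liger_kernel=True` in `TrainingArguments`. The model path, dataset, and hyperparameters are placeholders.)

```python
# Minimal sketch (not the original script): instruction tuning with the HF Trainer
# and use_liger_kernel=True. Dataset, paths, and hyperparameters are placeholders.
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "Qwen/Qwen2.5-14B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Placeholder instruction-tuning dataset with a "text" column.
dataset = load_dataset("json", data_files="train.json", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="qwen2.5-14b-sft",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=1e-5,
    bf16=True,
    logging_steps=10,
    use_liger_kernel=True,  # enables the Liger patch inside the Trainer
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```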
Run Example :
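(Also not captured; an illustrative multi-GPU launch for a script like the one above. The file name and GPU count are placeholders.)

```bash
# Placeholder launch command for the sketch above.
torchrun --nproc_per_node=8 train.py
```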
Versions
Environment Report:
Operating System: Linux-5.15.0-1047-oracle-x86_64-with-glibc2.35
Python version: 3.10.14
PyTorch version: 2.4.0+cu121
CUDA version: 12.1
Triton version: 3.0.0
Transformers version: 4.45.0.dev0