Open JasonZhu1313 opened 3 months ago
cc @ArthurZucker @muellerzr
Sounds great! Awesome work from your team 🥳
https://github.com/linkedin/Liger-Kernel/issues/70 Would love to discuss a better UX. cc @ArthurZucker @philschmid et al
It looks like there is still an issue when using use_liger_kernel=True and torch_compile=True in Trainer with Llama: https://github.com/linkedin/Liger-Kernel/issues/174
Feature request
Integrate the Liger (LinkedIn GPU Efficient Runtime) Kernel into the Hugging Face Trainer, so that users can decide whether to enable the kernels with a simple flag.
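A minimal sketch of what the flag-based UX could look like from the user's side. The flag name `use_liger_kernel` is the one quoted in the comments on this issue. The helper names `liger_training_kwargs` and `make_training_args` are hypothetical and only for illustration. `torch_compile` is left off here because a comment on this issue reports a problem when the two are combined.

```python
from typing import Any, Dict


def liger_training_kwargs(output_dir: str, use_liger: bool = True) -> Dict[str, Any]:
    """Collect keyword arguments intended for transformers.TrainingArguments.

    Hypothetical helper: it only builds a dict, so it runs without
    transformers or liger-kernel installed.
    """
    return {
        "output_dir": output_dir,
        # Flag name as quoted in this thread; enables the Liger kernels.
        "use_liger_kernel": use_liger,
        # Kept off: the thread reports an issue when this is combined
        # with use_liger_kernel=True on Llama.
        "torch_compile": False,
    }


def make_training_args(output_dir: str):
    """Build TrainingArguments from the kwargs above.

    Assumes a transformers release that already ships the
    use_liger_kernel flag; the import is lazy so the helper above
    stays usable without it.
    """
    from transformers import TrainingArguments

    return TrainingArguments(**liger_training_kwargs(output_dir))
```

With this shape, switching the kernels on or off is a one-argument change (`use_liger=False`) and the rest of the Trainer setup stays the same.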
Motivation
Liger (LinkedIn GPU Efficient Runtime) Kernel is a collection of Triton kernels designed specifically for LLM training. We have implemented Hugging Face-compatible RMSNorm, RoPE, SwiGLU, CrossEntropy, and FusedLinearCrossEntropy, with more to come. It can effectively increase multi-GPU training throughput by 20% and reduce memory usage by 60%. The kernels work out of the box with flash attention, PyTorch FSDP, and Microsoft DeepSpeed. We welcome contributions from the community to gather the best kernels for LLM training.
Your contribution
We (LinkedIn) will do the work needed for a smooth integration and would need HF review and feedback on the changes.
Benchmark
Benchmark conditions: LLaMA 3-8B, Alpaca Dataset, Max seq len = 512, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 4 A100s.
In this benchmark, throughput increases by approximately 20% with more data, while GPU memory usage is reduced by 40%. This means you can train the model on smaller GPUs, with larger batch sizes, or with longer sequence lengths at no additional cost.
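As a rough back-of-the-envelope illustration of the "larger batch sizes" point: if peak memory is assumed to scale proportionally with batch size (a simplification, since weights and optimizer state do not), a 40% memory reduction translates into the following headroom. The function name `batch_headroom` is hypothetical, and the result is an estimate, not a measured number.

```python
def batch_headroom(memory_reduction: float) -> float:
    """Estimate the batch-size multiplier that fits in the same memory.

    Assumes peak memory is proportional to batch size, which is a
    simplification: fixed costs such as weights are ignored.
    """
    if not 0.0 <= memory_reduction < 1.0:
        raise ValueError("memory_reduction must be in [0, 1)")
    return 1.0 / (1.0 - memory_reduction)


# 40% reduction from the benchmark above: 1 / 0.6
print(round(batch_headroom(0.40), 2))  # 1.67
```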
For a more detailed benchmark setup and more exciting efficiency gains for multi-head training (Medusa), please refer to the original repo: https://github.com/linkedin/Liger-Kernel (the repo will be public soon)