ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

Implement Sample Packing for Efficient LLM Training #3538

Open fire opened 1 year ago

fire commented 1 year ago

Is your feature request related to a problem? Please describe. LLM training is expensive; supporting sample packing would make training more efficient.

Describe the use case I am trying to train an LLM to generate blueprint floorplans based on an Apache 2.0-licensed training data set from May 2022.

Describe the solution you'd like

"Rather than processing only a single sample per batch, the method aims to accommodate as many samples as possible within your sequence length, padding any remaining space. Additional calculations are performed to prevent contamination between samples in the same batch and to minimize impact on your preset hyperparameters."

Use sample packing to reduce the training time.

https://arxiv.org/pdf/2107.02027.pdf
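The quoted description above can be sketched in a few lines. This is a hypothetical illustration, not Ludwig's or axolotl's actual implementation; the function name `pack_samples` and its signature are made up for this example. It greedily concatenates tokenized samples into fixed-length sequences and emits per-token segment IDs, which a downstream attention mask can use to prevent contamination between samples sharing a sequence.

```python
def pack_samples(tokenized, max_len, pad_id=0):
    """Greedily pack variable-length token lists into sequences of max_len.

    Assumes each individual sample fits within max_len. Returns
    (sequences, segment_ids); segment_ids label which sample each token
    belongs to (0 = padding), so an attention mask can block
    cross-sample attention within a packed sequence.
    """
    sequences, segments = [], []
    cur, seg, seg_idx = [], [], 1
    for sample in tokenized:
        if len(cur) + len(sample) > max_len:
            # Flush the current sequence, padding any remaining space.
            cur += [pad_id] * (max_len - len(cur))
            seg += [0] * (max_len - len(seg))
            sequences.append(cur)
            segments.append(seg)
            cur, seg, seg_idx = [], [], 1
        cur += sample
        seg += [seg_idx] * len(sample)
        seg_idx += 1
    if cur:
        cur += [pad_id] * (max_len - len(cur))
        seg += [0] * (max_len - len(seg))
        sequences.append(cur)
        segments.append(seg)
    return sequences, segments
```

For example, packing `[[1, 2, 3], [4, 5], [6, 7, 8, 9]]` with `max_len=6` yields two sequences instead of three padded ones: `[1, 2, 3, 4, 5, 0]` and `[6, 7, 8, 9, 0, 0]`.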

Describe alternatives you've considered

Buy more GPUS or spend more time.

Additional context

See https://github.com/OpenAccess-AI-Collective/axolotl/pull/285

tgaddair commented 1 year ago

Hey @fire, thanks for raising this issue! It's something I've also been musing over recently, so good to know there's interest!

One question: my impression is this technique would primarily benefit you if you have enough headroom during training to accommodate the extra samples (once we remove padding). One way this might manifest itself is if you're having to decrease max_sequence_length to avoid CUDA OOMs because some batches are just a lot longer than others, leading to random errors in the middle of training. Has anything like that been your experience?
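One rough way to estimate the headroom in question: measure what fraction of token slots in a padded batch are wasted on padding. This is an illustrative sketch (the helper name is made up, not part of Ludwig); a high padding fraction suggests packing would recover meaningful throughput.

```python
def padding_fraction(lengths, max_sequence_length):
    """Fraction of token slots wasted on padding when every sample is
    padded (or truncated) to max_sequence_length."""
    used = sum(min(length, max_sequence_length) for length in lengths)
    total = len(lengths) * max_sequence_length
    return 1 - used / total
```

For instance, samples of lengths 10, 20, and 30 padded to 40 tokens waste half the batch: `padding_fraction([10, 20, 30], 40)` returns `0.5`.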

fire commented 1 year ago

I am still a novice at machine learning model finetuning and don't have access to the relevant hardware, but a peer suggested this improvement for my consumer gaming graphics card.

I wanted to file a formal report so that I didn't forget.

For people who can use large-memory GPUs, the advantage is that they can also use large-context-window models. I imagine the performance increase scales with the wallet.

fire commented 1 year ago

My data set is very short.

https://sketchfab.com/3d-models/cbf-architype-03-b855f5ec8f944a4ba476fdd41e5adcb0

https://huggingface.co/spaces/ifire/Architext_deployed

Prompt: `two bedrooms and two bathrooms`

```json
{"the bathroom is not adjacent to the living room [layout] bathroom": [9.507042253521128, 12.112676056338028, 7.464788732394367, 12.112676056338028, 7.464788732394367, 11.056338028169014, 9.507042253521128, 11.056338028169014], "bedroom1": [13.661971830985916, 10.070422535211268, 10.563380281690142, 10.070422535211268, 10.563380281690142, 9.014084507042254, 9.507042253521128, 9.014084507042254, 9.507042253521128, 4.859154929577465, 13.661971830985916, 4.859154929577465], "bedroom2": [13.661971830985916, 12.112676056338028, 9.507042253521128, 12.112676056338028, 9.507042253521128, 11.056338028169014, 10.563380281690142, 11.056338028169014, 10.563380281690142, 10.070422535211268, 13.661971830985916, 10.070422535211268], "living_room": [9.507042253521128, 9.014084507042254, 4.366197183098592, 9.014084507042254, 4.366197183098592, 3.8732394366197185, 9.507042253521128, 3.8732394366197185], "kitchen": [6.408450704225352, 14.15492957746479, 4.366197183098592, 14.15492957746479, 4.366197183098592, 10.070422535211268, 6.408450704225352, 10.070422535211268], "corridor": [10.563380281690142, 11.056338028169014, 7.464788732394367, 11.056338028169014, 7.464788732394367, 10.070422535211268, 4.366197183098592, 10.070422535211268, 4.366197183098592, 9.014084507042254, 10.563380281690142, 9.014084507042254]}
```

(image attachment)

tgaddair commented 1 year ago

Thanks for the additional context @fire! Just to make sure I understand your current issues: are you finding that training time is taking longer than expected, or that you're running into out-of-memory errors, or something else? It might be that the best solution to address your issue is something else, like gradient checkpointing, etc.
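For reference, gradient checkpointing trades extra compute for lower peak memory by recomputing activations during the backward pass instead of storing them. A minimal PyTorch sketch (independent of how Ludwig exposes this in its config; the model here is a toy):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedMLP(nn.Module):
    """Toy model whose inner block's activations are recomputed during
    backward rather than stored, reducing peak memory usage."""

    def __init__(self, dim=32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, x):
        # use_reentrant=False is the recommended mode in recent PyTorch.
        return checkpoint(self.block, x, use_reentrant=False)


model = CheckpointedMLP()
x = torch.randn(4, 32, requires_grad=True)
loss = model(x).sum()
loss.backward()  # activations inside `block` are recomputed here
```

The memory savings grow with model depth, which is why this is often the first thing to try before reducing `max_sequence_length`.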

Also, can you share some details about your current Ludwig config and the GPU you're using?