philschmid / deep-learning-pytorch-huggingface


Instruction tuning of Llama 2 is significantly slower than the documented 3-hour fine-tuning time on an A10G. #35

Open mlscientist2 opened 10 months ago

mlscientist2 commented 10 months ago

Hi,

First of all, thanks for putting together the nicely formatted code for fine-tuning Llama 2 in 4-bit. I was able to follow all the steps and set up training of the model (as shown in your tutorial/IPython notebook): https://www.philschmid.de/instruction-tune-llama-2
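(For reference, the 4-bit loading step I'm running follows the post; a minimal sketch, with the model ID and quantization settings as I understood them from the tutorial:)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization (QLoRA-style setup), as I set it up from the blog post
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    use_cache=False,   # disabled because gradient checkpointing is on
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```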

Your tutorial mentions that the training time on a g5.2xlarge without flash attention is around 3 hours. However, running your code shows a training time of 40 hours! Can you help narrow down the difference/issue?
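(The 40-hour figure is the tqdm estimate from the progress bar; one way to double-check it is to cap the run and read the throughput the trainer itself reports. A rough sketch, assuming `trainer` is the SFTTrainer built as in the notebook:)

```python
# Cap the run (e.g. max_steps=20 in TrainingArguments) and read the metrics
# transformers reports, rather than trusting the progress-bar ETA.
result = trainer.train()
print(result.metrics["train_runtime"])             # seconds for the capped run
print(result.metrics["train_samples_per_second"])  # compare vs. the 3-hour figure
```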

I am attaching some screenshots. At a high level, I suspect there is a bottleneck in the data loader (since the code is only using one CPU core). I did try setting the dataloader_num_workers flag in TrainingArguments, roughly as below, but that did not help. GPU utilization seems decent.
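(What I tried, approximately; `dataloader_num_workers` is forwarded to the PyTorch DataLoader's `num_workers`, and the other values here are illustrative rather than the notebook's exact configuration:)

```python
from transformers import TrainingArguments

# Attempted fix: give input preprocessing more CPU workers.
# Values other than dataloader_num_workers are illustrative.
args = TrainingArguments(
    output_dir="llama-7b-int4-dolly",
    num_train_epochs=3,
    per_device_train_batch_size=6,
    gradient_checkpointing=True,
    bf16=True,
    dataloader_num_workers=4,  # made no difference to throughput for me
)
```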


Any thoughts, @philschmid?

hassantsyed commented 7 months ago

Any ideas here? I'm seeing 11 hours for a 1024 context length and 22 hours for 2048. Would love to get this down to 3!
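(One thing worth checking, since the 3-hour number in the post is quoted without flash attention and attention cost grows quickly with context length: on a recent transformers version you can request FlashAttention 2 at load time, assuming the `flash-attn` package is installed. A sketch:)

```python
import torch
from transformers import AutoModelForCausalLM

# Request FlashAttention 2 kernels (needs the `flash-attn` package and an
# Ampere-or-newer GPU such as the A10G); helps most at longer contexts.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```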