philschmid / deep-learning-pytorch-huggingface


Instruction tuning of Llama 2 is significantly slower than the documented 3-hour fine-tuning time on an A10G. #35

Open mlscientist2 opened 10 months ago

mlscientist2 commented 10 months ago

Hi,

First of all, thanks for putting together the nicely formatted code for fine-tuning Llama 2 in 4-bit. I was able to follow all the steps and set up training of the model (as shown in your tutorial/IPython notebook): https://www.philschmid.de/instruction-tune-llama-2
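(For reference, the 4-bit loading step I'm running follows the post; a minimal sketch, with the model ID and quantization settings as I understood them from the tutorial:)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization (QLoRA-style setup), as I set it up from the blog post
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    use_cache=False,   # disabled because gradient checkpointing is on
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```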

Your tutorial mentions that the training time on a g5.2xlarge without flash attention is around 3 hours. However, running your code shows a training time of 40 hours! Can you help narrow down the difference/issue?
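(The 40-hour figure is the tqdm estimate from the progress bar; one way to double-check it is to cap the run and read the throughput the trainer itself reports. A rough sketch, assuming `trainer` is the SFTTrainer built as in the notebook:)

```python
# Cap the run (e.g. max_steps=20 in TrainingArguments) and read the metrics
# transformers reports, rather than trusting the progress-bar ETA.
result = trainer.train()
print(result.metrics["train_runtime"])             # seconds for the capped run
print(result.metrics["train_samples_per_second"])  # compare vs. the 3-hour figure
```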

I am attaching some screenshots. At a high level, I suspect there is a bottleneck in the data loader (since the code is only using one CPU core). I did try setting the dataloader_num_workers flag in TrainingArguments, roughly as below, but that did not help. GPU utilization seems decent.
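(What I tried, approximately; `dataloader_num_workers` is forwarded to the PyTorch DataLoader's `num_workers`, and the other values here are illustrative rather than the notebook's exact configuration:)

```python
from transformers import TrainingArguments

# Attempted fix: give input preprocessing more CPU workers.
# Values other than dataloader_num_workers are illustrative.
args = TrainingArguments(
    output_dir="llama-7b-int4-dolly",
    num_train_epochs=3,
    per_device_train_batch_size=6,
    gradient_checkpointing=True,
    bf16=True,
    dataloader_num_workers=4,  # made no difference to throughput for me
)
```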


Any thoughts, @philschmid?

hassantsyed commented 7 months ago

Any ideas here? I'm seeing 11 hours for a 1024 context length and 22 hours for 2048. Would love to get this down to 3!
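(One thing worth checking, since the 3-hour number in the post is quoted without flash attention and attention cost grows quickly with context length: on a recent transformers version you can request FlashAttention 2 at load time, assuming the `flash-attn` package is installed. A sketch:)

```python
import torch
from transformers import AutoModelForCausalLM

# Request FlashAttention 2 kernels (needs the `flash-attn` package and an
# Ampere-or-newer GPU such as the A10G); helps most at longer contexts.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```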