Hi,
First of all, thanks for putting together the nicely formatted code for fine-tuning LLaMA 2 in 4-bit. I was able to follow all the steps and set up training of the model (as shown in your tutorial / IPython notebook): https://www.philschmid.de/instruction-tune-llama-2
Your tutorial mentions that the training time on a g5.2xlarge without Flash Attention is around 3 hours. However, running your code shows an estimated training time of 40 hours! Can you help narrow down the difference/issue?
I am attaching some screenshots. At a high level, I suspect there is a bottleneck in the data loader (the code is only using 1 CPU core). I did try setting `dataloader_num_workers` in `TrainingArguments`, but that did not help. GPU utilization seems decent.
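In case it helps isolate the problem, here is a minimal, framework-agnostic sketch I used to check the data-loader theory: it times how long each step spends waiting on the next batch versus computing. The `batch_iter` and `train_step` arguments are hypothetical stand-ins for the real DataLoader and the forward/backward step, not part of your tutorial code.

```python
import time

def profile_loop(batch_iter, train_step, max_steps=50):
    """Split wall time per step into data-loading wait vs. compute.

    batch_iter: any iterable yielding batches (stand-in for the real DataLoader)
    train_step: callable running forward/backward on one batch (stand-in)
    Returns (total_load_seconds, total_step_seconds).
    """
    load_time = 0.0
    step_time = 0.0
    it = iter(batch_iter)
    for _ in range(max_steps):
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent waiting on the loader
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)     # time spent in actual computation
        t2 = time.perf_counter()
        load_time += t1 - t0
        step_time += t2 - t1
    return load_time, step_time
```

If `load_time` dominates `step_time`, the loader really is the bottleneck; if `step_time` dominates while GPU utilization looks decent, the slowdown is elsewhere.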
Any thoughts @philschmid ?