Closed: chiragjn closed this issue 8 months ago
This is unlikely to explain your observations, but theoretically it could be disk caching at work, where loading the model is slow the first time around but fast on subsequent runs.
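One rough way to test the disk-cache theory is to time a raw read of the safetensors shards (sketch below; the path is a placeholder): a cold read is bound by disk throughput, while a warm page cache makes it near-instant.

```python
import glob
import time

# Placeholder path; point this at the directory holding the model shards
shards = glob.glob("/path/to/model/*.safetensors")

start = time.perf_counter()
total_bytes = 0
for shard in shards:
    with open(shard, "rb") as f:
        # Read in 16 MiB chunks so we measure throughput, not memory allocation
        while chunk := f.read(1 << 24):
            total_bytes += len(chunk)

elapsed = time.perf_counter() - start
print(f"read {total_bytes / 1e9:.1f} GB in {elapsed:.1f}s")
```

If the second invocation is dramatically faster than the first, the page cache is doing the heavy lifting.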
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
I am noticing some weird, not-so-easy-to-reproduce behaviour. I am working with Llama 2 70B for QLoRA fine-tuning on 2 x A100 80 GB GPUs in DDP mode.
The safetensors weights are already present on disk. When I load the model with the below config, it takes about 45 minutes (~2700 seconds)!
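(The config snippet did not survive the copy here; the sketch below only illustrates that kind of 4-bit QLoRA load, with a placeholder model id and quantization parameters rather than my exact settings.)

```python
import time

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit weights for QLoRA
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

start = time.perf_counter()
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",             # placeholder model id
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map={"": 0},                      # set to the local rank when running under DDP
)
print(f"model loaded in {time.perf_counter() - start:.1f}s")
```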
So I tried loading the model without DDP and without quantization, using plain transformers.
That took about 730 seconds.
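Again, only a sketch of what that plain, non-quantized load looked like (the exact arguments may have differed):

```python
import time

import torch
from transformers import AutoModelForCausalLM

start = time.perf_counter()
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",   # placeholder model id
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,        # load weights shard by shard instead of materializing the full model in RAM first
)
print(f"model loaded in {time.perf_counter() - start:.1f}s")
```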
But once that run fully completed and I quit and relaunched my fine-tuning script, almost every time the model managed to load within 90 seconds, which was very puzzling to me.
I know there are factors like active memory and GPU memory consumption that go into accelerate's dispatch calculations, so I have waited for several minutes between runs and made sure GPU memory is clear before starting.
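For reference, a minimal check like the one below can confirm the GPUs are clear before a run (illustrative; not the exact commands I used):

```python
import torch

# Print free vs. total memory for each visible GPU before launching a run
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
```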
What can explain such a dramatic speedup?
Expected behavior
Ideally, it would be amazing if such a large model could consistently load within 90 seconds every time.
EDIT: Saw the same behavior with Mixtral.