pytorch / torchtune

A Native-PyTorch Library for LLM Fine-tuning
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License
4.05k stars 371 forks

v0.3 regression, full_finetune_distributed slower? #1718

Open Delaunay opened 3 hours ago

Delaunay commented 3 hours ago

The recipe full_finetune_distributed appears to be much slower in v0.3 than in v0.2.1.

Everything seems to work as usual, but my job that used to complete in v0.2.1 times out in v0.3.0.

I don't have much detail yet, but since you are more familiar with the code base, maybe you already have an idea based on what changed recently!

joecummings commented 3 hours ago

Can you share a few more details around which models you're using, size of dataset, machine type?

Off the very top of my head, not sure what would be going on.