pytorch / torchtune

PyTorch native finetuning library
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

Support for Phi-3-mini-128k-instruct and larger context length models #1120

Open · dcsuka opened this issue 5 months ago

dcsuka commented 5 months ago

All of the models torchtune currently supports have relatively low context lengths (<32k). I have had success with long-context finetuning in torchtune by training gradientai/Llama-3-8B-Instruct-Gradient-1048k and simply raising the default max_seq_len for llama3 in the torchtune model scripts via an editable install.
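For anyone who would rather not edit the installed sources, here is a minimal sketch of the same idea as a custom model builder. The parameter names and values mirror torchtune's `llama3_8b` builder as I understand it, but they may differ across versions, so treat them as assumptions and check `torchtune/models/llama3/_model_builders.py` in your install:

```python
# Sketch only: a custom builder that mirrors llama3_8b() but with a larger
# max_seq_len, so the installed torchtune package does not need to be patched.
# Exact parameter names/defaults may differ between torchtune versions.
from torchtune.models.llama3 import llama3
from torchtune.modules import TransformerDecoder


def llama3_8b_long_context(max_seq_len: int = 1_048_576) -> TransformerDecoder:
    """Llama-3-8B architecture with an enlarged context window."""
    return llama3(
        vocab_size=128_256,
        num_layers=32,
        num_heads=32,
        num_kv_heads=8,
        embed_dim=4096,
        max_seq_len=max_seq_len,  # bumped from the 8192 default
        intermediate_dim=14336,
        rope_base=500_000,
    )
```

A recipe config could then point `model._component_` at this builder instead of `torchtune.models.llama3.llama3_8b`.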

However, I would also like to finetune a long-context model with fewer parameters, namely Phi-3-mini-128k-instruct. Applying the same strategy of raising the permitted max_seq_len in the model scripts produces training losses that are orders of magnitude higher than with Llama on a simple task, and the losses never decrease. This leads me to believe that deeper implementation changes are needed for Phi-3-mini-128k-instruct. Does anybody know what those changes might be?
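For context on why raising max_seq_len alone may not be enough here: the checkpoint's Hugging Face config advertises a scaled-RoPE scheme, which plain rotary embeddings won't reproduce. A quick check, assuming the `transformers` package is installed (the commented values are what I would expect for the 128k variant, not guaranteed):

```python
# Diagnostic sketch: inspect the Hugging Face config to see whether a
# checkpoint relies on a scaled RoPE variant.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct", trust_remote_code=True
)
print(cfg.max_position_embeddings)         # expected: 131072 for the 128k variant
print(getattr(cfg, "rope_scaling", None))  # non-None => checkpoint expects RoPE scaling
```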

SalmanMohammadi commented 4 months ago

Hey @dcsuka, thanks for raising this - it's a great motivator for adding some new functionality in torchtune. We're working on this at the moment by implementing RoPE scaling (see the PR linked above) - feel free to follow along there!
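For readers following along, here is a rough sketch of the general idea behind "RoPE scaling", using simple linear position interpolation. This is purely illustrative, not the implementation in the linked PR, and Phi-3-128k itself uses per-dimension ("su"/LongRoPE) factors rather than a single scale:

```python
# Illustrative only: linear position interpolation for RoPE. Positions are
# compressed by `scale` so a rotary table tuned for the pretraining context
# length covers a longer window.
import torch


def scaled_rope_angles(
    seq_len: int,
    head_dim: int,
    base: float = 10_000.0,
    scale: float = 4.0,  # e.g. targeting 4x the original context window
) -> torch.Tensor:
    """Return [seq_len, head_dim // 2] rotary angles with interpolated positions."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float() / scale  # linear position interpolation
    return torch.outer(positions, inv_freq)  # angles used to build the cos/sin caches
```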