dcsuka opened this issue 5 months ago
Hey @dcsuka, thanks for raising this - it's a great motivator for adding some new functionality in torchtune. We're working on this at the moment by implementing RoPE scaling (see the PR linked above) - feel free to follow along there!
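For anyone landing here: by "RoPE scaling" we mean remapping positions beyond the pretraining window back into the range the rotary embeddings were trained on. A minimal, generic sketch of linear position interpolation, purely illustrative and not the API in the linked PR:

```python
import torch

def build_rope_cache(dim: int, max_seq_len: int, base: float = 10_000.0,
                     scaling_factor: float = 1.0) -> torch.Tensor:
    """Toy RoPE cache with linear position interpolation ("RoPE scaling").

    scaling_factor > 1 squeezes positions back into the range the model was
    pretrained on, e.g. 4.0 to stretch an 8k-trained model toward 32k.
    This is a generic illustration, not torchtune's implementation.
    """
    # Inverse frequencies for each pair of rotary dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # Linear interpolation: divide positions by the scaling factor so that
    # position 32_000 is embedded where position 8_000 used to be.
    positions = torch.arange(max_seq_len).float() / scaling_factor
    freqs = torch.outer(positions, inv_freq)  # (max_seq_len, dim // 2)
    return torch.stack([freqs.cos(), freqs.sin()], dim=-1)

cache = build_rope_cache(dim=128, max_seq_len=32_768, base=500_000.0, scaling_factor=4.0)
print(cache.shape)  # torch.Size([32768, 64, 2])
```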
All of the models currently supported by torchtune have relatively short context lengths (<32k). I have had long-context success with torchtune by finetuning gradientai/Llama-3-8B-Instruct-Gradient-1048k and simply raising the default max_seq_len for llama3 in the torchtune model scripts via an editable install.
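For anyone trying the same thing: instead of an editable install, the override can live in a small custom builder module that the recipe YAML points at. A rough sketch, assuming torchtune's lower-level `llama3` component builder exposes `max_seq_len` (the module name and the exact argument values are illustrative; check the signature in your installed version):

```python
# my_models.py -- hypothetical module a torchtune recipe config can point at
from torchtune.models.llama3 import llama3
from torchtune.modules import TransformerDecoder

def llama3_8b_long_ctx(max_seq_len: int = 262_144) -> TransformerDecoder:
    """Llama-3-8B with a larger max_seq_len, e.g. for the Gradient 1048k checkpoint.

    Mirrors the stock llama3_8b builder but overrides max_seq_len (the default
    here is just an example), so no editable install is needed to change it.
    """
    return llama3(
        vocab_size=128_256,
        num_layers=32,
        num_heads=32,
        num_kv_heads=8,
        embed_dim=4096,
        intermediate_dim=14_336,
        max_seq_len=max_seq_len,
        rope_base=500_000,
    )

# In the recipe YAML (illustrative):
#   model:
#     _component_: my_models.llama3_8b_long_ctx
#     max_seq_len: 262144
```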
However, I would also like to finetune a smaller long-context model, namely Phi-3-mini-128k-instruct. Applying the same strategy of raising the permitted max_seq_len in the model's builder scripts produces training losses that are orders of magnitude higher than with Llama on a simple task, and the losses never decrease. This leads me to believe that some less superficial implementation changes are needed for Phi-3-mini-128k-instruct. Does anybody know what those changes might be?
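For context on why I suspect deeper changes are needed: the 128k Phi-3 checkpoints don't just declare a larger context window; their config carries LongRoPE-style per-dimension RoPE scaling factors relative to a much shorter original context, which a plain RoPE implementation with a bumped max_seq_len wouldn't reproduce. A quick way to see this (exact keys depend on the checkpoint revision and your transformers version):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct", trust_remote_code=True
)
# The 128k variant ships per-dimension "long_factor"/"short_factor" scaling
# factors (the scaling type has been labeled "su" or "longrope" depending on
# revision), plus an original_max_position_embeddings the scaling is relative to.
print(cfg.max_position_embeddings)
print(getattr(cfg, "original_max_position_embeddings", None))
print(cfg.rope_scaling.keys() if cfg.rope_scaling else None)
```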