dcsuka opened this issue 5 months ago
Hey @dcsuka, thanks for raising this - it's a great motivator for adding some new functionality in torchtune. We're working on this at the moment by implementing RoPE scaling (see the PR linked above) - feel free to follow along there!
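For anyone landing here: by "RoPE scaling" we mean remapping positions beyond the pretraining window back into the range the rotary embeddings were trained on. A minimal, generic sketch of linear position interpolation, purely illustrative and not the API in the linked PR:

```python
import torch

def build_rope_cache(dim: int, max_seq_len: int, base: float = 10_000.0,
                     scaling_factor: float = 1.0) -> torch.Tensor:
    """Toy RoPE cache with linear position interpolation ("RoPE scaling").

    scaling_factor > 1 squeezes positions back into the range the model was
    pretrained on, e.g. 4.0 to stretch an 8k-trained model toward 32k.
    This is a generic illustration, not torchtune's implementation.
    """
    # Inverse frequencies for each pair of rotary dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # Linear interpolation: divide positions by the scaling factor so that
    # position 32_000 is embedded where position 8_000 used to be.
    positions = torch.arange(max_seq_len).float() / scaling_factor
    freqs = torch.outer(positions, inv_freq)  # (max_seq_len, dim // 2)
    return torch.stack([freqs.cos(), freqs.sin()], dim=-1)

cache = build_rope_cache(dim=128, max_seq_len=32_768, base=500_000.0, scaling_factor=4.0)
print(cache.shape)  # torch.Size([32768, 64, 2])
```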
All of the models currently supported by torchtune have relatively short context lengths (<32k). I have had long-context success with torchtune by finetuning gradientai/Llama-3-8B-Instruct-Gradient-1048k and simply raising the default max_seq_len for llama3 in the torchtune model scripts via an editable install.
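For anyone trying the same thing: instead of an editable install, the override can live in a small custom builder module that the recipe YAML points at. A rough sketch, assuming torchtune's lower-level `llama3` component builder exposes `max_seq_len` (the module name and the exact argument values are illustrative; check the signature in your installed version):

```python
# my_models.py -- hypothetical module a torchtune recipe config can point at
from torchtune.models.llama3 import llama3
from torchtune.modules import TransformerDecoder

def llama3_8b_long_ctx(max_seq_len: int = 262_144) -> TransformerDecoder:
    """Llama-3-8B with a larger max_seq_len, e.g. for the Gradient 1048k checkpoint.

    Mirrors the stock llama3_8b builder but overrides max_seq_len (the default
    here is just an example), so no editable install is needed to change it.
    """
    return llama3(
        vocab_size=128_256,
        num_layers=32,
        num_heads=32,
        num_kv_heads=8,
        embed_dim=4096,
        intermediate_dim=14_336,
        max_seq_len=max_seq_len,
        rope_base=500_000,
    )

# In the recipe YAML (illustrative):
#   model:
#     _component_: my_models.llama3_8b_long_ctx
#     max_seq_len: 262144
```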
However, I would also like to finetune a smaller long-context model, namely Phi-3-mini-128k-instruct. Applying the same strategy of raising the permitted max_seq_len in the model's builder scripts produces training losses that are orders of magnitude higher than with Llama on a simple task, and the losses never decrease. This leads me to believe that some less superficial implementation changes are needed for Phi-3-mini-128k-instruct. Does anybody know what those changes might be?
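For context on why I suspect deeper changes are needed: the 128k Phi-3 checkpoints don't just declare a larger context window; their config carries LongRoPE-style per-dimension RoPE scaling factors relative to a much shorter original context, which a plain RoPE implementation with a bumped max_seq_len wouldn't reproduce. A quick way to see this (exact keys depend on the checkpoint revision and your transformers version):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct", trust_remote_code=True
)
# The 128k variant ships per-dimension "long_factor"/"short_factor" scaling
# factors (the scaling type has been labeled "su" or "longrope" depending on
# revision), plus an original_max_position_embeddings the scaling is relative to.
print(cfg.max_position_embeddings)
print(getattr(cfg, "original_max_position_embeddings", None))
print(cfg.rope_scaling.keys() if cfg.rope_scaling else None)
```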