Closed: junzhang-zj closed this issue 1 month ago.
Very hard to say from this information alone. I assume you target the same layers for both, so `print_trainable_parameters` should give you (almost) the same values? Perhaps Llama 3 works better with different hyperparameters, but I haven't tested it myself.
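For reference, a minimal sketch of how that check can look with a standard transformers + PEFT setup (the model IDs and LoRA settings below are placeholders, not values taken from this issue):

```python
# Sketch: build the same LoRA adapter on both base models and compare the
# trainable-parameter counts. Model IDs and LoRA settings are placeholders.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

for model_id in ["meta-llama/Llama-2-70b-hf", "meta-llama/Llama-3.1-70B"]:
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    peft_model = get_peft_model(model, lora_config)
    print(model_id)
    peft_model.print_trainable_parameters()  # counts should be (almost) identical
```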
Thanks for your help. The target layers are the same; I will try other hyperparameters.
The problem was the pre-saved dataset, which had been processed with the LLaMA-2 tokenizer.
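For anyone hitting the same issue, a minimal sketch of tokenizing the raw data with the tokenizer that matches the base model, instead of reusing a dataset pre-tokenized with the LLaMA-2 tokenizer (the data file, text column, and max length are placeholders):

```python
# Sketch: tokenize the raw dataset with the tokenizer of the model being
# fine-tuned, rather than reusing token IDs saved with the LLaMA-2 tokenizer.
# The data file, text column, and max length are placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.1-70B"  # or the LLaMA-2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

raw_dataset = load_dataset("json", data_files="train.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw_dataset.map(tokenize, batched=True, remove_columns=["text"])
tokenized.save_to_disk("tokenized-llama-3.1")  # keep one cache per tokenizer
```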
When I try to fine-tune both LLaMA-2-70B and LLaMA-3.1-70B with LoRA using the same code, LLaMA-3.1 seems to have an unusual loss landscape. Is there anything I should be aware of?
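For context, a shared fine-tuning setup of this kind might look roughly like the sketch below, assuming a standard transformers Trainer pipeline; this is not the code from this issue, and the model IDs, LoRA settings, hyperparameters, and dataset object are all placeholders:

```python
# Sketch of a LoRA fine-tuning function shared between the two base models.
# Not the reporter's code: model IDs, LoRA settings, hyperparameters, and the
# dataset object are placeholders.
import torch
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

def finetune(model_id, tokenized_dataset, output_dir):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)

    args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-4,
        num_train_epochs=1,
        logging_steps=10,
        bf16=True,
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized_dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()  # compare the logged loss curves between the two runs
```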