princeton-nlp / SimPO

SimPO: Simple Preference Optimization with a Reference-Free Reward
MIT License

Unable to reproduce the results of SFT #27

Closed yujiaw98 closed 1 month ago

yujiaw98 commented 2 months ago

Hi, thanks again for the interesting work.

I followed the hyperparameter settings for SFT outlined in the paper (learning rate of 2e-5, batch size of 128, and cosine learning rate scheduling), but I am still unable to train an SFT model that achieves similar evaluation results as those reported in the SimPO paper. For AlpacaEval2.0, my SFT model achieves an LC of 4.80 and a WR of 2.89. Could you provide more details about the SFT training process?
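
For reference, that setup corresponds to something like the following (a minimal sketch using plain `transformers` `TrainingArguments`; only the learning rate, the effective batch size of 128, and the cosine schedule come from the paper — the GPU/accumulation split and the remaining arguments are my own assumptions):

```python
# Sketch of the SFT hyperparameters described above.
# Only lr=2e-5, effective batch size 128, and the cosine schedule are from the
# paper; the device split and all other arguments here are assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama-3-8b-sft",        # hypothetical output path
    learning_rate=2e-5,                 # from the paper
    lr_scheduler_type="cosine",         # from the paper
    per_device_train_batch_size=8,      # 8 GPUs x 8 per device x 2 accumulation = 128 (assumed split)
    gradient_accumulation_steps=2,
    num_train_epochs=1,                 # assumed
    warmup_ratio=0.1,                   # assumed
    bf16=True,
    logging_steps=10,
)
```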

Thanks!

niravlg commented 2 months ago

Hi @yumeng5, thanks for providing the models! I agree, it would be super useful if you could provide more training details. For instance, could you let us know: 1) the prompt template used for SFT, and 2) the train/val split on UltraChat 200k?

Any more details would also be helpful! Thanks!

xiamengzhou commented 2 months ago

@yujiaw98 @niravlg

We used the same templates for the SFT and PO models. Please find the templates in txt format and jinja format. Let me know if you have further questions!
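
For anyone wiring the templates in: the jinja file can be assigned to the tokenizer's `chat_template` attribute and applied with `apply_chat_template`. A minimal sketch — the template string below is only a placeholder, not the actual file from the repo:

```python
# Sketch: attaching a jinja chat template to a tokenizer and applying it.
# The template string is a placeholder; substitute the .jinja file shared above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{{ '<|' + message['role'] + '|>\n' + message['content'] + '\n' }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|assistant|>\n' }}{% endif %}"
)

messages = [{"role": "user", "content": "What is SimPO?"}]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```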

niravlg commented 2 months ago

Thanks @xiamengzhou! Could you share the recipes you used for the LLAMA3-8B models for SFT, DPO and ORPO? The paper mentions some parameters, but not all.

yumeng5 commented 1 month ago

Hi @yujiaw98 @niravlg

We have added the training scripts for Llama-3 SFT here. The hyperparameters used for DPO can be found here.

Best, Yu

yujiaw98 commented 1 month ago

Hi @yumeng5,

Thanks for providing the training scripts for SFT and DPO. They help a lot!

I have another small question: it seems that for SFT you define a chat_template for Llama 3, but for SimPO a chat_template doesn't appear to be needed. Could you please provide some insight into this, or did I perhaps miss a key part of the code?

Best, Yujia

yumeng5 commented 1 month ago

Hi @yujiaw98

For the Llama3-Base SFT training stage, we create an SFT model by starting from the Llama3-Base model, which does not have a chat_template, so we need to define one manually. For the preference optimization stage, we start from the SFT model, which already has the chat_template, so we don't need to re-define it.
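
In code, the distinction between the two stages looks roughly like this (a sketch; the SFT checkpoint name refers to the public princeton-nlp release, and the exact template contents are omitted):

```python
# Sketch: the base model ships without a chat_template, so it must be set
# manually for SFT; the resulting SFT model already carries one, so the
# preference-optimization stage can reuse it as-is.
from transformers import AutoTokenizer

# SFT stage: start from the base model (no chat template defined).
base_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
print(base_tok.chat_template)        # None -> define it manually
base_tok.chat_template = "..."       # paste the jinja template shared above

# PO stage: start from the SFT checkpoint, whose tokenizer already has it.
sft_tok = AutoTokenizer.from_pretrained("princeton-nlp/Llama-3-Base-8B-SFT")
print(sft_tok.chat_template is not None)  # True -> no need to re-define
```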

Best, Yu

yujiaw98 commented 1 month ago


Thank you for the reply.

Best, Yujia

obangw commented 2 weeks ago

Hello @xiamengzhou, I have a few questions about the SFT process as well. Given that HuggingFaceH4/ultrachat_200k is a multi-turn dialogue dataset, how did you process it into labels? Did you keep the turns interleaved in a single sequence, split them into single-turn examples, or simply use the last assistant response as the target?
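
To make the three options concrete, here is what they would look like on a single ultrachat_200k record (purely an illustration of the question, not the authors' actual preprocessing):

```python
# Illustration of three candidate ways to turn a multi-turn ultrachat_200k
# record into training targets; hypothetical example, not the authors' recipe.
example = {
    "messages": [
        {"role": "user", "content": "first question"},
        {"role": "assistant", "content": "first answer"},
        {"role": "user", "content": "follow-up question"},
        {"role": "assistant", "content": "follow-up answer"},
    ]
}

# (a) interleave: keep the whole conversation as one training sequence
full_conversation = example["messages"]

# (b) split into single-turn (user, assistant) examples
single_turn = [example["messages"][i:i + 2] for i in range(0, len(example["messages"]), 2)]

# (c) last round only: condition on all prior turns, predict the final reply
context, target = example["messages"][:-1], example["messages"][-1]
```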