uclaml / SPIN

The official implementation of Self-Play Fine-Tuning (SPIN)
https://uclaml.github.io/SPIN/
Apache License 2.0
995 stars 89 forks

Some detailed questions regarding SPIN #9

Closed peterjc123 closed 7 months ago

peterjc123 commented 7 months ago

Hi, thanks for open sourcing your work. That's awesome. I have several questions about the details of the implementation.

  1. Why do you use the RMSProp optimizer and a linear schedule instead of the more commonly used AdamW optimizer and cosine schedule?
  2. How do you prepare the data for iteration 0 of SPIN? It seems that generate.sh starts with the data preparation for iteration 1.
  3. Regarding the prompt template, it seems that you used a different template for synthetic data generation than the one used for Zephyr. May I know the reason behind that? What happens if the same prompt is used for both synthetic data generation and SFT training?
  4. For synthetic data generation, the max output length is limited to 256, but during training the max length and the max prompt length are 1024 and 512 respectively. So the synthetic data ends up with more padding than the actual data during training. Is that right? Have you tried extending the max output length for data generation?
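The arithmetic behind question 4 can be made concrete with a short sketch (the variable names mirror the config options mentioned in the question, not actual code in the repo):

```python
# training-side length budget (values from the question above)
max_length = 1024        # total sequence length during training
max_prompt_length = 512  # portion reserved for the prompt

# space left for the response during training
response_budget = max_length - max_prompt_length  # 512 tokens

# generation-side cap used for the synthetic data
gen_max_tokens = 256

# padding a maximally long synthetic response would still leave
extra_padding = response_budget - gen_max_tokens
print(extra_padding)  # 256
```

So even a synthetic response that hits the generation cap fills only half of the response budget seen during training, which is the padding gap the question points at.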

I've also skimmed the code. Here are some potential improvements.

  1. https://github.com/uclaml/SPIN/blob/main/spin/generate_vllm.py#L46 For greedy generation, the temperature should be set to zero according to vLLM.
  2. https://github.com/uclaml/SPIN/blob/main/spin/alignment/trainer.py#L552-L559 Swap the order of cpu() and mean(): calling mean() first runs the reduction on the GPU, so only a scalar is transferred, which speeds up metrics collection on slower CPUs.
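The second suggestion can be illustrated with a small PyTorch sketch (the tensor here is an illustrative stand-in, not the trainer's actual metrics): reducing on-device first means only a single scalar crosses the device boundary.

```python
import torch

# stand-in for a per-batch metric tensor (illustrative values)
losses = torch.tensor([0.5, 1.5, 2.0, 4.0])

# slower pattern: move the whole tensor to the CPU, then reduce there
slow = losses.cpu().mean()

# suggested pattern: reduce first (on the GPU when one is present),
# then transfer just the resulting scalar
fast = losses.mean().cpu()

assert torch.allclose(slow, fast)  # same value either way
print(fast.item())  # 2.0
```

Both orderings are numerically equivalent; the difference is only how much data moves between devices per logging step.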
yihedeng9 commented 7 months ago

Hello, thank you for your interest and detailed comments.

  1. Our initial configuration (scheduler, optimizer, and beta) follows the DPO training setup for Zephyr as outlined in the Alignment Handbook (https://github.com/huggingface/alignment-handbook/blob/61a11a5c7d66179ed0a930b0dd12e532fce701dd/recipes/zephyr-7b-beta/dpo/config_full.yaml). While their latest update suggests using AdamW and a cosine scheduler, the previous configuration was sufficiently effective for our experiments. Nonetheless, we do recommend exploring these settings further for your specific use cases. In our experience, the two sets of configs achieve similar final performance.
  2. For iterations, it depends only on which model you use for data generation. In our experimental setting, alignment-handbook/zephyr-7b-sft-full generates the data at iteration 0. To replicate our 50k data generation, simply use UCLA-AGI/SPIN_iter0. The generation program primarily uses the provided questions/prompts for model generation. For other data, just convert it into the same format.
  3. We adopted a prompt template similar to Alpaca's due to its wide usage. Investigating how different prompts affect SPIN's performance is indeed a very interesting direction.
  4. The maximum output length is adjustable. We set it to 256 for quicker generation demonstrations, but 512 and 1024 are also viable options. Similarly, the temperature parameter can be modified for efficiency.
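As a sketch of "convert the data into the same format" from point 2: assuming each source record carries its question under some field, a small converter could map it to a prompt list. The field names and output schema below are hypothetical stand-ins, not the repo's actual schema; check the generate scripts for the real expected fields.

```python
import json

def to_spin_prompts(records, prompt_key="instruction"):
    """Map arbitrary records to a list of {"prompt": ...} dicts.

    "prompt" is a hypothetical output key used here for illustration;
    records without a non-empty prompt field are dropped.
    """
    return [{"prompt": r[prompt_key]} for r in records if r.get(prompt_key)]

records = [
    {"instruction": "Explain self-play fine-tuning."},
    {"instruction": ""},  # empty prompts are dropped
]
converted = to_spin_prompts(records)
print(json.dumps(converted))
```

The idea is simply that the generation program only consumes prompts, so any dataset reduced to that shape can be plugged in.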

Thank you very much for the suggestions! We will make changes accordingly.

peterjc123 commented 7 months ago

@yihedeng9 Thanks for your prompt answer.

> 2. For iterations, it depends only on which model you use for data generation. In our experimental setting, alignment-handbook/zephyr-7b-sft-full generates the data at iteration 0. To replicate our 50k data generation, simply use UCLA-AGI/SPIN_iter0. The generation program primarily uses the provided questions/prompts for model generation. For other data, just convert it into the same format.

So, you are using Alpaca's prompt template for data generation with the model alignment-handbook/zephyr-7b-sft-full to produce UCLA-AGI/SPIN_iter0, right? I suspect that may lead to worse initial data compared with data generated using Zephyr's prompt template. I just want to confirm whether this is a deliberate trick, or whether it is fine to use a prompt template different from the original model's.

> The maximum output length is adjustable. We set it to 256 for quicker generation demonstrations, but 512 and 1024 are also viable options. Similarly, the temperature parameter can be modified for efficiency.

Yes, but does this affect the performance of the models?

> 1. Our initial configuration (scheduler, optimizer, and beta) follows the DPO training setup for Zephyr as outlined in the Alignment Handbook (https://github.com/huggingface/alignment-handbook/blob/61a11a5c7d66179ed0a930b0dd12e532fce701dd/recipes/zephyr-7b-beta/dpo/config_full.yaml). While their latest update suggests using AdamW and a cosine scheduler, the previous configuration was sufficiently effective for our experiments. Nonetheless, we do recommend exploring these settings further for your specific use cases. In our experience, the two sets of configs achieve similar final performance.

Yes, the results on alignment-handbook/zephyr-7b-sft-full are similar to the scores in the paper with cosine and AdamW. But it does not work as well with another model and dataset. Could you please give me some advice (e.g. dataset selection, prompt usage, etc.)?

yihedeng9 commented 7 months ago
  1. The prompt template is not a trick but the default template we used. To reproduce our results, use the template we provided; but you can definitely try other templates for exploration.
  2. Not necessarily; we used both 256 and 512 for generation and both are fine.
  3. By "our experience", I mean that the final performance of SPIN under the two different configs is similar on some other models we tried. I'm not sure about the specific model and dataset of your interest, but you can always start from our given settings and explore the other parameters if needed.
peterjc123 commented 7 months ago

Thanks. I suppose data is the main problem. I will need to find some high-quality datasets like ultrachat-200k in my language.