uclaml / SPPO

The official implementation of Self-Play Preference Optimization (SPPO)
https://uclaml.github.io/SPPO/
Apache License 2.0

Dataset used and results in Gemma-2-9B results #12

Closed hodachi-axcxept closed 4 months ago

hodachi-axcxept commented 4 months ago

Thanks for the great project. I am so impressed with your research that I have tried it many times. However, my results with Gemma-2-9B are very different from yours.

Even my Iter-3 score was lower than that of the original Gemma-2-9B-it.

My question is: which dataset did you use? Was it UCLA-AGI/data-mistral-7b-instruct-sppo-iter[x]?

As far as I know, these datasets are based on UltraFeedback, and I used the GitHub code as is.

Sincerely, Kazuya

hodachi-axcxept commented 4 months ago

Incidentally, another prompt dataset that I prepared scored better. But the Iter3 Gemma-2-9B you provided scored better still.

So I doubt that your result was achieved with 3 iterations of about 20,000 examples.

angelahzyuan commented 4 months ago

@hodachi-axcxept , the UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3 was achieved by running the script https://github.com/uclaml/SPPO/blob/main/run_sppo_gemma-2.sh. This is the exact script we used (batch size and accumulation steps might differ for different models, all other hyperparameters are the same).

The prompt sets used are UCLA-AGI/data-mistral-7b-instruct-sppo-iter${i}

hodachi-axcxept commented 4 months ago

@angelahzyuan Thanks for the thoughtful reply. That's what I needed to hear. So it looks like the results are not consistent between runs: I used the same shell script and the same dataset.

angelahzyuan commented 4 months ago

@hodachi-axcxept We've never seen that happen. The same settings have worked for all the models we tried, and the results are pretty stable on our end.

Please share more details, e.g. your results for all 3 iterations, the config and prompt you used during evaluation, and the machine you used, so that we can discuss potential causes.

hodachi-axcxept commented 4 months ago

@angelahzyuan Thanks for checking. If anything, I believe in your research. That is why I am running many experiments to verify it and to establish it as a technique for building lightweight models in the future. So I wanted to clarify whether the problem lies in my methodology or in something I am not aware of.

I will try again, but the configuration I used is as follows.

Machine and environment (otherwise as installed on Github):
A100×8 (Runpod) 80GB
pytorch 2.3+py3.10+cuda11.9.0
ubuntu22.04
VLLM (https://github.com/vllm-project/vllm/commit/80ca1e6a3a28a0373dc00c5b4fe956c16de952fa)

Contents of run_sppo_gemma-2.sh.

#!/bin/bash

iter_num=3
for i in $(seq 1 $iter_num); do
    if [ "$i" -eq 1 ]; then
        MODEL="google/gemma-2-9b-it"
    else
        MODEL=$OUTPUT_DIR
    fi
    OUTPUT_DIR="checkpoints/Gemma-2-9B-SPPO-It-Iter${i}"
    PROMPT="UCLA-AGI/data-mistral-7b-instruct-sppo-iter${i}"
    OUT="data-gemma-2-9b-it-sppo-iter${i}"
    echo "running epoch $i"
    bash scripts/generate.sh --model $MODEL --prompt $PROMPT --out_path $OUT
    bash scripts/pipeline.sh --model $MODEL --iter $i --dataset "synthetic_data_gemma-2-9b-it-sppo-iter${i}_score" --output_dir $OUTPUT_DIR --num 1 --batch_size 2 --accumulate 2
done

If left unchanged, the upload step pushes the data to the UCLA-AGI organization, so I added --org "HODACHI" to the compute_prob.py call in generate.sh.

Other than that, nothing else has been changed.
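For readers tracing the loop in run_sppo_gemma-2.sh, the following is a minimal Python sketch of how the names chain from one iteration to the next. It only mirrors the variable assignments quoted in the script above and runs nothing; the function name `iteration_plan` and the dictionary keys are hypothetical and not part of the repository.

```python
def iteration_plan(iter_num=3, base_model="google/gemma-2-9b-it"):
    """Mirror the variable assignments of the shell loop without running it.

    Hypothetical helper for illustration only; it is not part of the SPPO repo.
    """
    plan = []
    output_dir = None
    for i in range(1, iter_num + 1):
        # Iteration 1 starts from the base model; later iterations start
        # from the checkpoint written by the previous iteration.
        model = base_model if i == 1 else output_dir
        output_dir = f"checkpoints/Gemma-2-9B-SPPO-It-Iter{i}"
        plan.append({
            "iter": i,
            "model": model,
            "prompt": f"UCLA-AGI/data-mistral-7b-instruct-sppo-iter{i}",
            "out": f"data-gemma-2-9b-it-sppo-iter{i}",
            "dataset": f"synthetic_data_gemma-2-9b-it-sppo-iter{i}_score",
            "output_dir": output_dir,
        })
    return plan


if __name__ == "__main__":
    for step in iteration_plan():
        print(step["iter"], step["model"], "->", step["output_dir"])
```

The point of the sketch is that each iteration reads its prompts from a different UCLA-AGI prompt set while the model path is taken from the previous iteration's output directory.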

Note that a memory violation occurs with the latest vllm. Also, the data I generated with UCLA-AGI's PROMPT was not uploaded to HuggingFace, so it is not available there.

If you want to use my dataset instead, it is listed below.
HODACHI/mine_synthetic_data_gemma-2-9b-it-sppo-iter1_score
HODACHI/mine_data_gemma-2-9b-it-sppo-iter1_generated
HODACHI/mine_data-gemma-2-9b-it-sppo-iter2_generated
HODACHI/mine_synthetic_data_gemma-2-9b-it-sppo-iter2_score
HODACHI/mine_data-gemma-2-9b-it-sppo-iter3_generated
HODACHI/mine_synthetic_data_gemma-2-9b-it-sppo-iter3_score

hodachi-axcxept commented 4 months ago

The score was better when I did it with my data set described above than with the data set you had prepared. However, my results were lower than your iter3.

angelahzyuan commented 4 months ago

@hodachi-axcxept The dataset provided in "UCLA-AGI/data-mistral-7b-instruct-sppo-iter${i}" should not be used for training. It should only be used as a source of prompts. You are supposed to generate a new dataset for training, as you seem to have done above.

hodachi-axcxept commented 4 months ago

@angelahzyuan Thank you for your response. My understanding is that if "UCLA-AGI/data-mistral-7b-instruct-sppo-iter[i]" is specified, responses are generated from those prompts, ranked, and automatically assembled into a dataset. Training is then performed on that dataset; that is what I saw when I traced the sh file. I used that sh file as is, so I don't think this affects the results, does it?

hodachi-axcxept commented 4 months ago

For example, responses_0_0.json looks like the sample below. As I understand it, 5 of these are generated, and after they are ranked, the training dataset is built from the best one, right?

0"latex \documentclass{article} \usepackage{tabular} \usepackage{amssymb} \begin{document} \section{Results} This section presents the findings of the study comparing the two groups. \subsection{Group Comparison} A significant difference was observed between the two groups on the primary outcome measure. This difference is summarized in the table below: \begin{table}[h] \centering \begin{tabular}{|l|c|c|c|} \hline \textbf{Variable} & \textbf{Group 1} & \textbf{Group 2} & \textbf{p-value} \\ \hline Outcome Measure & Mean $\pm$ SD & Mean $\pm$ SD & \\ \hline \end{tabular} \caption{Comparison of Outcome Measure between Group 1 and Group 2} \label{tab:group_comparison} \end{table} \subsection{Treatment Effectiveness} The treatment was found to be effective for a wide range of conditions. \subsection{Statistical Analysis} Statistical analysis was performed using a [insert specific statistical test, e.g., t-test, ANOVA]. The [insert specific software, e.g., SPSS, R] software package was used for all analyses. \begin{lstlisting}[language=R] # Example R code snippet t.test(Outcome_Measure ~ Group, data = study_data) \end{lstlisting} \end{document} Please Note: You will need to replace the placeholders in the table and code snippet with the actual data and statistical details from your study. You should also include a description of the outcome measure and the specific conditions for which the treatment was found effective. This LaTeX code provides a basic structure for presenting your results. You can customize it further to meet the specific requirements of your publication. "
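The "generate several responses, rank them, keep the best" idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's actual logic: the scores are hypothetical numbers standing in for whatever the ranking step produces, the function name `build_preference_pair` is invented here, and pairing the best response with the worst one as "chosen" and "rejected" is an assumption borrowed from common preference-optimization practice.

```python
def build_preference_pair(prompt, responses, scores):
    """Pick the highest- and lowest-scored responses for one prompt.

    Hypothetical helper: `scores` stands in for the output of a ranking
    step, with a higher score meaning a better response.
    """
    if len(responses) != len(scores):
        raise ValueError("responses and scores must have the same length")
    if len(responses) < 2:
        raise ValueError("need at least two responses to form a pair")

    # Indices sorted from best to worst score.
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    best, worst = ranked[0], ranked[-1]
    return {
        "prompt": prompt,
        "chosen": responses[best],
        "rejected": responses[worst],
    }


if __name__ == "__main__":
    pair = build_preference_pair(
        "Write a LaTeX results section.",
        ["resp A", "resp B", "resp C", "resp D", "resp E"],
        [0.10, 0.90, 0.30, 0.05, 0.50],  # made-up scores for illustration
    )
    print(pair["chosen"], "/", pair["rejected"])
```

Under this reading, the prompt set only supplies the prompts, and the training data depends entirely on what the current model generates and how those generations are ranked.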

angelahzyuan commented 4 months ago

@hodachi-axcxept I haven't checked the other details yet. What I see here is that your batch size is different from ours. If you use batch size 2, the accumulation steps should be 4. That might be a typo on our side, in case batch size 4 is too large for an A100 80G.
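The arithmetic behind this advice can be sketched as below. It assumes that --batch_size is a per-GPU value and that training is data-parallel across all 8 A100s mentioned earlier in the thread; neither assumption is confirmed here. The 4 × 2 setting is only inferred from the remark about batch size 4, and `effective_batch_size` is a hypothetical helper name.

```python
def effective_batch_size(per_device_batch, accumulation_steps, num_gpus):
    """Samples contributing to one optimizer step under data parallelism."""
    return per_device_batch * accumulation_steps * num_gpus


NUM_GPUS = 8  # assumption: all 8 A100s from the reported setup are used

# Setting quoted in the issue: --batch_size 2 --accumulate 2
reported = effective_batch_size(2, 2, NUM_GPUS)

# Setting suggested in the reply: batch size 2 with 4 accumulation steps
suggested = effective_batch_size(2, 4, NUM_GPUS)

# Inferred alternative: batch size 4 with 2 accumulation steps
inferred = effective_batch_size(4, 2, NUM_GPUS)

if __name__ == "__main__":
    print(reported, suggested, inferred)
```

With these assumptions, the quoted setting gives an effective batch of 32 while both other settings give 64, so the run in question would take optimizer steps on half as many samples at the same learning rate.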

hodachi-axcxept commented 4 months ago

@angelahzyuan I see. So that may be why training did not converge. Thanks for the tip. I'll give it a try.

hodachi-axcxept commented 4 months ago

[Tip] Gemma-2-9B stops working when the latest vllm is installed.
[Workaround] Install flashinfer and enable it (via export). However, parallel processing fails after flashinfer is enabled (memory consistency error).

Arnav0400 commented 3 months ago

Hi @angelahzyuan, thanks for your work! I would like to know whether the number of epochs per iteration is 1 or 18 in the case of gemma2. Also, what is the motivation for that choice?