uclaml / SPPO

The official implementation of Self-Play Preference Optimization (SPPO)
https://uclaml.github.io/SPPO/
Apache License 2.0

Is it possible to run llama 3-70B and/or mixtral 8x22b through this process? #1

Open RandomInternetPreson opened 4 months ago

RandomInternetPreson commented 4 months ago

I'm running the Llama-3-Instruct-8B-SPPO-Iter3 model locally and am very impressed by the quality improvement over the original model. I can't help but wonder what the results would be if this finetuning process were run on larger models.

Is it possible to run the code on these larger models, or are the smaller versions too different from their larger counterparts, requiring a rework of the training scripts?

Thank you for what you have contributed, this is great stuff!

angelahzyuan commented 4 months ago

Thank you! We've trained a slightly larger model (UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3) which achieved a length-controlled (LC) win rate of 53.27, using the same parameters and scripts.

As long as your GPU has sufficient VRAM, the training script should perform well. We will keep you updated as we proceed to train larger models.
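
For anyone weighing this up, here is a minimal back-of-the-envelope sketch (an illustration, not from the maintainers) of why VRAM is the binding constraint. It assumes bf16 weights at 2 bytes per parameter and the commonly cited total parameter counts; actual fine-tuning memory is several times higher once gradients, optimizer states, and activations are included.

```python
# Rough, illustrative VRAM estimate: memory to hold the model weights
# alone in bf16 (2 bytes per parameter). Full fine-tuning needs several
# times more for gradients, optimizer states, and activations, so treat
# these numbers as hard lower bounds.

def weight_memory_gib(num_params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate GiB needed just to store the weights."""
    return num_params_billion * 1e9 * bytes_per_param / 1024**3

# Commonly cited total parameter counts (Mixtral-8x22B is ~141B total, ~39B active).
for name, params_b in [
    ("Llama-3-8B", 8),
    ("Gemma-2-9B", 9),
    ("Llama-3-70B", 70),
    ("Mixtral-8x22B", 141),
]:
    print(f"{name}: ~{weight_memory_gib(params_b):.0f} GiB of bf16 weights")
```

By this measure, Llama-3-70B needs roughly 130 GiB just for the weights, so some form of multi-GPU sharding (e.g. DeepSpeed ZeRO or FSDP) would be needed before the training scripts could even load it, independent of any changes to the SPPO pipeline itself.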