magpie-align / magpie

Official repository for "Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing". Your efficient and high-quality synthetic data generation pipeline!
https://magpie-align.github.io/
MIT License

Can this approach generate data for multi-turn conversations? #27

Closed. Tendo33 closed this issue 3 weeks ago.

Tendo33 commented 3 weeks ago

I've noticed that the datasets you posted consist of single-turn dialogues. Will leaving out multi-turn dialogue data affect the model's final performance? Looking forward to your reply.

fly-dust commented 3 weeks ago

We also have some multi-turn data available, e.g., https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-MT-300K-v0.1. Ideally, multi-turn datasets help LLMs perform better. However, our empirical analysis found that the performance gain from multi-turn data is marginal; in other words, single-turn alignment data alone can already make LLMs good at multi-turn dialogue.
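
For reference, a minimal sketch of pulling that multi-turn dataset with the Hugging Face `datasets` library (the dataset id is taken from the link above; the column layout is an assumption and may differ, so inspect the printed features first):

```python
# Sketch: download and inspect the multi-turn Magpie dataset linked above.
# Assumes the `datasets` library is installed (pip install datasets).
from datasets import load_dataset

ds = load_dataset("Magpie-Align/Magpie-Pro-MT-300K-v0.1", split="train")
print(ds)      # show the features and number of rows
print(ds[0])   # print the first multi-turn example
```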

Tendo33 commented 3 weeks ago

Thanks for the reply. BTW, I noticed in the scripts that the qwen2 series doesn't seem to have prompts specifically designed for tasks like "translation", "code", and "math". Is this because those prompts haven't been tested yet? Will they be added later on?

fly-dust commented 3 weeks ago

Indeed. We haven't tested "translation", "code", and "math" on the Qwen family. I think you can easily modify the config to support these tasks~
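
For anyone attempting this, here is a rough sketch of what such a modification might look like. The dictionary name, task keys, and ChatML-style template strings are illustrative assumptions only, not the repository's actual config schema; check the existing model config entries in this repo for the exact format before copying:

```python
# Hypothetical sketch: task-specific pre-query templates for Qwen2.
# Each template injects a task-steering system prompt before the empty
# user turn, mirroring the idea used for other task-specific configs.
# Names and strings below are assumptions for illustration.
QWEN2_TASK_TEMPLATES = {
    "math": (
        "<|im_start|>system\nYou are an AI assistant designed to provide "
        "helpful, step-by-step guidance on math problems.<|im_end|>\n"
        "<|im_start|>user\n"
    ),
    "code": (
        "<|im_start|>system\nYou are an AI assistant designed to provide "
        "helpful, step-by-step guidance on coding problems.<|im_end|>\n"
        "<|im_start|>user\n"
    ),
    "translation": (
        "<|im_start|>system\nYou are an AI assistant designed to help "
        "with translation tasks.<|im_end|>\n"
        "<|im_start|>user\n"
    ),
}

# Example: select the template for the desired task before generation.
print(QWEN2_TASK_TEMPLATES["math"])
```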