magpie-align / magpie

Official repository for "Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing". Your efficient and high-quality synthetic data generation pipeline!
https://magpie-align.github.io/
MIT License
418 stars 43 forks

The evaluation of input_quality #11

Closed wwwadx closed 2 months ago

wwwadx commented 2 months ago

Thanks to the authors for open-sourcing such great work! I'm a little confused about the input_quality filtering: why should input quality be measured? SFT data needs some bad instructions, since users don't always write clear and coherent instructions. So I guess that if a model is trained only on data with good input quality, its robustness will be hurt?

fly-dust commented 2 months ago

Hi, thanks for your valuable suggestions! You are right that if humans write the instructions, we should not filter out unclear user instructions. However, when we were generating the synthetic dataset at high temperatures, we found that the model would occasionally output content that makes no sense (e.g., a message consisting of multiple languages and/or symbols we cannot understand). That is why we apply the quality filter.

Empirically, we also found that applying a quality filter improves the model's performance. We also provide the raw data here, so feel free to design your own filter!
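
For anyone who wants to try a custom filter on the raw data, here is a minimal sketch of one way it could look. It is not the repository's own filtering code. It assumes each record is a dict with an `input_quality` label, and that the labels form the ordered scale in `QUALITY_ORDER`; both the label set and the helper name `filter_by_input_quality` are assumptions made for illustration, so check them against the actual raw data first.

```python
# Assumed label scale, ordered from worst to best (hypothetical; verify
# against the labels that actually appear in the raw data).
QUALITY_ORDER = ["very poor", "poor", "average", "good", "excellent"]


def filter_by_input_quality(records, min_quality="average", field="input_quality"):
    """Keep records whose quality label is at least `min_quality`.

    Records with a missing or unrecognized label are dropped.
    """
    threshold = QUALITY_ORDER.index(min_quality)
    kept = []
    for record in records:
        label = record.get(field)
        if label in QUALITY_ORDER and QUALITY_ORDER.index(label) >= threshold:
            kept.append(record)
    return kept


# Toy records standing in for the raw data (made up for this example).
raw = [
    {"instruction": "Explain how binary search works.", "input_quality": "good"},
    {"instruction": "asd ## 你好 ?? fjk", "input_quality": "very poor"},
    {"instruction": "fix my code pls", "input_quality": "poor"},
]

# A lenient threshold keeps vague but human-like instructions and drops
# only the lowest-rated ones; a strict threshold keeps clear ones only.
lenient = filter_by_input_quality(raw, min_quality="poor")
strict = filter_by_input_quality(raw, min_quality="good")
```

Lowering `min_quality` is one way to address the robustness concern raised above: it keeps imperfect instructions in the training set while still removing the nonsensical generations.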