Are there any duplicates in these datasets?

magpie-align / magpie

Official repository for "Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing". Your efficient and high-quality synthetic data generation pipeline!

https://magpie-align.github.io/

MIT License

Are there any duplicates in these datasets? #23

Closed Tendo33 closed 4 weeks ago

Tendo33 commented 1 month ago

I see you've released four datasets. Are there any duplicates among the contents of these four datasets?

| Model Name | Dataset | Type | Description |
| --- | --- | --- | --- |
| Qwen2 72B Instruct | Magpie-Qwen2-Pro-1M | SFT | 1M raw conversations built with Qwen2 72B Instruct. |
| Qwen2 72B Instruct | Magpie-Qwen2-Pro-300K-Filtered | SFT | Apply a filter and select 300K high quality conversations. |
| Qwen2 72B Instruct | Magpie-Qwen2-Pro-200K-Chinese | SFT | Apply a filter and select 200K high quality Chinese conversations. |
| Qwen2 72B Instruct | Magpie-Qwen2-Pro-200K-English | SFT | Apply a filter and select 200K high quality English conversations. |

fly-dust commented 1 month ago

Magpie-Qwen2-Pro-300K-Filtered, Magpie-Qwen2-Pro-200K-Chinese, and Magpie-Qwen2-Pro-200K-English are all filtered from the 1M raw dataset. There is no overlap between the Chinese and English subsets, but Magpie-Qwen2-Pro-300K-Filtered may contain data from both the Chinese and English subsets.
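If you want to check the overlap yourself, here is a minimal sketch of one way to do it. It is not the filtering code used for Magpie. It assumes each record is a dict with an `instruction` field (a hypothetical field name here; check the data card for the actual column), and it treats two records as duplicates when their whitespace- and case-normalized instructions match. The three small lists are toy stand-ins for the real subsets.

```python
import hashlib
from typing import Dict, Iterable, Set


def instruction_keys(rows: Iterable[Dict[str, str]], field: str = "instruction") -> Set[str]:
    """Return one hash per distinct normalized instruction."""
    keys = set()
    for row in rows:
        # Collapse whitespace and lowercase so trivial differences do not hide a duplicate.
        text = " ".join(row[field].split()).lower()
        keys.add(hashlib.sha256(text.encode("utf-8")).hexdigest())
    return keys


def overlap_count(a: Iterable[Dict[str, str]], b: Iterable[Dict[str, str]],
                  field: str = "instruction") -> int:
    """Count distinct normalized instructions that appear in both collections."""
    return len(instruction_keys(a, field) & instruction_keys(b, field))


# Toy rows standing in for the real subsets (assumed schema, not the actual data).
filtered_300k = [{"instruction": "What is 2 + 2?"}, {"instruction": "你好，请介绍一下自己"}]
chinese_200k = [{"instruction": "你好，请介绍一下自己"}]
english_200k = [{"instruction": "what is  2 + 2?"}, {"instruction": "Name three primes."}]

print(overlap_count(chinese_200k, english_200k))   # 0
print(overlap_count(filtered_300k, chinese_200k))  # 1
print(overlap_count(filtered_300k, english_200k))  # 1
```

Hashing the normalized text keeps memory use modest when comparing large subsets. Exact matching will miss near-duplicates, so treat the counts as a lower bound.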

Tendo33 commented 1 month ago

For data generated by other models, like Llama 3, is the overlap between the datasets similar?

fly-dust commented 1 month ago

Yes. As I recall, I added detailed filter setups to the data card for each filtered dataset. All datasets with "Filtered" or "MT" in their names are filtered from a larger raw dataset.

fly-dust commented 4 weeks ago

I will close this issue as complete since it is not active~