Closed Tendo33 closed 4 weeks ago
Magpie-Qwen2-Pro-300K-Filtered, Magpie-Qwen2-Pro-200K-Chinese, and Magpie-Qwen2-Pro-200K-English are all filtered from the same 1M raw dataset (Magpie-Qwen2-Pro-1M). There is no overlap between the Chinese and English subsets, but Magpie-Qwen2-Pro-300K-Filtered may contain data from both of them.
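If you want to check the overlap yourself, here is a minimal sketch. It assumes each dataset has already been loaded as a list of dicts and that the prompt lives in a field called `instruction`; that field name, the helper names `fingerprint` and `overlap`, and the toy rows are all illustrative assumptions, not taken from the data cards.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash a prompt after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def overlap(a, b, key="instruction"):
    """Return the rows of `b` whose normalized `key` text also appears in `a`.

    `key` is an assumed field name; adjust it to the real column.
    """
    seen = {fingerprint(row[key]) for row in a}
    return [row for row in b if fingerprint(row[key]) in seen]


# Toy stand-ins for the three filtered subsets (hypothetical rows).
chinese = [{"instruction": "请解释什么是快速排序。"}]
english = [{"instruction": "Explain what quicksort is."}]
filtered = [
    {"instruction": "explain what  quicksort is."},
    {"instruction": "请解释什么是快速排序。"},
    {"instruction": "Write a haiku about autumn."},
]

print(len(overlap(chinese, english)))   # Chinese vs. English
print(len(overlap(english, filtered)))  # English vs. 300K-Filtered
print(len(overlap(chinese, filtered)))  # Chinese vs. 300K-Filtered
```

This only catches exact matches up to case and whitespace; near-duplicates would need a fuzzier comparison.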
For data generated by other models, such as Llama 3, do the released datasets overlap in the same way?
Yes. As I recall, I added the detailed filter setup to the data card of each filtered dataset. All datasets with "Filtered" or "MT" in their names are filtered from a larger raw dataset.
I will close this issue as completed since it is no longer active.
I see you've released four datasets. Are there any duplicates among the contents of these four datasets?
| Model Name | Dataset | Type | Description |
| --- | --- | --- | --- |
| Qwen2 72B Instruct | Magpie-Qwen2-Pro-1M | SFT | 1M raw conversations built with Qwen2 72B Instruct. |
| Qwen2 72B Instruct | Magpie-Qwen2-Pro-300K-Filtered | SFT | Apply a filter and select 300K high quality conversations. |
| Qwen2 72B Instruct | Magpie-Qwen2-Pro-200K-Chinese | SFT | Apply a filter and select 200K high quality Chinese conversations. |
| Qwen2 72B Instruct | Magpie-Qwen2-Pro-200K-English | SFT | Apply a filter and select 200K high quality English conversations. |