magpie-align / magpie

Official repository for "Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing". Your efficient and high-quality synthetic data generation pipeline!
https://magpie-align.github.io/
MIT License

Trying to generate Magpie data with the Qwen2.5 model family, but most of the generated instructions are unusable #28

Open WeixuanXiong opened 1 week ago

WeixuanXiong commented 1 week ago

[screenshot of generated instructions]

When generating data with the Qwen2.5 7B model, I found that most of the generated instructions are text fragments like the ones in the screenshot above, with no clear beginning or end. Do other models have this problem as well? If so, how can it be fixed?

Thanks!

fly-dust commented 1 week ago

It should work now. The trick is to use the pre-query template in the tokenizer config.

For the Qwen 2.5 family, it should be: `<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n`


By the way, I found that the 7B model doesn't work so well, but 3B works great. You can use 3B for now.
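For reference, here is a minimal sketch of Magpie-style generation with this pre-query template using vLLM. The model name, sampling settings, and stop token below are illustrative assumptions, not the repository's exact configuration; the point is just that the model is prompted with only the pre-query template, so its completion becomes a synthetic user instruction.

```python
# Sketch: Magpie-style instruction generation with the Qwen2.5 pre-query template.
# Assumptions: model choice, sampling settings, and stop token are illustrative.
from vllm import LLM, SamplingParams

# Pre-query template for the Qwen 2.5 family, as suggested above.
PRE_QUERY_TEMPLATE = (
    "<|im_start|>system\n"
    "You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
)

llm = LLM(model="Qwen/Qwen2.5-3B-Instruct")  # 3B reportedly works better than 7B here
sampling_params = SamplingParams(
    temperature=1.0,      # high temperature encourages diverse instructions
    top_p=1.0,
    max_tokens=512,
    stop=["<|im_end|>"],  # stop once the model closes the user turn
)

# Prompt with only the pre-query template; each completion is a candidate instruction.
outputs = llm.generate([PRE_QUERY_TEMPLATE] * 4, sampling_params)
for out in outputs:
    print(out.outputs[0].text.strip())
```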

WeixuanXiong commented 1 week ago

Thanks!

I found that training a chat model, e.g. Qwen2-Instruct, on this data harms the model's capabilities such as instruction following. So does Magpie data only work well on base models?

fly-dust commented 1 week ago

If you continue aligning a chat model, you should be careful about distribution shift, which might harm model performance. But ideally, if you use Qwen2.5-7B Instruct's responses to fine-tune Qwen2.5-7B Instruct, it should be fine...

WeixuanXiong commented 1 week ago

I've tried using your Magpie_Qwen2_Pro_200K_Chinese_training dataset, which I believe was generated from Qwen2 72B Instruct. I think the alignment data used for the 72B and 7B models is the same or largely overlapping, so the distribution gap between data generated by the 7B model and data generated by the 72B model may not be that large? If I'm wrong, please let me know.

thanks~