microsoft / LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs
https://aka.ms/GeneralAI
MIT License

About pre-training data of distilled qwen model #253

Closed · Jim2016713 closed this issue 1 day ago

Jim2016713 commented 3 weeks ago

Is there any recommended pre-training data for the distilled Qwen model? The method works well on the LLaMA model, but performance on Qwen is very poor. I suspect the problem lies in the pre-training data.

t1101675 commented 1 week ago

I think the Dolly dataset is enough for Qwen. For more high-quality data, you can try ShareGPT. Note: Qwen's vocabulary is much larger than LLaMA's, which means the processed token IDs should be stored as int32, not uint16 as in the codebase.
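
For illustration, here is a minimal sketch of why the dtype matters (the helper and file names below are hypothetical, not from the LMOps codebase): LLaMA's ~32k vocabulary fits in uint16, but Qwen's token IDs can exceed 65535, so a uint16 dump would silently wrap them around.

```python
import numpy as np

# Qwen's vocabulary (roughly 152k tokens) exceeds the uint16 range (0-65535);
# check your tokenizer's actual size before choosing a storage dtype.
QWEN_VOCAB_SIZE = 151_936  # assumption for this sketch

def save_token_ids(token_ids, path, vocab_size):
    """Dump token IDs to a flat binary file with a dtype wide enough for the vocab."""
    # uint16 holds IDs up to 65535 (fine for LLaMA's 32k vocab);
    # larger vocabularies such as Qwen's need int32.
    dtype = np.uint16 if vocab_size <= np.iinfo(np.uint16).max else np.int32
    np.asarray(token_ids, dtype=dtype).tofile(path)
    return dtype

# IDs above 65535 are valid for Qwen but would wrap around under uint16.
ids = [12, 70_000, 151_000]
used = save_token_ids(ids, "train_ids.bin", QWEN_VOCAB_SIZE)
print(used)  # <class 'numpy.int32'>

# Reading back must use the same dtype the writer chose.
restored = np.fromfile("train_ids.bin", dtype=used)
assert restored.tolist() == ids
```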