microsoft / LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs
https://aka.ms/GeneralAI
MIT License

Integrate qwen and qwen_parallel into minillm pipeline #143

Closed · SleepEarlyLiveLong closed 9 months ago

SleepEarlyLiveLong commented 9 months ago

Integrate the open-sourced model Qwen (https://huggingface.co/Qwen) into the MiniLLM distillation pipeline, supporting both non-parallel and parallel training. This PR mainly adds two folders:

transformers/src/transformers/models/qwen/
transformers/src/transformers/models/qwen_parallel/
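
As a rough sketch of how the integrated model could then be loaded, assuming the patched fork wires qwen into the standard Auto classes (Qwen/Qwen-7B is the public Hugging Face checkpoint; trust_remote_code may become unnecessary once the model code ships inside the fork):

```python
# Minimal sketch, assuming the fork registers qwen with the Auto classes.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

# Quick smoke test that the teacher/student model generates text.
inputs = tokenizer("Knowledge distillation is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```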

Tips:

  1. Because the vocab_size of Qwen's tokenizer exceeds 150k, much larger than GPT-2's (~50k) or LLaMA's (32k), I made slight modifications to the tokenizer-related data-processing code to prevent overflow, e.g. widening stored token IDs from uint16 to uint32 (see the dtype sketch after this list);
  2. Added a few lines in transformers/src/transformers/__init__.py and transformers/src/transformers/models/__init__.py to register qwen (see the registration sketch below);
  3. The corresponding training data, such as dolly and roberta/openwebtext, needs to be re-preprocessed with Qwen's tokenizer, which is not done in this PR (see the preprocessing sketch below).
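
On Tip 1: uint16 tops out at 65535, so Qwen token IDs above that would silently wrap if stored that way. A minimal sketch of the kind of dtype widening involved (the helper below is illustrative, not the PR's actual code):

```python
import numpy as np

def id_dtype(vocab_size: int) -> np.dtype:
    # uint16 holds IDs 0..65535: enough for GPT-2 (~50k) and LLaMA (32k),
    # but not for a >150k vocab like Qwen's, which needs uint32.
    return np.dtype(np.uint16) if vocab_size <= 2**16 else np.dtype(np.uint32)

print(id_dtype(50_257))   # uint16 (GPT-2-sized vocab)
print(id_dtype(152_000))  # uint32 (Qwen-sized vocab)
```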
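On Tip 2: new model folders become importable through the package __init__.py files. An illustrative sketch of the kind of lines added (the class names are assumptions, not copied from the PR; the real transformers __init__.py routes exports through a lazy _import_structure dict rather than direct imports):

```python
# transformers/src/transformers/models/__init__.py -- illustrative addition
from . import qwen, qwen_parallel

# transformers/src/transformers/__init__.py -- illustrative addition
# (hypothetical symbol names; the PR's actual exports may differ)
from .models.qwen import QWenConfig, QWenLMHeadModel
from .models.qwen_parallel import QWenParallelLMHeadModel
```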
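On Tip 3: a minimal sketch of re-tokenizing a corpus with Qwen's tokenizer before training, assuming the pipeline consumes flat binary arrays of token IDs (the file names and output format here are illustrative, not the repo's actual preprocessing tooling):

```python
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

# e.g., lines drawn from dolly or openwebtext
texts = ["Explain knowledge distillation in one sentence.", "The quick brown fox."]
ids = [tid for text in texts for tid in tokenizer.encode(text)]

# Store as uint32, not uint16: Qwen IDs can exceed 65535 (see Tip 1).
np.array(ids, dtype=np.uint32).tofile("openwebtext_qwen.bin")
```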
SleepEarlyLiveLong commented 9 months ago

@microsoft-github-policy-service agree

@microsoft-github-policy-service agree I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.