microsoft / LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs
https://aka.ms/GeneralAI

[MiniLLM] Using the Qwen2-72B model as the teacher model for MiniLLM training results in out-of-memory errors #281

Open · shhn1 opened 1 week ago

shhn1 commented 1 week ago

I use Qwen2-72B as the teacher model and Qwen2.5-32B as the student model for training, on 8×80 GB A100 GPUs.

When I load the Qwen2-72B model, I find that the teacher model is not sharded across the GPUs; instead, a complete copy of Qwen2-72B is loaded onto every GPU, resulting in OOM.
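
For illustration, here is a minimal sketch of what I suspect is happening (my assumptions: the teacher is loaded through Hugging Face `transformers` inside each `torchrun` worker, and the model ID is illustrative, not the actual MiniLLM loading code):

```python
import os

import torch
from transformers import AutoModelForCausalLM

# Under `torchrun --nproc_per_node 8`, all 8 workers execute this same code,
# so each GPU receives a full replica of the teacher. 72B parameters in bf16
# are roughly 144 GB of weights, which cannot fit on a single 80 GB A100.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)

teacher = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-72B",            # illustrative model ID (assumption)
    torch_dtype=torch.bfloat16,
).to(f"cuda:{local_rank}")       # replicated per rank -> OOM
```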

When I test loading the model on its own, Qwen2-72B can be split and loaded across multiple GPUs just fine. I don't understand why this does not happen during training.
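
For comparison, a sketch of the standalone test (again assuming `transformers` with `accelerate` installed; loading with `device_map="auto"` is my assumption about how the single-process test shards the model):

```python
import torch
from transformers import AutoModelForCausalLM

# In a single process, `device_map="auto"` lets accelerate place layers
# across all visible GPUs, so the 72B teacher is sharded, not replicated.
teacher = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-72B",            # illustrative model ID (assumption)
    torch_dtype=torch.bfloat16,
    device_map="auto",           # shard layers across the 8 GPUs
)
print(teacher.hf_device_map)     # shows which layers landed on which GPU
```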

Have you tried a pair of larger models in the MiniLLM experiments? I see that the largest teacher model in the paper is only 13B.