microsoft / LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs
https://aka.ms/GeneralAI

[MiniLLM] Using the Qwen2-72B model as the teacher model for MiniLLM training results in out-of-memory errors #281

Open · shhn1 opened 1 week ago

shhn1 commented 1 week ago

I use Qwen2-72B as the teacher model and Qwen2.5-32B as the student model for training, on 8×80 GB A100 GPUs.

When I load the Qwen2-72B model, I find that the teacher model is not sharded across the GPUs; instead, a complete copy of Qwen2-72B is loaded onto every GPU, resulting in OOM.
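
For illustration, here is a minimal sketch of what I suspect is happening (my assumptions: the teacher is loaded through Hugging Face `transformers` inside each `torchrun` worker, and the model ID is illustrative, not the actual MiniLLM loading code):

```python
import os

import torch
from transformers import AutoModelForCausalLM

# Under `torchrun --nproc_per_node 8`, all 8 workers execute this same code,
# so each GPU receives a full replica of the teacher. 72B parameters in bf16
# are roughly 144 GB of weights, which cannot fit on a single 80 GB A100.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)

teacher = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-72B",            # illustrative model ID (assumption)
    torch_dtype=torch.bfloat16,
).to(f"cuda:{local_rank}")       # replicated per rank -> OOM
```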

When I test loading the model on its own, Qwen2-72B can be split and loaded across multiple GPUs just fine. I don't understand why this does not happen during training.
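
For comparison, a sketch of the standalone test (again assuming `transformers` with `accelerate` installed; loading with `device_map="auto"` is my assumption about how the single-process test shards the model):

```python
import torch
from transformers import AutoModelForCausalLM

# In a single process, `device_map="auto"` lets accelerate place layers
# across all visible GPUs, so the 72B teacher is sharded, not replicated.
teacher = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-72B",            # illustrative model ID (assumption)
    torch_dtype=torch.bfloat16,
    device_map="auto",           # shard layers across the 8 GPUs
)
print(teacher.hf_device_map)     # shows which layers landed on which GPU
```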

Have you tried a pair of larger models in the MiniLLM experiments? I see that the largest teacher model in the paper is only 13B.