Open ypw-lbj opened 3 weeks ago
Sorry, due to device limitations, we haven't tried loading a 72B teacher model in our experiments. Do you run into any issues if you simply change TEACHER_MODEL_PATH to the path of the 72B teacher model?
I use 8 GPUs. With the 72B path I get an OOM error. So I set GPUS_PER_NODE=1 and added `device_map="auto"` to `AutoModelForCausalLM.from_pretrained(...)`, but this conflicts with `deepspeed.initialize` and the OOM error still occurs. I hope this repo can support larger models, which would give it greater influence. Thank you.
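One common workaround for this conflict (a sketch under assumptions, not verified against this repo) is to keep the frozen teacher out of DeepSpeed entirely: load it with `device_map="auto"` so Accelerate shards it across the GPUs, disable its gradients, and pass only the student to `deepspeed.initialize`. The names `TEACHER_MODEL_PATH`, `student`, and `ds_config` below are placeholders.

```python
import torch
import torch.nn as nn

def freeze_teacher(teacher: nn.Module) -> nn.Module:
    """Put the teacher in eval mode and disable all gradients, so it is
    used for inference only and never handed to deepspeed.initialize."""
    teacher.eval()
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

# Hypothetical usage with transformers + DeepSpeed (not run here):
#
#   from transformers import AutoModelForCausalLM
#   import deepspeed
#
#   # device_map="auto" lets Accelerate shard the 72B teacher across GPUs.
#   # Do NOT pass this model to deepspeed.initialize, or the two placement
#   # mechanisms will conflict with each other.
#   teacher = freeze_teacher(AutoModelForCausalLM.from_pretrained(
#       TEACHER_MODEL_PATH, device_map="auto", torch_dtype=torch.bfloat16))
#
#   # Only the student is wrapped by DeepSpeed.
#   student_engine, _, _, _ = deepspeed.initialize(model=student, config=ds_config)
```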
Thanks for your suggestion! We will work on it in the coming weeks. BTW, I think a more efficient way to use such a large teacher model is to pre-compute the teacher's logits and save them before KD. Anyway, we will try to fix this issue as soon as possible.
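The pre-computation idea above could be sketched as follows (an assumption-laden illustration, not code from this repo). Since saving full vocab-size logits for every token is usually far too large, a common compromise is to keep only the top-k values and their indices; `model` is assumed here to return raw `[batch, seq, vocab]` logits.

```python
import torch

@torch.no_grad()
def precompute_topk_logits(model, input_ids: torch.Tensor, k: int = 32):
    """Run the teacher once and keep only the top-k logits per position,
    plus their vocab indices, so they can be saved to disk before KD."""
    logits = model(input_ids)  # assumed shape: [batch, seq, vocab]
    values, indices = torch.topk(logits, k=k, dim=-1)
    return values, indices

# Hypothetical offline pass (not run here): iterate over the training set,
# call precompute_topk_logits with the 72B teacher, and torch.save the
# (values, indices) pairs. During KD training, rebuild a sparse target
# distribution from them instead of loading the teacher at all.
```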
For a larger teacher model, such as 72B, how should I modify the code of this framework? Please give me some guidance. Thank you.