songmzhang / DSKD

Repo for Paper "Dual-Space Knowledge Distillation for Large Language Models".

load 72B teacher model #13

Open ypw-lbj opened 3 weeks ago

ypw-lbj commented 3 weeks ago

For a larger teacher model, such as 72B, how should I modify the code of this framework? Please give me some guidance. Thank you.

songmzhang commented 3 weeks ago

Sorry, due to device limitations, we haven't tried loading a 72B teacher model in our experiments. Do you run into any issues if you just change TEACHER_MODEL_PATH to the path of the 72B teacher model?

ypw-lbj commented 3 weeks ago

> Sorry, due to device limitations, we haven't tried loading a 72B teacher model in our experiments. Do you run into any issues if you just change TEACHER_MODEL_PATH to the path of the 72B teacher model?

I use 8 GPUs. With the 72B teacher path I get an OOM error. So I changed GPUS_PER_NODE=1 and passed `device_map="auto"` to `AutoModelForCausalLM.from_pretrained(...)`, but that conflicts with `deepspeed.initialize` and OOM still occurs. I hope this repo can support larger models, which would give it greater influence. Thank you.
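For reference, a minimal sketch of one way to sidestep this conflict, assuming a single DeepSpeed rank for the student: the teacher is loaded with `device_map="auto"` so Accelerate shards it across the remaining GPUs, and it is never passed to `deepspeed.initialize`. The paths, dtype, and DeepSpeed config name below are placeholders, not the repo's actual setup.

```python
# Hypothetical sketch: shard a large teacher with HF Accelerate while only the
# student goes through DeepSpeed. Paths and dtype choices are placeholders.
import torch
import deepspeed
from transformers import AutoModelForCausalLM

TEACHER_MODEL_PATH = "/path/to/72b-teacher"   # placeholder
STUDENT_MODEL_PATH = "/path/to/student"       # placeholder

# Teacher: sharded across the visible GPUs by Accelerate, inference only.
teacher = AutoModelForCausalLM.from_pretrained(
    TEACHER_MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",          # let Accelerate place layers across GPUs
)
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

# Student: wrapped by DeepSpeed as usual; the teacher is NOT passed here,
# which avoids the conflict between device_map="auto" and deepspeed.initialize.
student = AutoModelForCausalLM.from_pretrained(
    STUDENT_MODEL_PATH, torch_dtype=torch.bfloat16
)
student_engine, _, _, _ = deepspeed.initialize(
    model=student,
    config="ds_config.json",    # placeholder DeepSpeed config
)

# During distillation, teacher logits are computed without gradients.
@torch.no_grad()
def teacher_logits(input_ids, attention_mask):
    return teacher(input_ids=input_ids, attention_mask=attention_mask).logits
```

Note that with multiple training ranks, each rank would try to shard its own copy of the teacher, so this only makes sense with one rank (as in your GPUS_PER_NODE=1 attempt) or with the teacher itself under ZeRO-3.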

songmzhang commented 3 weeks ago

> > Sorry, due to device limitations, we haven't tried loading a 72B teacher model in our experiments. Do you run into any issues if you just change TEACHER_MODEL_PATH to the path of the 72B teacher model?
>
> I use 8 GPUs. With the 72B teacher path I get an OOM error. So I changed GPUS_PER_NODE=1 and passed `device_map="auto"` to `AutoModelForCausalLM.from_pretrained(...)`, but that conflicts with `deepspeed.initialize` and OOM still occurs. I hope this repo can support larger models, which would give it greater influence. Thank you.

Thanks for your suggestion! We will work on it in the coming weeks. BTW, I think a more efficient way to use such a large teacher model is to pre-compute the teacher's logits and save them before KD. Anyway, we will try to fix this issue as soon as possible.
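As a rough illustration of the pre-computed-logits idea (not code from this repo), one could run the teacher once over the training set and store only the top-k logits per token to keep files small. The model path, output file name, k, and the `texts` iterable below are all placeholders.

```python
# Hypothetical sketch: dump the teacher's top-k logits once, for offline KD.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER_MODEL_PATH = "/path/to/72b-teacher"  # placeholder
TOP_K = 64                                   # keep only the k largest logits per token

tokenizer = AutoTokenizer.from_pretrained(TEACHER_MODEL_PATH)
teacher = AutoModelForCausalLM.from_pretrained(
    TEACHER_MODEL_PATH, torch_dtype=torch.bfloat16, device_map="auto"
)
teacher.eval()

@torch.no_grad()
def dump_topk_logits(texts, out_path):
    records = []
    for text in texts:                       # 'texts' is a placeholder iterable of training samples
        enc = tokenizer(text, return_tensors="pt").to(teacher.device)
        logits = teacher(**enc).logits[0]    # [seq_len, vocab_size]
        values, indices = logits.topk(TOP_K, dim=-1)
        records.append({
            "input_ids": enc["input_ids"][0].cpu(),
            "topk_values": values.to(torch.float16).cpu(),
            "topk_indices": indices.cpu(),
        })
    torch.save(records, out_path)            # later read in the KD loss instead of a live teacher

dump_topk_logits(["example training sentence"], "teacher_topk_logits.pt")
```

Loading these cached top-k values in the KD loss would remove the need to keep the 72B teacher in memory during student training; if the distillation objective also needs teacher hidden states, those would have to be cached in the same pass.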