songmzhang / DSKD

Repo for Paper "Dual-Space Knowledge Distillation for Large Language Models".

About SeqKD with different vocabularies #2

Closed · 2018cx closed this issue 2 months ago

2018cx commented 2 months ago

Hello, could you please elaborate on the implementation of SeqKD? Given that the vocabularies differ, the KL loss cannot be directly applied. How did you overcome this issue? If token alignment was used, could you specify which alignment methods were employed and explain the distinctions between SeqKD and MinED? As per my understanding, MinED, as utilized in the script, involves alignment followed by KL loss calculation. Thank you.

songmzhang commented 2 months ago

Hi, SeqKD in our paper is a black-box KD method and does not involve the KL loss. We simply generate and collect the teacher models' output responses on the training set, use them to replace the original responses in the training set, and then train the student models with cross-entropy (similar to many GPT-4 distillation methods, which collect GPT-4's responses to train student models).
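For illustration, a minimal sketch of this response-generation step might look like the following (assuming a Hugging Face `transformers` causal LM; the checkpoint path, prompt list, and function names are hypothetical and not the repo's actual script):

```python
# Illustrative sketch of the SeqKD data-generation step described above:
# the teacher generates responses for the training prompts, and these
# responses replace the ground-truth responses in the student's training set.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_path = "path/to/teacher"  # hypothetical checkpoint path
teacher_tok = AutoTokenizer.from_pretrained(teacher_path)
teacher = AutoModelForCausalLM.from_pretrained(
    teacher_path, torch_dtype=torch.bfloat16, device_map="auto"
)
teacher.eval()

def generate_teacher_responses(prompts, max_new_tokens=256):
    """Collect teacher outputs that will serve as training targets for the student."""
    responses = []
    for prompt in prompts:
        inputs = teacher_tok(prompt, return_tensors="pt").to(teacher.device)
        with torch.no_grad():
            output_ids = teacher.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,  # greedy decoding here; sampling is also possible
            )
        # Keep only the newly generated tokens (the response part).
        response = teacher_tok.decode(
            output_ids[0, inputs["input_ids"].shape[1]:],
            skip_special_tokens=True,
        )
        responses.append(response)
    return responses
```

The student is then fine-tuned on the (prompt, teacher response) pairs with ordinary cross-entropy, so no cross-vocabulary alignment or KL loss is needed.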

Nevertheless, thank you for pointing out this issue. We will add the code for the teacher model's response-generation process to this repo as soon as possible!

2018cx commented 2 months ago

Thank you for your response. So, does that mean SeqKD does not use ground-truth labels and only uses the output text from the teacher model? Do your other settings require the use of ground-truth labels, or did you rely solely on the output from the teacher model?

songmzhang commented 2 months ago

> Thank you for your response. So, does that mean SeqKD does not use ground-truth labels and only uses the output text from the teacher model? Do your other settings require the use of ground-truth labels, or did you rely solely on the output from the teacher model?

Yes, SeqKD does not use ground-truth labels. However, other token-level methods combine the cross-entropy loss on the ground-truth labels and the KD loss from the teacher model (in Eqn. (15)).
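As a rough illustration, the combination of the two losses in a token-level method could look like the sketch below (same vocabulary assumed for simplicity; the exact formulation and weighting follow Eqn. (15) in the paper, and `kd_weight` here is a hypothetical placeholder):

```python
# Illustrative combination of cross-entropy on ground-truth labels with a
# token-level KD term from the teacher (not the repo's exact implementation).
import torch
import torch.nn.functional as F

def token_level_kd_loss(student_logits, teacher_logits, labels,
                        kd_weight=0.5, ignore_index=-100):
    """student_logits, teacher_logits: [batch, seq_len, vocab] (shared vocabulary assumed)
    labels: [batch, seq_len] ground-truth token ids, ignore_index marks prompt/padding."""
    # Cross-entropy on the ground-truth labels.
    ce = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=ignore_index,
    )
    # KL divergence between teacher and student token distributions
    # (in practice, positions with ignore_index would also be masked out here).
    kd = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return (1.0 - kd_weight) * ce + kd_weight * kd
```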