songmzhang / DSKD

Repo for Paper "Dual-Space Knowledge Distillation for Large Language Models".

Concern regarding performance #10

Closed: survivebycoding closed this issue 3 weeks ago

survivebycoding commented 1 month ago

[screenshot] It seems the loss on the dev set is increasing with each epoch. Is this common?

songmzhang commented 1 month ago

Yes, it is common. The loss on the dev set does not strictly reflect the performance on it (e.g., Rouge-L). This was also mentioned by the author of MiniLLM in this issue.

survivebycoding commented 1 month ago

We have tried vanilla KD and DSKD as well.

[screenshot] Here too, same issue. So are you suggesting some metric other than Rouge-L?

songmzhang commented 1 month ago

Hi, I meant that you should use Rouge-L instead of the loss on the dev set to evaluate the model's performance. It is OK and common for the dev loss to increase during training; this does not mean that your model is deteriorating.
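
A minimal sketch of such an evaluation, assuming the `rouge_score` package (the repo's own Rouge-L computation may tokenize and aggregate differently):

```python
# Minimal sketch (assumption: the `rouge_score` package; DSKD's own evaluation
# code may differ). Score dev-set generations with Rouge-L instead of dev loss.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def avg_rouge_l(references, predictions):
    """Average Rouge-L F1 over paired reference/prediction strings."""
    scores = [
        scorer.score(ref, pred)["rougeL"].fmeasure
        for ref, pred in zip(references, predictions)
    ]
    return sum(scores) / max(len(scores), 1)
```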

survivebycoding commented 1 month ago

The min-edit code for TinyLLaMA can't use Mistral as the teacher model, can it? I am getting this error: distillation.py: error: argument --projector-lr: expected one argument

songmzhang commented 1 month ago

The min-edit code for TinyLLaMA can't use Mistral as the teacher model, can it? I am getting this error: distillation.py: error: argument --projector-lr: expected one argument

Please try the updated scripts in this repo.

survivebycoding commented 1 month ago

Trying with the updated code. Now getting this error:

File "/DSKD/code/criterions/min_edit_dis_kld.py", line 155, in transform_step_logits_fast [rank0]: base_model_special_token = TOKENIZER_TO_SPECIAL_TOKEN[ [rank0]: KeyError: <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'> E0812 07:29:50.616000 139955590409088 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 18344) of binary: /DSKD/llm_kd/bin/python3.10

songmzhang commented 1 month ago

Trying with the updated code. Now getting this error:

File "/DSKD/code/criterions/min_edit_dis_kld.py", line 155, in transform_step_logits_fast
[rank0]:     base_model_special_token = TOKENIZER_TO_SPECIAL_TOKEN[
[rank0]: KeyError: <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>
E0812 07:29:50.616000 139955590409088 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 18344) of binary: /DSKD/llm_kd/bin/python3.10

You can add LlamaTokenizerFast to TOKENIZER_TO_SPECIAL_TOKEN in min_edit_dis_kld.py like this. Or just pull the latest code from the repo.
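
For anyone hitting the same KeyError, a minimal sketch of the kind of entry meant here, assuming TOKENIZER_TO_SPECIAL_TOKEN maps tokenizer classes to their word-boundary tokens; the exact entries and the "▁" marker are assumptions, so check min_edit_dis_kld.py in the latest repo for the actual mapping:

```python
# Sketch only: the real mapping lives in DSKD/code/criterions/min_edit_dis_kld.py
# and may contain different entries. The idea is to register the fast LLaMA
# tokenizer class alongside the slow one so the KeyError above goes away.
from transformers import LlamaTokenizer, LlamaTokenizerFast

TOKENIZER_TO_SPECIAL_TOKEN = {
    LlamaTokenizer: "▁",      # SentencePiece word-boundary marker (assumed existing entry)
    LlamaTokenizerFast: "▁",  # added so LlamaTokenizerFast is also recognized
}
```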

survivebycoding commented 1 month ago

it worked with batch size 5 but didn't work with batch size 4

songmzhang commented 1 month ago

it worked with batch size 5 but didn't work with batch size 4

Can you provide the detailed error information?

survivebycoding commented 1 month ago

Say I have a checkpoint from vanilla KD with LLaMA and TinyLLaMA for 10 epochs. Can we reload the same checkpoint and run 10 more epochs?

songmzhang commented 1 month ago

Say I have a checkpoint from vanilla KD with LLaMA and TinyLLaMA for 10 epochs. Can we reload the same checkpoint and run 10 more epochs?

Our code does not yet support resuming training from existing checkpoints with the same optimizer states (e.g., the learning rate schedule and Adam momentums). However, if you just want to continue training the model from an existing checkpoint, you can try setting CKPT_PATH in the training scripts to the absolute path of that checkpoint.

survivebycoding commented 1 month ago

So after doing SFT on TinyLLaMA, the final Rouge-L value is 27.69. However, when we are doing vanilla KD with LLaMA, the performance on the dev set before training is 9.23. How is that possible? Shouldn't TinyLLaMA's performance before it starts vanilla KD be the same as the last epoch's performance from SFT on TinyLLaMA?

songmzhang commented 1 month ago
1. For your case, you should make sure that the checkpoint after SFT is correctly loaded.
2. We may need to remind you that you don't need to SFT the student model before KD. On the contrary, SFT is a process parallel to KD and serves as a baseline for it. That is, in our code, the initial student model for KD is not the checkpoint after SFT but the original pre-trained model.

survivebycoding commented 1 month ago

Then what is TEACHER_PEFT_PATH? The PEFT path will only be created once SFT is done, right?

TEACHER_MODEL_PATH="${BASE_PATH}/model_hub/${TEACHER_MODEL_TYPE}/${TEACHER_MODEL_NAME}"
TEACHER_PEFT_PATH="/DSKD/outputs/llama2/llama2-7b-hf/sft/criterion=cross_entropy__lora-rank=256-alpha=8-dropout=0.1-bf16__epoch=10__bsz=8x1x1=8__lr=0.001/epoch10_step12870_loss2.9489_rougel33.6290"

songmzhang commented 1 month ago

The teacher model should be the checkpoint after SFT. So the TEACHER_PEFT_PATH is the absolute path of the teacher LoRA checkpoint (e.g., the SFT path of LLaMA2, not TinyLLaMA).

The content you pasted is correct.