It seems the loss on the dev set is increasing with each epoch. Is this common?
Yes, it is common. The loss on the dev set does not strictly reflect the performance metrics on it (e.g., Rouge-L). This was also mentioned by the author of MiniLLM in this issue.
We have tried this for vanilla KD and DSKD as well, and the same issue appears there too. So are you suggesting some metric other than Rouge-L?
Hi, I meant that you should use Rouge-L instead of the loss on the dev set to evaluate the performance of the model. It is OK and common for the dev loss to increase during training; this does not mean that your model is deteriorating.
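For illustration, here is a minimal sketch of scoring dev-set generations with Rouge-L using the `rouge_score` package; this is just one way to compute the metric and is not necessarily how the evaluation in this repo is implemented, and the example strings are hypothetical.

```python
# Minimal sketch: evaluate dev-set generations with Rouge-L instead of tracking loss.
# Assumes the `rouge_score` package (pip install rouge-score); the evaluation code
# in this repo may compute the metric differently.
from rouge_score import rouge_scorer

references = ["the gold response for a dev example"]   # hypothetical data
predictions = ["the model's generated response"]       # hypothetical data

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
f1_scores = [
    scorer.score(ref, pred)["rougeL"].fmeasure
    for ref, pred in zip(references, predictions)
]
print(f"Rouge-L (F1): {100 * sum(f1_scores) / len(f1_scores):.2f}")
```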
The min-edit (MinED) code for TinyLLaMA can't use Mistral as the teacher model, can it? I am getting this error:
distillation.py: error: argument --projector-lr: expected one argument
Please try the updated scripts in this repo.
Trying with the updated code. Now I am getting this error:
File "/DSKD/code/criterions/min_edit_dis_kld.py", line 155, in transform_step_logits_fast
[rank0]:     base_model_special_token = TOKENIZER_TO_SPECIAL_TOKEN[
[rank0]: KeyError: <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>
E0812 07:29:50.616000 139955590409088 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 18344) of binary: /DSKD/llm_kd/bin/python3.10
You can add LlamaTokenizerFast to TOKENIZER_TO_SPECIAL_TOKEN in min_edit_dis_kld.py, as in the sketch below. Or just pull the latest code from this repo.
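For reference, a minimal sketch of the kind of entry to add, assuming TOKENIZER_TO_SPECIAL_TOKEN maps each tokenizer class to its sub-word boundary marker; the existing entries shown here are illustrative, so keep whatever the actual dictionary in min_edit_dis_kld.py already contains and only add the new line.

```python
# Sketch of the fix, assuming TOKENIZER_TO_SPECIAL_TOKEN maps tokenizer classes
# to their sub-word boundary markers (existing entries below are illustrative).
from transformers import GPT2TokenizerFast, LlamaTokenizer, LlamaTokenizerFast

TOKENIZER_TO_SPECIAL_TOKEN = {
    LlamaTokenizer: "▁",      # SentencePiece word-boundary marker (slow tokenizer)
    GPT2TokenizerFast: "Ġ",   # byte-level BPE space marker
    LlamaTokenizerFast: "▁",  # added entry: the fast Llama tokenizer uses the same marker
}
```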
it worked with batch size 5 but didn't work with batch size 4
Can you provide the detailed error information?
Say I have a checkpoint from vanilla KD with LLaMA and TinyLLaMA trained for 10 epochs. Can we reload the same checkpoint and run 10 more epochs?
Our code does not yet support resuming training from existing checkpoints with the same optimizer states (e.g., the learning-rate schedule and Adam momentum). However, if you just want to continue training the model from an existing checkpoint, you can try setting CKPT_PATH in the training scripts to the absolute path of your existing checkpoint.
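For example (with a hypothetical placeholder path; substitute the output directory of your earlier run), the change in the training script would look roughly like this:

```bash
# Hypothetical example: point CKPT_PATH at the checkpoint you want to continue
# from (replace the placeholder with your actual output directory).
CKPT_PATH="/path/to/your/existing/checkpoint"
```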
So after doing SFT on TinyLLaMA, the final Rouge-L value is 27.69. However, when we do vanilla KD with LLaMA, the performance on the dev set before training is 9.23. How is that possible? Shouldn't TinyLLaMA's performance before it starts vanilla KD be the same as the last epoch's performance from SFT on TinyLLaMA?
Then what is TEACHER_PEFT_PATH? The PEFT path will be created only after SFT is done, right?
TEACHER_MODEL_PATH="${BASE_PATH}/model_hub/${TEACHER_MODEL_TYPE}/${TEACHER_MODEL_NAME}"
TEACHER_PEFT_PATH="/DSKD/outputs/llama2/llama2-7b-hf/sft/criterion=cross_entropy__lora-rank=256-alpha=8-dropout=0.1-bf16__epoch=10__bsz=8x1x1=8__lr=0.001/epoch10_step12870_loss2.9489_rougel33.6290"
The teacher model should be the checkpoint after SFT, so TEACHER_PEFT_PATH is the absolute path of the teacher LoRA checkpoint (e.g., the SFT path of LLaMA2, not TinyLLaMA).
The content you pasted is correct.