kaizizzzzzz closed this issue 1 month ago.
The distilled models and the experiments in our paper are based on LLaMA-1.
Would it be easy to use this repo to run KD on LLaMA-2?
Yes. We have implemented model parallelism and SFT for LLaMA-2. The KD scripts can be easily adapted from the LLaMA-1 ones.
It seems there is no need to modify the source code to adapt to LLaMA-2; simply changing the scripts is enough?
Exactly.
Hello Yuxian, I'm a little curious about the MiniLLM training process and the datasets it uses, and I want to check my understanding. I have two questions.
PROMPT_DATA_DIR="${BASE_PATH}/processed_data/dolly/prompt/gpt2/"
LM_DATA_DIR="${BASE_PATH}/processed_data/openwebtext/gpt2/512/10M/"
The first one is also used for the SFT and KD baseline methods, but the second dataset is only for MiniLLM. I'm not familiar with KD, and the explanation in the paper confuses me. It says:
Input: Conditional generation dataset D consisting of prompts and ground-truth responses;
Pre-training corpus **_D_PT_** consisting of long-document plain texts;
A teacher model with output distribution p;
An initial student model **_pre-trained on D_PT_**, with output distribution q_θ0;
Learning rate η; batch size M; clipping threshold ε.
In my understanding, do these two mentions of D_PT stand for different things? The first D_PT is used for calculating the language-modeling loss L_PT = −E_{d∼D_PT} log q_θ(d), which is openwebtext/gpt2 here. The second D_PT is the corpus the model was actually pre-trained on. For models like GPT-2 or LLaMA, the pre-training data is private, so here we use other corpora like openwebtext and roberta to compute L_PT = −E_{d∼D_PT} log q_θ(d). Is this only because we don't have the real pre-training data, and would it be better if we did have it? Is my understanding correct? Another confusing point: since the pre-training data used here differs from the model's actual pre-training data, when we use MiniLLM, do we need to pre-train the student from scratch on the data used here? Or can we keep this mismatch and just use the released GPT-2 and LLaMA checkpoints?
Sorry for the two long questions, and I really appreciate your responses. Thanks!
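To make the first D_PT concrete for myself, here is a toy sketch (pure Python with made-up numbers, not the repo's actual code) of the language-modeling loss L_PT = −E_{d∼D_PT} log q_θ(d) as I understand it:

```python
import math

def lm_loss(token_logprobs):
    # L_PT = -E_{d ~ D_PT} log q_theta(d): the average negative log-probability
    # the student assigns to tokens drawn from the pretraining corpus
    # (openwebtext here, standing in for the model's real pretraining data).
    return -sum(token_logprobs) / len(token_logprobs)

# Toy per-token log-probs for one "document" under the student q_theta.
loss = lm_loss([math.log(0.5), math.log(0.25), math.log(0.5)])
```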
--total-iters 5000
, which corresponds to 6 or 7 epochs. I think 1 epoch is not enough for the models to achieve the performance in our paper. (NOTE: this epoch argument refers to epochs over the instruction data, PROMPT_DATA_DIR="${BASE_PATH}/processed_data/dolly/prompt/gpt2/", not openwebtext. Actually, openwebtext is trained on for less than an epoch.)
Thanks, that makes sense!
BTW, I used LoRA to SFT the student and teacher models (the step that produces the initialization models later used in MiniLLM), due to the GPU constraint on full-parameter SFT. I still train for 10 epochs; will using LoRA for the SFT hugely affect the performance of MiniLLM?
I have just finished the LoRA SFT of llama2-1.1B, and in /results/llama2/train/sft there are 10 folders, each storing the model for one epoch. Should I just choose the optimal one based on the rougeL score as the final SFT model?
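For picking the checkpoint, I'm doing something like the following sketch (the eval_res.json filename and the "rougeL" key are my assumptions about the layout; adjust to whatever your runs actually write out):

```python
import json
import os

def pick_best_checkpoint(sft_dir):
    # Scan per-epoch folders under sft_dir and return the one with the
    # highest rougeL score. Assumes each folder contains an eval file
    # named "eval_res.json" with a "rougeL" key -- both names are guesses.
    best_dir, best_score = None, float("-inf")
    for name in sorted(os.listdir(sft_dir)):
        path = os.path.join(sft_dir, name, "eval_res.json")
        if not os.path.isfile(path):
            continue
        with open(path) as f:
            score = json.load(f).get("rougeL", float("-inf"))
        if score > best_score:
            best_dir, best_score = os.path.join(sft_dir, name), score
    return best_dir, best_score
```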
Hello Yuxian, would you mind also sharing the link to the roberta dataset you used, before processing? I'm training MiniLLM for LLaMA-2, and I saw there are already two questions about the roberta dataset. I have tried to download the sub-datasets and combine them myself, but I'm afraid there are some mistakes in my operations. The original dataset files you shared don't include roberta, and I want to process it with LLaMA-2's tokenizer.
For now I'm just using LLaMA-1's processed roberta data to train MiniLLM for LLaMA-2. I saw there is only a small difference between the LLaMA-1 and LLaMA-2 tokenizers, so I think the influence is small, but it would be better if you could share the roberta dataset link. Thanks!
We haven't tried using LoRA for MiniLLM. I guess it would not affect the performance much. Choosing the final model based on the rougeL score is fine.
It would take some time for us to get the roberta dataset ready. We construct the roberta dataset simply by merging those sub-datasets and tokenizing them. Since the dataset is used only for regularization, and only a small subset of the data is actually used in training (less than an epoch), small differences in how the sub-datasets are merged will not make a great difference.
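For reference, the merging step amounts to something like this sketch (whitespace splitting stands in for the real LLaMA-2 tokenizer, and the function names and max_length are illustrative, not the repo's actual processing code):

```python
def merge_and_tokenize(sub_datasets, tokenize, max_length=512):
    # Merge the sub-datasets' documents into one corpus, then tokenize
    # each document and truncate it to max_length tokens.
    merged = []
    for docs in sub_datasets:  # one list of documents per sub-dataset
        merged.extend(docs)
    return [tokenize(doc)[:max_length] for doc in merged]

# Toy usage: whitespace "tokenization" stands in for a real tokenizer.
corpus = merge_and_tokenize([["a b c"], ["d e"]], str.split, max_length=2)
```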
Is it Llama1 or Llama2? Thx