microsoft / LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs
https://aka.ms/GeneralAI
MIT License

llama version in Minillm #218

Open kaizizzzzzz opened 2 months ago

kaizizzzzzz commented 2 months ago

Is it Llama1 or Llama2? Thx

t1101675 commented 1 month ago

The distilled models and the experiments in our paper are based on LLaMA-1.

kaizizzzzzz commented 1 month ago

Should it be easy to use this repo to KD llama2?

t1101675 commented 1 month ago

Yes. We have implemented model parallelism and SFT for LLaMA2. The KD scripts are easy to adapt from LLaMA-1.

kaizizzzzzz commented 1 month ago

It seems there is no need to modify the source code to adapt to LLaMA2; simply changing the scripts is enough?

t1101675 commented 1 month ago

Exactly.

kaizizzzzzz commented 1 month ago

Hello Yuxian, I'm a little curious about the MiniLLM process and the datasets it uses, and I want to check my understanding. I have two questions:

  1. Two datasets are used for MiniLLM. Taking gpt2 as an example:
    PROMPT_DATA_DIR="${BASE_PATH}/processed_data/dolly/prompt/gpt2/"
    LM_DATA_DIR="${BASE_PATH}/processed_data/openwebtext/gpt2/512/10M/"

The first one is also used for the SFT and KD baseline methods, but the second dataset is only used for MiniLLM. I'm not familiar with KD, and I am confused by the explanation in the paper, which says:

Input: Conditional generation dataset D consisting of prompts and ground-truth responses
Pre-training corpus **_DPT_** consisting of long-document plain texts
A teacher model with output distribution p
An initial student model **_pre-trained on DPT_**, with the output distribution qθ0
Learning rate η; Batch size M; Clipping Threshold ε

As I understand it, do these two DPT stand for different things? The first DPT is used to compute the language modeling loss $L_{PT} = -\mathbb{E}_{d \sim D_{PT}} \log q_\theta(d)$, which is openwebtext/gpt2 here. The second DPT is the data the model was actually pre-trained on. For models like GPT2 or LLaMA, the pre-training data is private, so here we use other pre-training datasets such as openwebtext and roberta to compute the language modeling loss. Is this only because we don't have the real pre-training data, and would it be better if we did have it? Is my understanding correct? Another point that confuses me: the pre-training data used here differs from the model's actual pre-training data. So when we use MiniLLM, do we need to train from scratch on the pre-training data used here, or can we keep this difference and just use the released GPT2 and LLaMA?

  2. I saw that the epoch hyperparameter is set to 10. The dataset is big and I don't have enough GPUs for that many epochs. Would a smaller number of epochs still work, for example 1 epoch?
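
For reference, the language modeling term mentioned in question 1 is the average negative log-likelihood that the student assigns to pre-training text. Below is a minimal sketch in plain Python, assuming the per-position vocabulary logits are already available. The function names and the toy numbers are hypothetical and are not taken from the MiniLLM code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def lm_loss(token_logits, target_ids):
    """Mean negative log-likelihood of the observed next tokens.

    token_logits: one list of vocabulary logits per position
    target_ids:   the token id actually observed at each position
    """
    if not target_ids or len(token_logits) != len(target_ids):
        raise ValueError("need one non-empty target id per position")
    nll = 0.0
    for logits, target in zip(token_logits, target_ids):
        nll -= log_softmax(logits)[target]
    return nll / len(target_ids)

# Toy example: 3 positions, vocabulary of 4 tokens (illustrative values only).
logits = [
    [2.0, 0.5, 0.1, -1.0],
    [0.0, 0.0, 0.0, 0.0],
    [-0.5, 1.5, 0.3, 0.2],
]
targets = [0, 2, 1]
loss = lm_loss(logits, targets)
```

With uniform logits over 4 tokens the loss at that position equals log 4, which is a quick sanity check for an implementation like this.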

Sorry for these two possibly redundant questions. I really appreciate your responses, thanks!

t1101675 commented 1 month ago

  1. We use openwebtext simply because the pre-training data of GPT2 is not available. GPT2 is pre-trained on WebText, which is generally assumed to have a distribution similar to that of openwebtext. The RoBERTa corpus is a subset of LLaMA's pre-training corpus, which ensures DPT does not introduce extra knowledge beyond pre-training. I think using the actual pre-training corpus would not make much difference. When we use MiniLLM, we just use the released GPT2 and LLaMA (DPT can be treated as a regularization).
  2. The SFT baselines should be trained for about 10 epochs before they reach their best performance. The total training steps of MiniLLM are controlled by --total-iters 5000, which corresponds to 6 or 7 epochs. I think 1 epoch is not enough for the models to achieve the performance in our paper. (NOTE: this epoch argument refers to epochs over the instruction data, PROMPT_DATA_DIR="${BASE_PATH}/processed_data/dolly/prompt/gpt2/", not openwebtext. Actually, openwebtext is trained on for less than an epoch.)
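
The relation between an iteration budget and the number of epochs over the instruction data can be estimated with simple arithmetic. This is a rough sketch that assumes each iteration consumes a fixed number of prompts. The function name and all numbers below are placeholders, not the values used in the repo, so substitute the ones from your own script.

```python
def epochs_for_iters(total_iters, prompts_per_iter, num_prompts):
    """Estimate how many passes over the prompt data a run makes."""
    if num_prompts <= 0:
        raise ValueError("num_prompts must be positive")
    return total_iters * prompts_per_iter / num_prompts

# Placeholder values: 1000 iterations, 8 prompts consumed per iteration,
# 4000 prompts in the instruction dataset.
estimated_epochs = epochs_for_iters(1000, 8, 4000)
print(estimated_epochs)
```

Running the same estimate with a smaller iteration budget shows how far a shortened run falls below the epoch count used in the paper's setup.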

kaizizzzzzz commented 1 month ago

Thanks, that makes sense!

BTW, I used LoRA for the SFT of the student and teacher models (the step that produces the initialization models later used in MiniLLM), because I don't have enough GPUs for full-parameter SFT. I still train for 10 epochs. Will using LoRA for SFT significantly affect the performance of MiniLLM?

I have just finished the LoRA SFT of llama2-1.1B, and in /results/llama2/train/sft there are 10 folders, each storing the model for one epoch. So should I just choose the best one based on the 'rougeL score' as the final SFT model?

kaizizzzzzz commented 1 month ago

Hello Yuxian, would you mind also sharing the link to the roberta dataset you used, before processing? I'm training MiniLLM for LLaMA2, and I saw there are two questions about the roberta dataset. I have tried downloading the sub-datasets and combining them myself, but I am afraid I made some mistakes. The original dataset file you shared doesn't include roberta, and I want to process it with LLaMA2's tokenizer.

For now I'm using the roberta data processed for LLaMA-1 to train MiniLLM for LLaMA2. I saw there is only a small difference between the LLaMA-1 and LLaMA2 tokenizers, so I think the influence is small, but it would be better if you could share the roberta dataset link. Thanks!

t1101675 commented 1 month ago

We haven't tried using lora for MiniLLM. I guess it would not affect the performance much. Choosing the final model based on the 'rougeL score' is fine.
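
Selecting the checkpoint by score can be done with a few lines. This is only an illustrative sketch: the folder names and rougeL values below are made up, and in practice the scores would come from the evaluation output of each epoch folder.

```python
def pick_best_checkpoint(scores):
    """Return the (name, score) pair with the highest score."""
    if not scores:
        raise ValueError("no checkpoints to choose from")
    return max(scores.items(), key=lambda item: item[1])

# Hypothetical rougeL score for each saved epoch folder.
rouge_l = {"epoch_1": 21.4, "epoch_2": 23.0, "epoch_3": 22.7}
best_name, best_score = pick_best_checkpoint(rouge_l)
print(best_name, best_score)
```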

t1101675 commented 1 month ago

It will take some time for us to get the roberta dataset ready. We constructed the roberta dataset simply by merging those sub-datasets and tokenizing them. Since the dataset is used for regularization and only a small subset of the data is actually used in training (less than an epoch), small differences in how the sub-datasets are merged will not matter much.
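
The merge-then-tokenize step described above might look roughly like the following. This is a sketch under stated assumptions, not the authors' preprocessing script: `merge_and_tokenize` is a hypothetical helper, the documents are stand-in strings, and `str.split` stands in for a real tokenizer such as the LLaMA2 one.

```python
import random

def merge_and_tokenize(sub_datasets, tokenize, sample_size=None, seed=0):
    """Merge several lists of documents, optionally subsample, then tokenize."""
    merged = [doc for docs in sub_datasets for doc in docs]
    # Shuffle with a fixed seed so the subsample is reproducible.
    rng = random.Random(seed)
    rng.shuffle(merged)
    if sample_size is not None:
        merged = merged[:sample_size]
    return [tokenize(doc) for doc in merged]

# Stand-in data and tokenizer; a real run would read the downloaded
# sub-datasets from disk and call the model's tokenizer instead.
books = ["a short book passage", "another passage"]
news = ["a news article"]
token_lists = merge_and_tokenize([books, news], str.split, sample_size=2)
```

Because less than an epoch of this data is consumed in training, subsampling after the merge keeps preprocessing cheap without changing what the regularization term sees in any important way.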