xvyaward opened 1 month ago
Hi @xvyaward, thanks for your support of LLMLingua-2 and for sharing the detailed experimental results.
Hi @pzs19, could you provide more details to help @xvyaward reproduce the experiments? Thanks!
Hi @xvyaward, thanks for your interest and the very detailed description.
- Could you please share more information about how you run inference with the Mistral model? The sampling and evaluation strategies can affect the overall results, e.g., the temperature used in sampling and whether the answer is truncated when "\n" appears.
- The reason the word_embedding is larger than the original one is that we add special tokens to assign to the words that need to be forcibly retained. For example, "llmlingua" may be tokenized into "llm" and "lingua". If we want to always keep "llmlingua", we need to replace it with a new token before running the tokenizer. We did not add these additional tokens during training.
- Thank you for pointing this out; we will fix it soon.
Hope these explanations can help you.
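To make the force-token mechanism described above concrete, here is a minimal, self-contained sketch of replacing force-kept words with reserved placeholder tokens before tokenization, so the tokenizer cannot split them into sub-words. The `[NEW{i}]` naming scheme and the helper name are my own assumptions for illustration, not the actual scheme used in prompt_compressor.py:

```python
# Sketch: substitute each force-kept word with a reserved placeholder token
# string before tokenization. The "[NEW{i}]" naming is illustrative; the
# actual reserved tokens used by prompt_compressor.py may differ.

def replace_force_tokens(text, force_tokens):
    """Return the rewritten text plus a placeholder -> word mapping so the
    original words can be restored after compression."""
    mapping = {}
    for i, word in enumerate(force_tokens):
        placeholder = f"[NEW{i}]"
        mapping[placeholder] = word
        text = text.replace(word, placeholder)
    return text, mapping

text, mapping = replace_force_tokens(
    "compress this prompt with llmlingua", ["llmlingua"]
)
# text is now "compress this prompt with [NEW0]", a single reserved token
# that an extended tokenizer can map to one embedding row.
```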
Is there a standard full training script available? We also expect to train a compressor ourselves, including the word_embedding mentioned earlier.
Yes! We have provided the experiment code for LLMLingua-2 in ./experiments/llmlingua2. The training data for the compressor is also available at HuggingFace.
You can run ./experiments/llmlingua2/data_collection/collect_data.sh first, which will get word labels in the original data and filter out bad samples. Then use the train.sh script in ./experiments/llmlingua2/model_training to train the compressor. You may need to modify the training code to include special tokens during training.
Hi @pzs19, thank you for your kind reply.
```python
terminators = [model.get_tokenizer().eos_token_id]

sampling_params = SamplingParams(
    max_tokens=args.n_max_token_ans,
    stop_token_ids=terminators,
    temperature=0.0,
    top_p=1.0,
)
response = model.generate(query, sampling_params=sampling_params)
```
I used temperature=0.0 and top_p=1.0 following the paper, and I believe answers are truncated at "\n" during evaluation by experiments/llmlingua2/evaluation/metrics.py.
However, I still cannot reproduce the scores of the official llmlingua-2-xlm-roberta-large-meetingbank. The in-domain meetingbank_qa score in particular drops significantly, from 73.6 to 68.
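For reference, the "\n" truncation discussed above amounts to keeping only the text before the first newline. A minimal sketch under that assumption; the helper name is mine and is not necessarily how metrics.py implements it:

```python
def truncate_at_newline(answer: str) -> str:
    """Keep only the portion of the model's answer before the first newline,
    mirroring the evaluation-time truncation discussed above (illustrative)."""
    return answer.split("\n", 1)[0].strip()

print(truncate_at_newline("San Francisco\nExplanation: the meeting covered..."))
# -> San Francisco
```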
2. In the first answer, you mentioned "We did not add these additional tokens during training."
However, you also suggested "You may need to modify the training code to include special tokens during training." in your answer to dingjingzhen.
So which approach is correct for reproducing the score of llmlingua-2-xlm-roberta-large-meetingbank with the official MeetingBank-LLMCompressed dataset? And if possible, could you share example code that handles special tokens during training?
Thank you.
Hi @pzs19, sorry for the misunderstanding.
In my last response, I meant that if you want to add special tokens during training, you need to modify our training code. In our experiments, special tokens were not added during training.
Describe the issue
Following issue #155, I am trying to reproduce the results of the official llmlingua-2-xlm-roberta-large-meetingbank model using Mistral-7B as the black-box LLM.
Specifically, I tried to fine-tune the XLM-RoBERTa model on the officially provided dataset using this train.sh.
Here is my detailed process:
Here are the current issues:
I found that the official llmlingua-2-xlm-roberta-large-meetingbank model weights have a word_embedding size of [250102, 1024]. This is larger than the original [250002, 1024] size of XLM-RoBERTa. I guess this is related to the special tokens added in prompt_compressor.py, but the train_roberta.py example does nothing about this, so my fine-tuned model has the same word_embedding size as the original RoBERTa ([250002, 1024]).
I guess the example in train.sh doesn't use the filtered results, which are named annotation_kept_cs512_meetingbank_train_formated.pt in collect_data.sh. This seems like a minor issue :)
If the official model was trained with the same process as the example provided here, could you please let me know what needs to be changed in the steps above? Thank you for reading.
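As a side note, the gap between [250002, 1024] and [250102, 1024] is exactly 100 rows, which is what adding 100 special tokens and resizing the embedding matrix would produce. A minimal sketch of the mechanics using Hugging Face transformers, with a tiny randomly initialized config whose sizes are illustrative only (not the real XLM-RoBERTa-large values):

```python
from transformers import XLMRobertaConfig, XLMRobertaModel

# Tiny randomly initialized model; all sizes here are illustrative only.
config = XLMRobertaConfig(
    vocab_size=100,
    hidden_size=16,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=32,
)
model = XLMRobertaModel(config)
assert model.get_input_embeddings().weight.shape[0] == 100

# Adding N special tokens means growing the embedding matrix by N rows.
# The official checkpoint grew from 250002 to 250102 rows (N = 100).
model.resize_token_embeddings(100 + 3)
assert model.get_input_embeddings().weight.shape[0] == 103
```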