microsoft / LLMLingua

To speed up LLM inference and enhance LLMs' perception of key information, LLMLingua compresses the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License

[Question]: Reproducing the score of official microsoft/llmlingua-2-xlm-roberta-large-meetingbank #156

Open xvyaward opened 1 month ago

xvyaward commented 1 month ago

Describe the issue

Following issue #155, I'm trying to reproduce the results of the official llmlingua-2-xlm-roberta-large-meetingbank model, using Mistral-7B as the black-box LLM.
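
For context, compression with the official checkpoint is done through PromptCompressor, roughly like this (a simplified sketch; the prompt, rate, and force_tokens values are illustrative placeholders, not my exact evaluation settings):

```python
from llmlingua import PromptCompressor

# Official LLMLingua-2 compressor checkpoint.
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

original_prompt = "<long MeetingBank transcript goes here>"

# rate / force_tokens are placeholders for illustration only.
result = compressor.compress_prompt(
    original_prompt,
    rate=0.33,
    force_tokens=["\n", "?"],
)
compressed_prompt = result["compressed_prompt"]
```

The compressed prompt is then passed to Mistral-7B for the QA / summarization tasks.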

Specifically, I tried to fine-tune the XLM-RoBERTa model on the officially provided dataset, using this train.sh.

Here is my detailed process:

  1. Format, label, and filter the official dataset, following collect_data.sh.
  2. Fine-tune the XLM-RoBERTa model using train.sh, with the hyperparameters from the LLMLingua-2 paper.

Here are the current issues:

  1. It's hard to reproduce the Table 4 results of the LLMLingua-2 paper, or even the scores in issue #155. Here are my reproduced results:
| | MeetingBank QA | MeetingBank summary | LongBench 2000-token avg. | narrativeqa | multifieldqa_en | multifieldqa_zh | qasper |
|---|---|---|---|---|---|---|---|
| LLMLingua-2, scores reproduced with official model weights | 73.59 | 29.95 | 25.65 | 10.07 | 36.61 | 26.47 | 29.46 |
| LLMLingua-2, reproduced with my fine-tuning | 68.95 | 30.05 | 24.67 | 9.14 | 33.91 | 26.49 | 29.12 |

(All LongBench subsets use the 2000-token setting.)
  2. I found that the official llmlingua-2-xlm-roberta-large-meetingbank model weight has a word_embedding of size [250102, 1024]. This is larger than the original [250002, 1024] of XLM-RoBERTa. I guess this is related to the special tokens added in prompt_compressor.py, but the train_roberta.py example does nothing about this, so my fine-tuned model has the same word_embedding size as the original RoBERTa ([250002, 1024]).

    • I tried resizing the token embeddings first and then fine-tuning, but the results were almost the same (a rough sketch of what I did is shown after this list).
  3. I guess the example in train.sh doesn't use the filtered data, which is named annotation_kept_cs512_meetingbank_train_formated.pt in collect_data.sh. This seems like a minor issue :)
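
For reference, the embedding-resizing attempt mentioned in the sub-bullet above looked roughly like this (a minimal sketch; the placeholder token strings are mine, the real force-retention tokens are the ones defined in prompt_compressor.py; 100 extra tokens matches the 250102 − 250002 gap):

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "xlm-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Token classification head (keep / drop label per token).
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=2)

# Hypothetical placeholder names standing in for the added special tokens.
extra_tokens = [f"[NEW{i}]" for i in range(100)]
tokenizer.add_special_tokens({"additional_special_tokens": extra_tokens})

# Grow the embedding matrix so the new ids get randomly initialized rows
# that are then learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))
print(model.get_input_embeddings().weight.shape)  # torch.Size([250102, 1024])
```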

If the official model was trained with the same process provided as an example here, could you please let me know what needs to be changed in the process above? Thank you for reading.

iofu728 commented 1 month ago

Hi @xvyaward, thanks for your support of LLMLingua-2 and for sharing the detailed experimental results.

Hi @pzs19, could you provide more details to help @xvyaward reproduce the experiments? Thanks!

pzs19 commented 1 month ago

Hi @xvyaward, thanks for your interest and the very detailed description.

  1. Could you please share more information on how you use the Mistral model for inference? The sampling and evaluation strategies, such as the temperature used in sampling and whether the answer is truncated when "\n" appears, can have an impact on the overall results.

  2. The reason why the word_embedding is larger than the original one is that we add special tokens that are assigned to words that need to be forcibly retained. For example, "llmlingua" may be tokenized into "llm" and "lingua". If we want to always keep "llmlingua", we need to replace it with a new token before running the tokenizer (see the sketch after this list). We did not add these additional tokens during training.

  3. Thank you for pointing this out; we will fix it soon.
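
To make point 2 concrete, the idea is roughly the following (an illustrative sketch, not our actual implementation; the placeholder token names are made up):

```python
# Words that must be kept verbatim are swapped to dedicated added tokens so the
# tokenizer cannot split them, then swapped back after compression.
force_words = ["llmlingua"]
placeholders = [f"[FORCE_{i}]" for i in range(len(force_words))]  # hypothetical names

word2tok = dict(zip(force_words, placeholders))
tok2word = {v: k for k, v in word2tok.items()}

def protect(text: str) -> str:
    # Replace each force-kept word with its placeholder before tokenization.
    for word, tok in word2tok.items():
        text = text.replace(word, tok)
    return text

def restore(text: str) -> str:
    # Undo the replacement in the compressed output.
    for tok, word in tok2word.items():
        text = text.replace(tok, word)
    return text

print(restore(protect("llmlingua compresses long prompts")))
```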

Hope these explanations can help you.

dingjingzhen commented 1 month ago

Is there a standard full training script available? We also expect to train a compressor ourselves, including the word_embedding mentioned earlier.

pzs19 commented 1 month ago

> Is there a standard full training script available? We also expect to train a compressor ourselves, including the word_embedding mentioned earlier.

Yes! We have provided the experiment code for LLMLingua-2 in ./experiments/llmlingua2. The training data for the compressor is also available at HuggingFace.

You can run ./experiments/llmlingua2/data_collection/collect_data.sh first, which will get word labels in the original data and filter out bad samples. Then use the train.sh script in ./experiments/llmlingua2/model_training to train the compressor. You may need to modify the training code to include special tokens during training.
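
As a starting point, the released training data can be loaded directly from the hub, e.g. (the dataset ID below is my reading of the hub path for MeetingBank-LLMCompressed; please double-check it on the Hugging Face page):

```python
from datasets import load_dataset

# Assumed hub ID for the LLMLingua-2 compressor training data;
# adjust the ID and split name if they differ on the hub.
ds = load_dataset("microsoft/MeetingBank-LLMCompressed", split="train")
print(ds)
print(ds[0])
```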

xvyaward commented 1 month ago

Hi @pzs19, thank you for your kind reply.

  1. I modified eval_meetingbank_qa.py to use the vLLM version of the Mistral model. Here is my code for the generation part:

```python
from vllm import SamplingParams

# `model` is a vllm.LLM wrapping Mistral-7B; `args` and `query` come from the
# surrounding eval_meetingbank_qa.py script.
terminators = [model.get_tokenizer().eos_token_id]

sampling_params = SamplingParams(
    max_tokens=args.n_max_token_ans,
    stop_token_ids=terminators,
    temperature=0.0,
    top_p=1.0,
)

response = model.generate(query, sampling_params=sampling_params)
```

I used temperature=0.0 and top_p=1.0 following the paper, and I believe answers are truncated at "\n" during evaluation by experiments/llmlingua2/evaluation/metrics.py.
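
Concretely, the truncation I'm referring to is something like this (a simplified sketch, not the exact logic in metrics.py):

```python
def truncate_answer(answer: str) -> str:
    # Keep only the text before the first newline when scoring QA answers.
    return answer.split("\n")[0].strip()

print(truncate_answer("Seattle\nThe council then discussed..."))  # -> "Seattle"
```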

However, I still can't reproduce the scores of the official llmlingua-2-xlm-roberta-large-meetingbank model. The score on the in-domain meetingbank_qa task in particular has dropped significantly, from 73.59 to 68.95.

  2. In your first answer, you mentioned that "We did not add these additional tokens during training."
However, in your answer to dingjingzhen, you also suggested that "You may need to modify the training code to include special tokens during training."

So which approach is correct for reproducing the score of llmlingua-2-xlm-roberta-large-meetingbank using the official MeetingBank-LLMCompressed dataset? And if possible, could you share example code that handles special tokens during training?

Thank you.

pzs19 commented 1 month ago

Hi @xvyaward, sorry for the misunderstanding.

In my last response, I meant that if you want to add special tokens during training, you need to modify our training code accordingly. In our experiments, special tokens were not added during training.