snu-mllab / Context-Memory

PyTorch implementation for "Compressed Context Memory for Online Language Model Interaction" (ICLR'24)
https://arxiv.org/abs/2312.03414
MIT License

unexpected response when using llama2-7b-chat #3

Open | kaishxu opened this issue 5 months ago

kaishxu commented 5 months ago

Hello!

I'm trying to use your pre-trained model with this command: `CUDA_VISIBLE_DEVICES=4,5,6,7 python inference.py -i -m llama-2-7b-chat --eval_name concat_recur`

However, generation stops unexpectedly when I input the query: "help me list popular songs written by Taylor Swift."

The result is shown as follows:

[Screenshot: generation result, 2024-04-17 21:26]

It stops generating more content and outputs </s> instead.
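
For reference, here is a minimal sketch (using plain Hugging Face APIs rather than this repo's inference.py; the checkpoint name is only an example) of how one can check whether the stop comes from an emitted EOS token rather than a length limit:

```python
# Standalone sketch: confirm whether generation stopped because the model
# emitted the EOS token, not because max_new_tokens was reached.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-chat-hf"  # example checkpoint, not the repo's loading path
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "help me list popular songs written by Taylor Swift."
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)[0]

# If the last generated id is the EOS id, the model stopped on </s> by itself.
print(output_ids[-1].item() == tokenizer.eos_token_id)
print(tokenizer.decode(output_ids, skip_special_tokens=False))
```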

Are there any other settings I missed?

Janghyun1230 commented 5 months ago

Hello! I just tried the query with the given command and the current GitHub commits.

At the beginning of the chat, the model produces the list: [Screenshot: 2024-04-17 11:50 AM]

However, after compression, the model seems to produce the EOS token before the list: [Screenshot: 2024-04-17 11:48 AM]

Comparing the results above, it seems that the generation code is not the problem. My suspicion is that our training data (for the compression adapter) is mainly composed of sentences without \n tokens, and this causes the behavior above. To solve the problem, I think we need to design new training data.
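
As a quick sanity check of this hypothesis, something like the sketch below (standalone code; the sample list is a placeholder, not our actual training set) can count how many training texts contain \n tokens:

```python
# Standalone sketch: count how many text samples contain a newline character.
from collections import Counter

def newline_stats(samples):
    return Counter("with_newline" if "\n" in s else "without_newline" for s in samples)

samples = ["Line one.\nLine two.", "A single sentence without breaks."]
print(newline_stats(samples))  # Counter({'with_newline': 1, 'without_newline': 1})
```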

kaishxu commented 5 months ago

Thanks so much for your quick reply!

I have another question, about the LinearMask() class used in most of the modeling files under the "arch" directory. As shown in the figure below, the forward method of LinearMask() takes comp_mask as an input, but the actual computation does not use this variable.

[Screenshot: 2024-04-18 10:42:20]

If this variable is not used, the linear mapping is identical to the original implementation in "modeling_llama.py".
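
To make the question concrete, here is a rough sketch (my own reconstruction, not the repo's exact code, which is in the screenshot above) of the pattern I mean: comp_mask is accepted but never used in the projection.

```python
import torch.nn as nn

class LinearMask(nn.Linear):
    def forward(self, x, comp_mask=None):
        # comp_mask is accepted here, but the projection below never uses it.
        return super().forward(x)
```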

kaishxu commented 5 months ago

(quoting Janghyun1230's reply above)

It is an interesting phenomenon that the compression tokens affect the generation capability.

Janghyun1230 commented 5 months ago

Regarding the LinearMask question: comp_mask is used together with LoRA. I modified the Hugging Face LoRA code in src/peft_custom/lora.py.

https://github.com/snu-mllab/Context-Memory/blob/24af6a0e951076f7d9d7cc8601418ed08e9f1865/src/peft_custom/lora.py#L565

Without LoRA, our model behaves the same as the original function; the LoRA update is activated only for the compression tokens.
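
For intuition, here is a minimal sketch of that idea (my own simplified code with assumed names and shapes, not the actual src/peft_custom/lora.py): the base linear map is left unchanged, and the LoRA update is added only at positions selected by comp_mask.

```python
import torch
import torch.nn as nn

class MaskedLoRALinear(nn.Module):
    """Linear layer whose LoRA update applies only where comp_mask is True (sketch)."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.lora_A = nn.Linear(in_features, r, bias=False)
        self.lora_B = nn.Linear(r, out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # start as a no-op, as in standard LoRA
        self.scaling = alpha / r

    def forward(self, x, comp_mask=None):
        # x: (batch, seq_len, in_features); comp_mask: (batch, seq_len) bool
        out = self.base(x)
        if comp_mask is not None:
            delta = self.lora_B(self.lora_A(x)) * self.scaling
            out = out + delta * comp_mask.unsqueeze(-1).to(out.dtype)
        return out

# Only the masked (compression-token) positions receive the LoRA update.
x = torch.randn(1, 5, 16)
mask = torch.tensor([[0, 0, 0, 1, 1]], dtype=torch.bool)
y = MaskedLoRALinear(16, 16)(x, comp_mask=mask)
```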