mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

Questions about "Run Streaming Llama Chatbot" #36

Closed ChuanhongLi closed 9 months ago

ChuanhongLi commented 9 months ago

First of all, thanks for releasing this excellent work! I have some questions about running the example you provided. I used the following command:

# I have downloaded Llama-2-7b-hf and put it in /data/model/Llama-2-7b-hf
CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py  --enable_streaming  --model_name_or_path /data/model/Llama-2-7b-hf

And I get the following results:

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:41<00:00, 20.89s/it]
Loading data from data/mt_bench.jsonl ...
prompts length:  158
StartRecentKVCache: 4, 2000

USER: Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.

ASSISTANT: seq_len:  38

### 1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1
Inference speed for this question: 23.53038809423589 token/sec

USER: Rewrite your previous response. Start every sentence with the letter A.

ASSISTANT: seq_len:  24
USER: Rewrite your previous response. Start every sentence with the letter B.

ASSISTANT:

USER: Rewrite your previous response. Start every sentence with the letter C.

ASSISTANT:

USER: Rewrite your previous response. Start every sentence with the letter D.

ASSISTANT:

USER: Rewrite your previous response. Start every sentence with the letter E.

ASSISTANT:

USER: Rewrite your previous response. Start every sentence with the letter F.

ASSISTANT:

USER: Rewrite your previous response. Start every sentence with the letter G.

ASSISTANT:

USER: Rewrite your previous response. Start every sentence with the letter H.

ASSISTANT:

USER: Rewrite your previous response. Start every sentence with the letter I.

ASSISTANT:

USER: Rewrite your previous response. Start every sentence with the letter J.

ASSISTANT:

USER: Rewrite your previous response. Start every sentence with the letter K.

It seems that it does not work well. Is anything wrong with my test? Should I change something to get correct results?

When I use lmsys/vicuna-13b-v1.3 as the model, the results seem OK.

Thanks!

Guangxuan-Xiao commented 9 months ago

Hello, thank you for your interest in our work! Llama-2-7b-hf has not been instruction-tuned, so it isn't ideal for chatbot applications. We recommend instruction-tuned models such as Vicuna or Llama-2-7b-chat-hf for that purpose.

ChuanhongLi commented 9 months ago

Thank you for your reply. One more question: does Figure 10 in your paper also use instruction-tuned Llama-2-7b (Llama-2-13b)?

Guangxuan-Xiao commented 9 months ago

Figure 10 reports efficiency results. Instruction-tuned models (*-chat) and base models give identical results for it.
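
For readers wondering why efficiency would not depend on whether the weights are instruction-tuned, here is a minimal sketch of a start-plus-recent cache policy. It uses the sizes printed in the log above (StartRecentKVCache: 4, 2000). It is an illustration only, not the repository's implementation: the function name evict_start_recent is hypothetical, and plain integers stand in for per-token key/value entries.

```python
def evict_start_recent(cache, start_size=4, recent_size=2000):
    """Keep the first `start_size` entries and the last `recent_size` entries.

    Hypothetical helper for illustration; assumes recent_size > 0.
    """
    if len(cache) <= start_size + recent_size:
        return cache
    return cache[:start_size] + cache[-recent_size:]


# Simulate a stream of 5000 token positions (stand-ins for KV entries).
cache = []
max_len = 0
for position in range(5000):
    cache.append(position)
    cache = evict_start_recent(cache)
    max_len = max(max_len, len(cache))

print(max_len)              # 2004: bounded by start_size + recent_size
print(cache[:4])            # [0, 1, 2, 3]: the initial tokens are kept
print(cache[4], cache[-1])  # 3000 4999: only the most recent 2000 remain
```

In this sketch the cache length depends only on the two size parameters and the stream length, never on the model weights, which is consistent with base and *-chat models of the same architecture showing the same efficiency.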