mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

Questions on the demo results #43

Closed BitCalSaul closed 9 months ago

BitCalSaul commented 9 months ago

I would like to express my gratitude for your paper and code, which have been truly enlightening for me. I conducted the experiments following the instructions provided in the README. I would be grateful if you could provide some insights regarding the following query.

In the scenario where "enable_streaming" is enabled, the language model (LLM) performs well when presented with questions from the test dataset. It generates responses to each question seamlessly until completion.

However, when "enable_streaming" is not enabled, the LLM initially responds well to the first few questions. As more questions are presented, its performance deteriorates, ultimately ending in an error: "torch.cuda.OutOfMemoryError: CUDA out of memory," which aligns with what is shown in the README video.

I can understand that the decline in performance in the second case may be attributed to the eviction of the attention sinks. Nevertheless, I am uncertain about the reason behind the "CUDA out of memory" error.

I would greatly appreciate any insights or clarifications you could offer on this matter.

Guangxuan-Xiao commented 9 months ago

Hi,

I appreciate your interest in our work and the detailed observation.

The demo utilizes the Dense Attention baseline when StreamingLLM is not enabled. This essentially means it retains all past conversations' Key and Value states. The model's performance is likely to deteriorate significantly once the token length surpasses the pre-training window size, which, for the default model, is set at 2048 tokens. While the model continues to operate, the output becomes less meaningful, with tokens that aren't printable. If the cached Key and Value states surpass the available GPU memory, you'll encounter the "Out of Memory" (OOM) error.
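To make the OOM behavior concrete, here is a minimal back-of-the-envelope sketch of how the dense-attention KV cache grows with sequence length. The model dimensions (32 layers, 32 KV heads, head dim 128, fp16) are assumptions roughly matching a Llama-2-7B-class model, not values read from the demo code:

```python
# Rough estimate of dense-attention KV cache size, assuming Llama-2-7B-like
# dimensions: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes per value).
# These numbers are illustrative assumptions, not taken from the repo.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, dtype_bytes=2):
    # Factor of 2 accounts for storing both Key and Value states
    # for every layer, head, and cached token.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

for tokens in (2048, 16384, 65536):
    gib = kv_cache_bytes(tokens) / 1024**3
    print(f"{tokens:>6} tokens -> {gib:.1f} GiB of KV cache")
# Under these assumptions the cache costs ~0.5 MiB per token, so it grows
# without bound as the conversation continues and eventually exceeds GPU
# memory. StreamingLLM instead caps the cache at a fixed window
# (attention sinks + recent tokens), so its memory stays constant.
```

Under these assumed dimensions, 2048 tokens already occupy about 1 GiB, and the cost scales linearly from there, which is why a long multi-question session eventually triggers the OOM error.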

I hope this helps clear up your question.

Guangxuan

BitCalSaul commented 9 months ago

Hi @Guangxuan-Xiao , thank you for the explanation. It is clear to me now.