mit-han-lab/streaming-llm
[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License · 6.59k stars · 361 forks
Issues (newest first)
#86 · Can you provide the code related to the visualization in the paper? · opened by micelvrice 1 month ago · 0 comments
#85 · [question] Does streaming-llm focus on accelerating decoding stage? How about the prefilling stage? · opened by Code24Man 2 months ago · 0 comments
#84 · Tokenizer issue with Transformers 4.33.0 · opened by PedemonteGiacomo 3 months ago · 0 comments
#83 · Evaluation code and dataset release inquiry · opened by DerrickYLJ 3 months ago · 0 comments
#82 · How to visualize attention logits? · closed by OStars 4 months ago · 1 comment
#81 · what is the difference between window attention and sliding window recomputation · closed by seeyourcell 4 months ago · 0 comments
#80 · Progressively decreasing attention windows · opened by Vorlent 4 months ago · 0 comments
#79 · Using LLaVA model · opened by JesseZZZZZ 4 months ago · 0 comments
#78 · why `max_gen_len` is needed when considering `space_needed`? · opened by Mr-lonely0 6 months ago · 0 comments
#77 · How to evaluate ppl? · opened by Jiawei-Yang 6 months ago · 2 comments
#76 · StreamEval · opened by Zhangchaoran000 8 months ago · 0 comments
#75 · Support mistral-7b? · opened by spring1915 9 months ago · 0 comments
#74 · Run with start_size=0 looks just fine · opened by cyr0930 9 months ago · 0 comments
#73 · question about positions encoding when apply ROLLING KV CACHE WITH ATTENTION SINKS · closed by bugm 9 months ago · 1 comment
#72 · Error happened · opened by ForrestPi 9 months ago · 2 comments
#71 · Questions about ARC datasets · opened by Zoeyyao27 10 months ago · 0 comments
#70 · How much GPU memory needed to run example? · opened by fangming-he 10 months ago · 3 comments
#69 · Is there the way of parallel prompt? · opened by DavideHe 10 months ago · 0 comments
#68 · Question about attention sink arising in pretrained models · opened by kevinli573 10 months ago · 0 comments
#67 · Request for Code and Details on Figures 2 and 7 · opened by ZhouZineng 10 months ago · 0 comments
#66 · Questions Related to the Application and Results of Attention Sinks After the Paper · opened by dsdanielpark 10 months ago · 0 comments
#65 · Questions Regarding "Sink Tokens" · opened by clarenceluo78 11 months ago · 0 comments
#64 · Doubts in "run_streaming_llama.py" file · opened by Rishab9991 11 months ago · 0 comments
#63 · Question about Naive Sliding Window · closed by kevinli573 11 months ago · 2 comments
#62 · why starting sink token is not a special token '\n'? · closed by dhcode-cpp 11 months ago · 2 comments
#61 · Results for Section 3.2 Rolling KV Cache (Without Pretraining) · opened by timljj 11 months ago · 1 comment
#60 · The position id for q · opened by ofhwei 11 months ago · 1 comment
#59 · The reason for the importance of the initial token. · opened by freyamom 11 months ago · 0 comments
#58 · [Feature Request] Support InternLM Model · opened by vansin 11 months ago · 1 comment
#57 · Can support to ChatGLM2? · opened by KareEnges 11 months ago · 0 comments
#56 · Enable explictly setting transformer model cache · opened by JiaxuanYou 11 months ago · 0 comments
#55 · question about Table 1 in paper · opened by AresXD 11 months ago · 1 comment
#54 · question about initial tokens · opened by chaojiewang94 11 months ago · 2 comments
#53 · While streaming with sinks, how does the framework change the positional encodings of the KV cache without having to multiply with the Key and Value matrices? · opened by Bhuvanesh09 11 months ago · 4 comments
#52 · Finetuning a model in the streaming mode? · closed by MohamedAliRashad 11 months ago · 1 comment
#51 · question about re-computation · closed by ysanimals 11 months ago · 4 comments
#50 · Implementation of lama2 7b chat hf model · opened by MuhammadIshaq-AI 11 months ago · 7 comments
#49 · Implementing lama2 7b · closed by MuhammadIshaq-AI 11 months ago · 0 comments
#48 · Is code's position wrong with "kv_cache.evict_for_space"? · closed by DavideHe 11 months ago · 2 comments
#47 · some question about paper · closed by Vincentyua 11 months ago · 1 comment
#46 · Does past_key_values be repeatedly compute? · opened by freyamom 11 months ago · 5 comments
#45 · How to use streaming llm to train a new model? is there any sample code . thansk · closed by mega-cqz 11 months ago · 1 comment
#44 · I'm (A Bit) Suspicious of Table 3. · closed by FrederickGeek8 11 months ago · 1 comment
#43 · Questions on the demo results · closed by BitCalSaul 11 months ago · 2 comments
#42 · Question on intuition of "attention sink" and "alibi PE" · closed by bowencohere 11 months ago · 3 comments
#41 · Question about long input and difference between streaming-llm and dense attention. · closed by hxs91 11 months ago · 2 comments
#40 · RuntimeError: "addmm_impl_cpu_" not implemented for 'Half' · closed by chnl 11 months ago · 2 comments
#39 · Question about evaluation results and demo · closed by hsm1997 11 months ago · 2 comments
#38 · How to answer the question in the middle of long input · opened by yangzhj53 12 months ago · 0 comments
#37 · RuntimeError in run_streaming_llama.py When Using Accelerate with Streaming LLMa Model on A4500 GPU · opened by ZexinLi0w0 12 months ago · 4 comments
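Several of the issues above (#78 on `space_needed` and `max_gen_len`, #73 on position encodings under the rolling KV cache, #48 on where eviction is called) revolve around the same mechanism: the cache keeps a few initial "attention sink" tokens plus a sliding window of recent tokens, and evicts the middle before appending new ones. The following is a minimal illustrative sketch of that eviction policy only, using a plain Python list as a stand-in for per-layer key/value tensors; the function name and signature are hypothetical and are not the repository's actual API.

```python
# Sketch of "attention sinks + rolling window" KV-cache eviction:
# always keep the first `start_size` cached tokens (the sinks) and
# the most recent tokens, dropping the middle so that `space_needed`
# new tokens still fit in a budget of start_size + recent_size slots.
# `cache` is a simplified stand-in: a list of token indices.

def evict_for_space(cache, start_size, recent_size, space_needed):
    """Return a cache with room for `space_needed` incoming tokens."""
    budget = start_size + recent_size
    if len(cache) + space_needed <= budget:
        return cache  # enough room already, nothing to evict
    # Keep the sinks plus just enough of the tail that, after the
    # `space_needed` new tokens are appended, we are back within budget.
    keep_tail = max(recent_size - space_needed, 0)
    return cache[:start_size] + cache[len(cache) - keep_tail:]

cache = list(range(10))  # tokens 0..9 already cached
cache = evict_for_space(cache, start_size=4, recent_size=4, space_needed=2)
print(cache)  # [0, 1, 2, 3, 8, 9]: sinks kept, middle evicted
```

This also shows why a `max_gen_len`-style reservation (issue #78) matters: eviction must leave room for every token the model is about to generate, not just the prompt, otherwise the cache overflows mid-generation. Note the sketch deliberately ignores the position-id remapping that the real method needs (issue #73), where cached tokens are re-addressed by their position inside the cache rather than in the original text.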