mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

Run with start_size=0 looks just fine #74


cyr0930 commented 6 months ago

I've run a number of experiments, and it looks like most of the performance comes from enabling pos_shift. The numbers below are the perplexities reported by eval_long_ppl.py.

python examples/eval_long_ppl.py --model_name_or_path lmsys/vicuna-13b-v1.3 --num_samples 8
6.840701103210449

python examples/eval_long_ppl.py --model_name_or_path lmsys/vicuna-13b-v1.3 --num_samples 8 --enable_start_recent_kv_cache --start_size 1 --recent_size 255
29.674755096435547

python examples/eval_long_ppl.py --model_name_or_path lmsys/vicuna-13b-v1.3 --num_samples 8 --enable_start_recent_kv_cache --start_size 0 --recent_size 256 --enable_pos_shift
8.8959321975708

python examples/eval_long_ppl.py --model_name_or_path lmsys/vicuna-13b-v1.3 --num_samples 8 --enable_start_recent_kv_cache --start_size 1 --recent_size 255 --enable_pos_shift
7.493190765380859

python examples/eval_long_ppl.py --model_name_or_path lmsys/vicuna-13b-v1.3 --num_samples 8 --enable_start_recent_kv_cache --start_size 4 --recent_size 252 --enable_pos_shift
7.363883018493652
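For reference, this is roughly how I read the cache policy that --start_size / --recent_size control. It is only a minimal sketch with made-up helper names, not the repo's actual KV-cache code (which operates on per-layer key/value tensors), but it shows why start_size=0 degenerates to a plain sliding window:

# Illustrative sketch only; real code evicts entries from the KV cache, not token ids.
def evict(token_ids, start_size, recent_size):
    """Keep the first `start_size` tokens (attention sinks) plus the
    last `recent_size` tokens; drop everything in between."""
    if len(token_ids) <= start_size + recent_size:
        return token_ids
    return token_ids[:start_size] + token_ids[-recent_size:]

print(evict(list(range(10)), start_size=0, recent_size=4))  # [6, 7, 8, 9] -- pure sliding window
print(evict(list(range(10)), start_size=1, recent_size=4))  # [0, 6, 7, 8, 9] -- window plus one sink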

The generated output of the following script also looks fine to me:

python examples/run_streaming_llama.py --enable_streaming --recent_size 128 --start_size 0

Am I doing something wrong? (Could the choice of model or dataset matter?) Is it okay to conclude that the major factor harming generation performance is incorrectly used positional encoding?
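By "incorrectly used positional encoding" I mean roughly the following. This is only my mental model of --enable_pos_shift (illustrative, hypothetical names, may not match the actual implementation): with pos_shift, the positions fed to RoPE are taken from a token's slot inside the cache rather than from its absolute index in the stream, so after eviction the model never sees position ids beyond the cache length.

# Sketch of the two position-id schemes as I understand them (assumption, not repo code).
def positions(num_cached, absolute_offset, pos_shift):
    if pos_shift:
        # cache-local positions: always 0 .. num_cached-1
        return list(range(num_cached))
    # absolute positions keep growing with the stream, so after eviction the
    # model sees position ids far beyond anything it handled during training
    return list(range(absolute_offset, absolute_offset + num_cached))

print(positions(num_cached=256, absolute_offset=4096, pos_shift=False)[:3])  # [4096, 4097, 4098]
print(positions(num_cached=256, absolute_offset=4096, pos_shift=True)[:3])   # [0, 1, 2]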