tomaarsen / attention_sinks

Extend existing LLMs way beyond the original training length with constant memory usage, without retraining
https://huggingface.co/blog/tomaarsen/attention-sinks
Apache License 2.0

Add GPT-NeoX/Pythia support + benchmark results #4

Closed: tomaarsen closed this 10 months ago

tomaarsen commented 10 months ago

Hello!

Pull Request overview

Details

Loading a Pythia model with attention sinks is as simple as:

from attention_sinks import AutoModel

# attention_sinks mirrors the transformers AutoModel interface
model = AutoModel.from_pretrained("EleutherAI/pythia-6.9b", device_map="auto")
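From there, the model behaves like a regular transformers model. A minimal generation sketch, assuming attention_sinks also mirrors AutoModelForCausalLM the way it mirrors AutoModel (the prompt and generation settings below are illustrative, not part of this PR):

import torch
from transformers import AutoTokenizer
from attention_sinks import AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-6.9b")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-6.9b", device_map="auto")

# Tokenize a prompt and generate as with any transformers causal LM
inputs = tokenizer("The attention sink mechanism lets models", return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))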

Benchmarks

python benchmark/perplexity.py --model_name_or_path EleutherAI/pythia-6.9b-deduped --experiment attention_sinks --output_dir benchmark/outputs_pythia_6.9b
python benchmark/perplexity.py --model_name_or_path EleutherAI/pythia-6.9b-deduped --experiment transformers --output_dir benchmark/outputs_pythia_6.9b
python benchmark/perplexity.py --model_name_or_path EleutherAI/pythia-6.9b-deduped --experiment windowed --output_dir benchmark/outputs_pythia_6.9b

python benchmark/plot_perplexity.py --features perplexity vram --title "Log perplexity & VRAM usage of Pythia 6.9B as a function of input lengths" --output_dir benchmark/outputs_pythia_6.9b --log_perplexity_limit 4
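For readers who do not run the scripts, the perplexity experiment boils down to streaming a long text through the model one token at a time, reusing the KV cache, and averaging the negative log-likelihood of each next token. The following is a simplified sketch of that idea using a plain transformers causal LM, not the actual benchmark/perplexity.py:

import torch
from torch.nn import CrossEntropyLoss
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-6.9b-deduped", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-6.9b-deduped")

text = "..."  # a long evaluation text
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

loss_fn = CrossEntropyLoss()
nlls = []
past_key_values = None
for i in range(input_ids.size(1) - 1):
    with torch.no_grad():
        # Feed one token, reusing the cache; with vanilla transformers the cache
        # (and VRAM) grows with input length, which is what the benchmark exposes.
        out = model(input_ids[:, i : i + 1], past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        # Negative log-likelihood of the true next token
        nll = loss_fn(out.logits[:, -1, :], input_ids[:, i + 1])
    nlls.append(nll)

print("log perplexity:", torch.stack(nlls).mean().item())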

Figure: log perplexity & VRAM usage of Pythia 6.9B as a function of input length (pythia_6.9b_ppl_vram plot).