📏 RULER: What’s the Real Context Size of Your Long-Context Language Models?

This repository contains code for our paper RULER: What’s the Real Context Size of Your Long-Context Language Models. RULER generates synthetic examples to evaluate long-context language models with configurable sequence length and task complexity. In RULER we benchmark 17 open-source models across 4 task categories (13 tasks in total), evaluating long-context capabilities beyond simple in-context recall. Our main results are shown below.

| Models | Claimed Length | Effective Length | 4K | 8K | 16K | 32K | 64K | 128K | Avg. | wAvg. (inc) | wAvg. (dec) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama2 (7B) | 4K |  | 85.6 |  |  |  |  |  |  |  |  |
| Jamba-1.5-large* (94B/398B) | 256K | >128K | 96.7 | 96.6 | 96.4 | 96.0 | 95.4 | 95.1 | 96.0 | 95.7 (1st) | 96.3 (1st) |
| Gemini-1.5-pro | 1M | >128K | 96.7 | 95.8 | 96.0 | 95.9 | 95.9 | 94.4 | 95.8 | 95.5 (2nd) | 96.1 (2nd) |
| Jamba-1.5-mini (12B/52B) | 256K | >128K | 95.6 | 95.6 | 94.8 | 94.6 | 92.8 | 90.0 | 93.9 | 93.1 (3rd) | 94.8 (3rd) |
| GPT-4-1106-preview | 128K | 64K | 96.6 | 96.3 | 95.2 | 93.2 | 87.0 | 81.2 | 91.6 | 89.0 (4th) | 94.1 (4th) |
| Llama3.1 (70B) | 128K | 64K | 96.5 | 95.8 | 95.4 | 94.8 | 88.4 | 66.6 | 89.6 | 85.5 (9th) | 93.7 (5th) |
| Command-R-plus-0824 (104B) | 128K | 32K | 96.0 | 95.1 | 94.0 | 92.4 | 85.4 | 64.6 | 87.9 | 83.4 (12th) | 92.4 (6th) |
| Qwen2 (72B) | 128K | 32K | 96.9 | 96.1 | 94.9 | 94.1 | 79.8 | 53.7 | 85.9 | 79.6 (16th) | 92.3 (7th) |
| Command-R-plus (104B) | 128K | 32K | 95.6 | 95.2 | 94.2 | 92.0 | 84.3 | 63.1 | 87.4 | 82.7 (13th) | 92.1 (8th) |
| Command-R-0824 (32B) | 128K | 64K | 94.7 | 93.7 | 93.1 | 90.8 | 86.6 | 74.7 | 88.9 | 86.0 (7th) | 91.9 (9th) |
| GLM4 (9B) | 1M | 64K | 94.7 | 92.8 | 92.1 | 89.9 | 86.7 | 83.1 | 89.9 | 88.0 (5th) | 91.7 (10th) |
| Llama3.1 (8B) | 128K | 32K | 95.5 | 93.8 | 91.6 | 87.4 | 84.7 | 77.0 | 88.3 | 85.4 (10th) | 91.3 (11th) |
| Command-R (35B) | 128K | 32K | 93.8 | 93.3 | 92.4 | 89.5 | 84.9 | 76.0 | 88.3 | 85.5 (8th) | 91.1 (12th) |
| MegaBeam-Mistral (7B) | 512K | 32K | 93.8 | 92.5 | 92.0 | 89.2 | 83.7 | 83.7 | 89.1 | 87.3 (6th) | 91.0 (13th) |
| Mistral-Large (123B) | 128K | 32K | 96.2 | 96.1 | 95.1 | 93.0 | 78.8 | 23.7 | 80.5 | 70.6 (22nd) | 90.4 (14th) |
| GradientAI/Llama3 (70B) | 1M | 16K | 95.1 | 94.4 | 90.8 | 85.4 | 80.9 | 72.1 | 86.5 | 82.6 (14th) | 90.3 (15th) |
| Mixtral-8x22B (39B/141B) | 64K | 32K | 95.6 | 94.9 | 93.4 | 90.9 | 84.7 | 31.7 | 81.9 | 73.5 (20th) | 90.3 (16th) |
| Yi (34B) | 200K | 32K | 93.3 | 92.2 | 91.3 | 87.5 | 83.2 | 77.3 | 87.5 | 84.8 (11th) | 90.1 (17th) |
| Phi3-mini (3.8B) | 128K | 32K | 92.2 | 91.5 | 90.7 | 87.5 | 80.6 | 66.7 | 84.8 | 80.9 (15th) | 88.7 (18th) |
| Phi3-medium (14B) | 128K | 32K | 93.3 | 93.2 | 91.1 | 86.8 | 78.6 | 46.1 | 81.5 | 74.8 (19th) | 88.3 (19th) |
| Mixtral-8x7B (12.9B/46.7B) | 32K | 32K | 94.9 | 92.1 | 92.5 | 85.9 | 72.4 | 44.5 | 80.4 | 72.8 (21st) | 87.9 (20th) |
| GradientAI/Llama3 (8B) | 1M | 16K | 92.8 | 90.3 | 85.7 | 79.9 | 76.3 | 69.5 | 82.4 | 78.5 (17th) | 86.3 (21st) |
| FILM-7B* (7B) | 32K | 32K | 92.8 | 88.2 | 88.1 | 86.9 | 70.1 | 27.1 | 75.5 | 66.4 (24th) | 84.7 (22nd) |
| InternLM2.5 (7B) | 1M | 4K | 88.1 | 85.5 | 84.5 | 82.7 | 75.5 | 68.9 | 80.9 | 77.8 (18th) | 83.9 (23rd) |
| Mistral (7B) | 32K | 16K | 93.6 | 91.2 | 87.2 | 75.4 | 49.0 | 13.8 | 68.4 | 55.6 (26th) | 81.2 (24th) |
| Mistral-Nemo | 128K | 16K | 87.8 | 87.2 | 87.7 | 69.0 | 46.8 | 19.0 | 66.2 | 54.7 (27th) | 77.8 (25th) |
| GLM3 (6B) | 128K | 4K | 87.8 | 83.4 | 78.6 | 69.9 | 56.0 | 42.0 | 69.6 | 62.0 (25th) | 77.2 (26th) |
| LWM (7B) | 1M | <4K | 82.3 | 78.4 | 73.7 | 69.1 | 68.1 | 65.0 | 72.8 | 69.9 (23rd) | 75.7 (27th) |
| DBRX (36B/132B) | 32K | 8K | 95.1 | 93.8 | 83.6 | 63.1 | 2.4 | 0.0 | 56.3 | 38.0 (28th) | 74.7 (28th) |
| Qwen1.5 (72B) | 32K | 8K | 94.9 | 93.8 | 78.0 | 67.8 | 0.0 | 0.0 | 55.7 | 37.5 (29th) | 74.0 (29th) |
| Together (7B) | 32K | 4K | 88.2 | 81.1 | 69.4 | 63.0 | 0.0 | 0.0 | 50.3 | 33.8 (30th) | 66.7 (30th) |
| LongChat (7B) | 32K | <4K | 84.7 | 79.9 | 70.8 | 59.3 | 0.0 | 0.0 | 49.1 | 33.1 (31st) | 65.2 (31st) |
| LongAlpaca (13B) | 32K | <4K | 60.6 | 57.0 | 56.6 | 43.6 | 0.0 | 0.0 | 36.3 | 24.7 (32nd) | 47.9 (32nd) |
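
The Avg. column is the plain mean over the six context lengths, and the wAvg. columns are weighted means whose weights increase (inc) or decrease (dec) linearly with context length. The exact weighting is inferred from the reported numbers rather than quoted from the repo code; the sketch below reproduces the GPT-4-1106-preview row under that assumption.

```python
# Reproduce the Avg., wAvg. (inc) and wAvg. (dec) columns for one row of the
# table above. The linear weighting scheme (weights 1..6 over the six context
# lengths) is inferred from the reported numbers, not taken from the repo code.
scores = [96.6, 96.3, 95.2, 93.2, 87.0, 81.2]  # GPT-4-1106-preview at 4K..128K

avg = sum(scores) / len(scores)
w_inc = [i + 1 for i in range(len(scores))]  # 1, 2, ..., 6: longer contexts weigh more
w_dec = list(reversed(w_inc))                # 6, 5, ..., 1: shorter contexts weigh more
wavg_inc = sum(w * s for w, s in zip(w_inc, scores)) / sum(w_inc)
wavg_dec = sum(w * s for w, s in zip(w_dec, scores)) / sum(w_dec)

print(f"Avg. {avg:.1f}  wAvg. (inc) {wavg_inc:.1f}  wAvg. (dec) {wavg_dec:.1f}")
# -> Avg. 91.6  wAvg. (inc) 89.0  wAvg. (dec) 94.1
```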

💡 Requirements

🔍 Evaluate long-context LMs

1. Download data

2. Download model

3. Run evaluation pipeline

🧠 (Optional) Customize task complexity

The tasks to be evaluated are listed in scripts/config_tasks.sh. The configuration of each task is defined in scripts/synthetic.yaml. The complexity of each task can be adjusted through the arguments described below; a toy sketch illustrating how the niah arguments fit together follows the list.

Retrieval: niah
- type_haystack: repeat / essay / needle
  (repeat: repeated noise sentences; essay: Paul Graham essays; needle: distractor needles)
- type_needle_k: words / numbers / uuids
- type_needle_v: words / numbers / uuids
  (words: adjective-noun pairs; numbers: 7 digits; uuids: 32 digits)
- num_needle_k: int >= 1
  (insert multiple needles into the haystack)
- num_needle_v: int >= 1
  (retrieve multiple values from a single key)
- num_needle_q: int >= 1
  (retrieve values from multiple keys)

Multi-hop Tracing: variable_tracking
- num_chains: int >= 1
  (number of variable name-binding chains)
- num_hops: int >= 1
  (number of name-binding hops in each chain)

Aggregation: common_words_extraction
- freq_cw: int >= 1
  (frequency of the common words)
- freq_ucw: int >= 1
  (frequency of the uncommon words)
- num_cw: int >= 1
  (number of common words)

Aggregation: freq_words_extraction
- alpha: float > 1.0
  (parameter of the distribution used to draw synthetic words; reduce alpha to increase the difficulty of this task. Increasing the number of words to return also increases the difficulty; we use 3 in our evaluations because models perform worse at short context sizes when more words must be returned.)

Question Answering: qa
- dataset: squad or hotpotqa
  (the short-context QA dataset we use as the source)
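
For intuition, here is a toy sketch of how the niah parameters above combine: a haystack of repeated noise sentences (type_haystack: repeat), num_needle_k hidden keys with num_needle_v values each, and a question about num_needle_q of those keys. This is illustrative only; it is not the repository's data-preparation code, and its function name and phrasing are made up for the example.

```python
import random
import uuid

# Toy illustration of the niah parameters; NOT the repo's generation code.
def make_niah_example(num_needle_k=2, num_needle_v=1, num_needle_q=1,
                      haystack_size=200, seed=0):
    rng = random.Random(seed)
    noise = "The grass is green. The sky is blue. The sun is yellow."
    haystack = [noise] * haystack_size  # type_haystack: repeat

    # uuid-style keys (type_needle_k: uuids) and 7-digit values (type_needle_v: numbers)
    keys = [uuid.UUID(int=rng.getrandbits(128)).hex for _ in range(num_needle_k)]
    needles, answers = [], {}
    for k in keys:
        values = [str(rng.randrange(10**6, 10**7)) for _ in range(num_needle_v)]
        answers[k] = values
        for v in values:
            needles.append(f"One of the special magic numbers for {k} is {v}.")

    # Scatter the needles at random positions inside the haystack.
    for needle in needles:
        haystack.insert(rng.randrange(len(haystack) + 1), needle)

    queried = rng.sample(keys, num_needle_q)  # ask about num_needle_q of the keys
    question = f"What are the special magic numbers for {', '.join(queried)}?"
    expected = [v for k in queried for v in answers[k]]
    return " ".join(haystack), question, expected

context, question, expected = make_niah_example()
print(question, expected)
```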

🚀 (Optional) Contribute a new synthetic task

1. Create a python script for data preparation

2. Add task template

3. Add evaluation metric (see the sketch after this list)

4. Add required configurations
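
As an illustration of step 3, a new task needs a scoring function that compares the model output with the expected answers. The sketch below shows a generic containment-based string-match metric in that spirit; the function name and signature are hypothetical and not taken from the repo's interface.

```python
# Hypothetical evaluation metric for a new task: the percentage of expected
# reference strings that appear in the model's prediction.
# (Name and signature are illustrative, not the repo's actual API.)
def string_match_recall(prediction: str, references: list[str]) -> float:
    prediction = prediction.lower()
    hits = sum(ref.lower() in prediction for ref in references)
    return 100.0 * hits / len(references)

print(string_match_recall("The magic numbers are 1234567 and 7654321.",
                          ["1234567", "7654321"]))  # -> 100.0
```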

🛠️ Limitations

While tasks in RULER are designed to be configurable, we only evaluate the above models with 13 task configurations. These tasks were selected because most models achieve good (some almost perfect) performance at short context sizes (<= 4K), which leaves ample room to observe degradation as the input length grows. We did not include more complex tasks on which models perform poorly even at short context sizes, nor did we stress-test every model with more difficult task configurations. Although RULER covers four task categories, extending previous evaluation protocols, and provides a clean test bed for sanity-checking LMs with a known upper bound on performance, it is by no means comprehensive and cannot replace evaluation on realistic tasks, which remains preferable. We welcome contributions of new tasks and task categories to help evaluate long-context capabilities.

📝 Citation

@article{hsieh2024ruler,
  title={RULER: What's the Real Context Size of Your Long-Context Language Models?},
  author={Cheng-Ping Hsieh and Simeng Sun and Samuel Kriman and Shantanu Acharya and Dima Rekesh and Fei Jia and Yang Zhang and Boris Ginsburg},
  year={2024},
  journal={arXiv preprint arXiv:2404.06654},
}

Disclaimer: This project is strictly for research purposes and is not an official product from NVIDIA.