irthomasthomas / undecidability

hsiehjackson/RULER: This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models? #848

Open ShellLM opened 1 month ago

ShellLM commented 1 month ago

RULER: What's the Real Context Size of Your Long-Context Language Models?

README

TITLE

hsiehjackson/RULER: This repo contains the source code for RULER: What's the Real Context Size of Your Long-Context Language Models?

CONTENT

This repository contains code for our paper RULER: What's the Real Context Size of Your Long-Context Language Models? RULER generates synthetic examples to evaluate long-context language models with configurable sequence length and task complexity. We benchmark 17 open-source models across 4 task categories (13 tasks in total) in RULER, evaluating long-context capabilities beyond simple in-context recall. Here are our main results.

| Models | Claimed Length | Effective Length | 4K | 8K | 16K | 32K | 64K | 128K | Avg. | wAvg. (inc) | wAvg. (dec) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama2 (7B) | 4K | | 85.6 | | | | | | | | |
| Gemini-1.5-pro | 1M | >128K | 96.7 | 95.8 | 96.0 | 95.9 | 95.9 | 94.4 | 95.8 | 95.5 (1st) | 96.1 (1st) |
| GPT-4-1106-preview | 128K | 64K | 96.6 | 96.3 | 95.2 | 93.2 | 87.0 | 81.2 | 91.6 | 89.0 (2nd) | 94.1 (2nd) |
| Llama3.1 (70B) | 128K | 64K | 96.5 | 95.8 | 95.4 | 94.8 | 88.4 | 66.6 | 89.6 | 85.5 (5th) | 93.7 (3rd) |
| Qwen2 (72B) | 128K | 32K | 96.9 | 96.1 | 94.9 | 94.1 | 79.8 | 53.7 | 85.9 | 79.6 (10th) | 92.3 (4th) |
| Command-R-plus (104B) | 128K | 32K | 95.6 | 95.2 | 94.2 | 92.0 | 84.3 | 63.1 | 87.4 | 82.7 (8th) | 92.1 (5th) |
| GLM4 (9B) | 1M | 64K | 94.7 | 92.8 | 92.1 | 89.9 | 86.7 | 83.1 | 89.9 | 88.0 (3rd) | 91.7 (6th) |
| Llama3.1 (8B) | 128K | 32K | 95.5 | 93.8 | 91.6 | 87.4 | 84.7 | 77.0 | 88.3 | 85.4 (6th) | 91.3 (7th) |
| Command-R (35B) | 128K | 32K | 93.8 | 93.3 | 92.4 | 89.5 | 84.9 | 76.0 | 88.3 | 85.5 (4th) | 91.1 (8th) |
| GradientAI/Llama3 (70B) | 1M | 16K | 95.1 | 94.4 | 90.8 | 85.4 | 80.9 | 72.1 | 86.5 | 82.6 (9th) | 90.3 (9th) |
| Mixtral-8x22B (39B/141B) | 64K | 32K | 95.6 | 94.9 | 93.4 | 90.9 | 84.7 | 31.7 | 81.9 | 73.5 (13th) | 90.3 (10th) |
| Yi (34B) | 200K | 32K | 93.3 | 92.2 | 91.3 | 87.5 | 83.2 | 77.3 | 87.5 | 84.8 (7th) | 90.1 (11th) |
| Phi3-medium (14B) | 128K | 32K | 93.3 | 93.2 | 91.1 | 86.8 | 78.6 | 46.1 | 81.5 | 74.8 (12th) | 88.3 (12th) |
| Mixtral-8x7B (12.9B/46.7B) | 32K | 32K | 94.9 | 92.1 | 92.5 | 85.9 | 72.4 | 44.5 | 80.4 | 72.8 (14th) | 87.9 (13th) |
| GradientAI/Llama3 (8B) | 1M | 16K | 92.8 | 90.3 | 85.7 | 79.9 | 76.3 | 69.5 | 82.4 | 78.5 (11th) | 86.3 (14th) |
| FILM-7B* (7B) | 32K | 32K | 92.8 | 88.2 | 88.1 | 86.9 | 70.1 | 27.1 | 75.5 | 66.4 (16th) | 84.7 (15th) |
| Mistral (7B) | 32K | 16K | 93.6 | 91.2 | 87.2 | 75.4 | 49.0 | 13.8 | 68.4 | 55.6 (19th) | 81.2 (16th) |
| Mistral-Nemo | 128K | 16K | 87.8 | 87.2 | 87.7 | 69.0 | 46.8 | 19.0 | 66.2 | 54.7 (20th) | 77.8 (17th) |
| GLM3 (6B) | 128K | 4K | 87.8 | 83.4 | 78.6 | 69.9 | 56.0 | 42.0 | 69.6 | 62.0 (18th) | 77.2 (18th) |
| LWM (7B) | 1M | <4K | 82.3 | 78.4 | 73.7 | 69.1 | 68.1 | 65.0 | 72.8 | 69.9 (15th) | 75.7 (19th) |
| Phi3-mini (3.8B) | 128K | 4K | 86.7 | 78.1 | 75.6 | 70.3 | 58.9 | 43.3 | 68.8 | 62.2 (17th) | 75.5 (20th) |
| DBRX (36B/132B) | 32K | 8K | 95.1 | 93.8 | 83.6 | 63.1 | 2.4 | 0.0 | 56.3 | 38.0 (21st) | 74.7 (21st) |
| Qwen1.5 (72B) | 32K | 8K | 94.9 | 93.8 | 78.0 | 67.8 | 0.0 | 0.0 | 55.7 | 37.5 (22nd) | 74.0 (22nd) |
| Together (7B) | 32K | 4K | 88.2 | 81.1 | 69.4 | 63.0 | 0.0 | 0.0 | 50.3 | 33.8 (23rd) | 66.7 (23rd) |
| LongChat (7B) | 32K | <4K | 84.7 | 79.9 | 70.8 | 59.3 | 0.0 | 0.0 | 49.1 | 33.1 (24th) | 65.2 (24th) |
| LongAlpaca (13B) | 32K | <4K | 60.6 | 57.0 | 56.6 | 43.6 | 0.0 | 0.0 | 36.3 | 24.7 (25th) | 47.9 (25th) |

Despite achieving nearly perfect performance on the vanilla needle-in-a-haystack (NIAH) test, all models except Gemini-1.5-pro exhibit large degradation on RULER tasks as sequence length increases. While every model claims a context size of 32K tokens or greater, only half of them can effectively handle a sequence length of 32K, i.e., exceed the qualitative threshold set by Llama-2-7B's performance at 4K (85.6). Almost all models fall below this threshold before reaching their claimed context lengths.

Notes (FILM-7B)

The FILM-7B results were submitted by the authors of that paper. They use YaRN without further training for evaluation lengths exceeding 32K (64K and 128K), and they do not use the one-shot example for the CWE task.

Requirements

Docker container: docker pull cphsieh/ruler:0.1.0. The requirements are listed in docker/Dockerfile and docker/requirements.txt. Use the following commands to build the container based on NVIDIA's PyTorch container nvcr.io/nvidia/pytorch:23.08-py3:

cd docker/
DOCKER_BUILDKIT=1 docker build -f Dockerfile -t cphsieh/ruler:0.1.0 .
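
To run the evaluation inside the built image, a minimal launch sketch is shown below. The --gpus flag assumes the NVIDIA Container Toolkit is installed, and the /workspace/RULER mount point is an arbitrary choice for this example, not a path required by the repo.

cd ..  # back to the repository root after building the image
docker run --gpus all -it --rm \
  -v "$(pwd)":/workspace/RULER \
  -w /workspace/RULER \
  cphsieh/ruler:0.1.0 bash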

Evaluate long-context LMs

  1. Download data
    • Paul Graham Essays for NIAH are downloaded from NIAH Github and Paul Graham Blog.
    • QA datasets are downloaded from SQuAD and HotpotQA.
      cd scripts/data/synthetic/json/
      python download_paulgraham_essay.py
      bash download_qa_dataset.sh
  2. Download model
    • We download the models from Hugging Face.
    • The input template of each model is stored in scripts/data/template.py. Please add a new template if your model uses a different chat template.
    • (Optional) If you are using TensorRT-LLM, please build your model engine based on their example scripts (e.g., Llama) with their Docker container.
  3. Run evaluation pipeline
    • Setup run.sh
      GPUS="" # number of GPUs
      ROOT_DIR="" # the path that stores generated task samples and model predictions.
      MODEL_DIR="" # the path that contains individual model folders from Hugging Face.
      ENGINE_DIR="" # the path that contains individual engine folders from TensorRT-LLM.
    • Setup config_models.sh
      case $MODEL_NAME in
       YOUR_HF_MODEL_NAME)
           MODEL_PATH=${MODEL_DIR}/YOUR_MODEL_FOLDER
           MODEL_TEMPLATE_TYPE="" # base, meta-chat, etc. defined in `scripts/data/template.py`
           MODEL_FRAMEWORK="" # hf or vllm
           ;;
       YOUR_TRTLLM_ENGINE_NAME)
           MODEL_PATH=${ENGINE_DIR}/YOUR_ENGINE_FOLDER
           MODEL_TEMPLATE_TYPE="" # base, meta-chat, etc. defined in `scripts/data/template.py`
           MODEL_FRAMEWORK="trtllm"
           ;;
       YOUR_OPENAI_MODEL_NAME)
           MODEL_PATH="" # OpenAI model name listed in https://platform.openai.com/docs/models/
           MODEL_TEMPLATE_TYPE="base"
           MODEL_FRAMEWORK="openai"
           TOKENIZER_PATH="cl100k_base"
           TOKENIZER_TYPE="openai"
           OPENAI_API_KEY="" # your OpenAI API key
           ;;
       YOUR_GEMINI_MODEL_NAME)
           MODEL_PATH="" # Gemini model name listed in https://ai.google.dev/gemini-api/docs/models/gemini
           MODEL_TEMPLATE_TYPE="base"
           MODEL_FRAMEWORK="gemini"
           TOKENIZER_PATH=$MODEL_PATH
           TOKENIZER_TYPE="gemini"
           GEMINI_API_KEY="" # your Gemini API key
           ;;
       esac
    • Start evaluation based on our default synthetic benchmark
      bash run.sh YOUR_MODEL_NAME synthetic
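
As a concrete illustration of step 3, a hypothetical entry to add inside the case block of config_models.sh for a Hugging Face model served with vLLM is sketched below; the model name, folder, and template type are placeholders chosen for this example, not entries shipped with the repo.

       Llama-3.1-8B-Instruct)
           MODEL_PATH=${MODEL_DIR}/Llama-3.1-8B-Instruct  # model folder downloaded from Hugging Face
           MODEL_TEMPLATE_TYPE="meta-chat"  # must match a template defined in `scripts/data/template.py`
           MODEL_FRAMEWORK="vllm"
           ;;

With such an entry in place, the evaluation from step 3 would be started with: bash run.sh Llama-3.1-8B-Instruct synthetic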

(Optional) Customize task complexity

The tasks to be evaluated are stored in scripts/config_tasks.sh, and the configuration of each task is defined in scripts/synthetic.yaml. The complexity of each task can be adjusted through the arguments described in detail below.

Each entry below lists the category and task name, followed by its configurable arguments.

Retrieval / niah
• type_haystack: repeat / essay / needle
  - repeat: repeated noise sentences
  - essay: Paul Graham Essays
  - needle: distracted needles
• type_needle_k: words / numbers / uuids
  - words: adjective-noun
  - numbers: 7 digits
  - uuids: 32 digits
• num_needle_k: int >= 1
  - add multiple needles to the haystack
• num_needle_v: int >= 1
  - retrieve multiple values from a single key
• num_needle_q: int >= 1
  - retrieve multiple values from multiple keys

Multi-hop Tracing / variable_tracking
• num_chains: int >= 1
  - number of variable name-binding chains
• num_hops: int >= 1
  - number of name-binding hops in each chain

Aggregation / common_words_extraction
• freq_cw: int >= 1
  - frequency of common words
• freq_ucw: int >= 1
  - frequency of uncommon words
• num_cw: int >= 1
  - number of common words

Aggregation / freq_words_extraction
• alpha: float > 1.0
  - parameter of the distribution used to draw synthetic words. Reduce alpha to increase the difficulty of this task. Note that increasing the number of words to return also increases the difficulty; we use 3 in our evaluations because models show worse performance at short context sizes when more words need to be returned.

Question Answering / qa
• dataset: squad or hotpotqa
  - the short-context QA dataset we use

(Optional) Contribute a new synthetic task

  1. Create a python script for data preparation
    • Add basic arguments (required) and complexity configurations in the python script.
    • Verify the script is reproducible given a tokenizer, a sequence length, and a random seed.
    • Save the script under the folder scripts/data/synthetic.
  2. Add task template
    • Add template and tokens_to_generate in scripts/data/synthetic/constants.py.
    • Add answer_prefix to prevent the model from refusing to answer.
  3. Add evaluation metric
    • Add the automatic metric to evaluate your task in scripts/eval/synthetic/constants.py.
  4. Add required configurations
    • Define your task name and complexity configurations in scripts/synthetic.yaml.
    • Add your task name in scripts/config_tasks.sh.

Limitations

While tasks in RULER are designed to be configurable, we only evaluate the above models with 13 task configurations. These tasks were selected because most models achieve good (some almost perfect) performance at short context sizes (<= 4K), which leaves ample room to observe degradation as we extend the input length. We did not include more complex tasks in RULER on which models show worse performance at short context sizes, and we did not stress test every model with more difficult task configurations. Although RULER covers four task categories, extending previous evaluation protocols, and provides a clean test bed for sanity-checking LMs with known ...

Suggested labels

None

ShellLM commented 1 month ago

Related content

#456 similarity score: 0.91

#625 similarity score: 0.91

#772 similarity score: 0.9

#811 similarity score: 0.89

#304 similarity score: 0.89

#762 similarity score: 0.89