microsoft / MInference

To speed up the inference of long-context LLMs, MInference approximates attention with dynamic sparse computation, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
https://aka.ms/MInference
MIT License

[Question]: For the tests such as RULER and InfiniteBench mentioned in the paper, what datasets are used to search for patterns? #16

Open hijkzzz opened 4 days ago

hijkzzz commented 4 days ago

Describe the issue

No response

iofu728 commented 4 days ago

Hi @hijkzzz, thanks for your interest in MInference.

We use a single example from the KV retrieval dataset to search offline for the optimal sparse head pattern, which we then apply to all evaluation benchmarks. The motivation is that the sparse pattern remains consistent across tasks, even though we select a highly dynamic task for the search.

You can follow these instructions to search for the sparse head pattern:

cd experiments/infinite_bench
python run_infinitebench.py \
    --task kv_retrieval \
    --model_name_or_path gradientai/Llama-3-8B-Instruct-262k \
    --data_dir ./data \
    --output_dir ./results \
    --max_seq_length 30000 \
    --rewrite \
    --is_search \
    --start_example_id 3 \
    --topk_dims_file_path Llama_3_8B_Instruct_262k_kv_out_v32_fit_o_best_pattern.json \
    --num_eval_examples 20 --topk 1 --starting_layer 0 --attn_type minference
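
For reference, once the search finishes and writes the best-pattern JSON, the patched model can be used roughly as in the sketch below. This follows the standard MInference patching flow; the config_path argument pointing at the searched-pattern file is an assumption on my part, so please check the library for the exact parameter name.

# Minimal sketch (not the official example): applying MInference to a
# Hugging Face model after the offline pattern search.
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

# Patch the attention modules with MInference's dynamic sparse attention,
# using the head patterns found by the offline search step above.
minference_patch = MInference(
    "minference",
    model_name,
    config_path="Llama_3_8B_Instruct_262k_kv_out_v32_fit_o_best_pattern.json",  # assumption: verify the argument name
)
model = minference_patch(model)

# Pre-filling a long prompt now goes through the sparse kernels.
inputs = tokenizer("A very long context ...", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))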
hijkzzz commented 4 days ago


The part I find difficult to understand is why the accuracy in the experimental results is even higher than that of full attention. Is it because the patterns in these test tasks are closely related to the offline search dataset?

iofu728 commented 4 days ago


Thank you for your great question. Based on our analysis of the cases and results, we believe that full attention may over-distribute its focus in long-context scenarios, while sparse attention can enhance model performance by concentrating on more relevant information. This phenomenon has also been observed in baselines such as StreamingLLM. However, StreamingLLM typically performs better on tasks that primarily require local information. Our method is more effective at preserving information in more dynamic tasks.
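
As a toy illustration of the over-distribution intuition (illustrative only, not MInference's kernel): with a very long key sequence, a full softmax spreads a noticeable fraction of its probability mass over irrelevant positions, while keeping only the top-scoring keys concentrates the mass on the relevant ones.

# Toy illustration: full softmax vs. top-k sparse attention over a long
# context, measuring how much probability mass lands on "relevant" keys.
import torch

torch.manual_seed(0)
seq_len, d, k = 100_000, 128, 4_096          # context length, head dim, sparsity budget
q = torch.randn(d)
keys = torch.randn(seq_len, d)
relevant = torch.randperm(seq_len)[:8]       # pretend these positions hold the answer
keys[relevant] += 0.5 * q                    # give them a modest score boost

scores = keys @ q / d**0.5

# Full attention: softmax over every position.
full = torch.softmax(scores, dim=-1)

# Sparse attention: keep only the top-k scores, mask the rest to -inf.
masked = torch.full_like(scores, float("-inf"))
topk = scores.topk(k).indices
masked[topk] = scores[topk]
sparse = torch.softmax(masked, dim=-1)

print("mass on relevant keys, full:  ", full[relevant].sum().item())
print("mass on relevant keys, sparse:", sparse[relevant].sum().item())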

Regarding whether "these tasks are more closely aligned with the search dataset," we do not think this is the case. First, we used only a single example for the search, aiming to leverage the static patterns within dynamic scenarios to determine sparse attention. Second, sparse attention is highly dynamic across different tasks and inputs, making it difficult to generalize using the searched indices, as evidenced by the "Ours w/ static" results.

hijkzzz commented 4 days ago


If I understand correctly, could the model achieve similar performance by lowering the attention temperature and then fine-tuning with just a single sample?

iofu728 commented 4 days ago

I'm not entirely sure whether SFT with a modified temperature would produce similar results. There are two main considerations:

  1. Lowering the temperature makes the attention scores sharper everywhere (see the toy sketch below). For long-context LLMs, the attention over-distribution may involve not only small values but also some larger ones. Sparse attention and SSM methods effectively introduce spatial priors.

  2. These small values are not necessarily meaningless; in the lower layers, for example, they might help with information transmission.
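
As a toy sketch of point 1 (illustrative only): dividing the attention scores by a temperature T < 1 sharpens the softmax globally but keeps it dense over the whole context, whereas a sparse mask removes positions outright, so the two change the distribution in different ways.

# Toy sketch: temperature sharpening vs. structural sparsity.
import torch

torch.manual_seed(0)
scores = torch.randn(10_000)                 # pre-softmax attention scores for one query

def entropy(p):
    p = p[p > 0]
    return -(p * p.log()).sum().item()

full = torch.softmax(scores, dim=-1)         # T = 1
sharp = torch.softmax(scores / 0.5, dim=-1)  # T = 0.5: sharper, but still dense

masked = torch.full_like(scores, float("-inf"))
idx = scores.topk(256).indices               # keep only the top 256 positions
masked[idx] = scores[idx]
sparse = torch.softmax(masked, dim=-1)

print("entropy, full  :", entropy(full))
print("entropy, T=0.5 :", entropy(sharp))
print("entropy, sparse:", entropy(sparse))
print("positions above 1e-6, sharp :", (sharp > 1e-6).sum().item())
print("positions above 1e-6, sparse:", (sparse > 1e-6).sum().item())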