microsoft / MInference

[NeurIPS'24 Spotlight] To speed up long-context LLMs' inference, MInference approximates attention with dynamic sparse computation, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
https://aka.ms/MInference
MIT License

[Question]: RuntimeError encountered when trying to reproduce results in needle in a haystack #88

Open lepangdan opened 4 days ago

lepangdan commented 4 days ago

Describe the issue

Hi,

Thanks again for your help. I encountered an error while reproducing the needle_in_a_haystack results by running bash experiments/needle_in_a_haystack/run_needle.sh, and would appreciate any insights:

[   1000   72357  143714  215071  286429  357786  429143  500500  571857
  643214  714571  785929  857286  928643 1000000]
[ 286429  357786  429143  500500  571857  643214  714571  785929  857286
  928643 1000000]
# Too long, ignore some logs
 File "/home/far/MInference/minference/modules/minference_forward.py", line 656, in forward
    part_o = self.gather_last_q_vertical_slash_topk_v4(part_q, part_k, part_v, head)
  File "/home/far/MInference/minference/modules/minference_forward.py", line 463, in gather_last_q_vertical_slash_topk_v4
    return fc(q, k, v, vertical_size, slash_size)
  File "/home/far/MInference/minference/modules/minference_forward.py", line 383, in vertical_and_slash_kernel
    slash = sum_all_diagonal_matrix(qk)[...,:-last_q + 1]
  File "/home/far/MInference/minference/modules/minference_forward.py", line 103, in sum_all_diagonal_matrix
    zero_mat = torch.zeros((b, h, n, n)).to(mat.device) # Zero matrix used for padding
  File "/home/far/MInference/minference/modules/minference_forward.py", line 103, in sum_all_diagonal_matrix
    zero_mat = torch.zeros((b, h, n, n)).to(mat.device) # Zero matrix used for padding
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

I noticed the error only occurs from job 4 onward when the --kv_cache_cpu argument is used; jobs in the range [0-4) run fine. Any suggestions on this?
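
As context for the traceback above, here is a minimal debugging sketch, assuming stock PyTorch/CUDA tooling and nothing specific to MInference: since CUDA reports illegal accesses asynchronously, forcing synchronous kernel launches usually makes the reported Python frame point at the kernel that actually faulted.

# Generic PyTorch/CUDA debugging sketch, not an MInference API.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA initializes

import torch  # imported after the env var on purpose so the setting takes effect

# ... then run the failing job (e.g. the range starting at job 4) from this process.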

Additionally, I found that the vllm module is required to run the needle_in_a_haystack experiment. In my view, vllm isn't necessary for MInference itself. Is there a specific reason for this requirement, or have I missed something?

Looking forward to your response!

iofu728 commented 4 days ago

Hi @lepangdan, thanks for your feedback.

It doesn't seem to be related to vLLM. It might be due to GPU memory not being fully reclaimed yet. Could you try running the Python command separately or upgrading Triton?

python experiments/needle_in_a_haystack/needle_test.py \
    --model_name gradientai/Llama-3-8B-Instruct-Gradient-1048k \
    --max_length 1000000 \
    --min_length 1000 \
    --rounds 5 \
    --attn_type minference \
    --kv_cache_cpu \
    --output_path ./needle \
    --run_name minference_LLaMA_1M \
    --jobs 4-15
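
To rule out the two suspects above before re-running, a quick check like the following (a generic sketch, not part of the repo) prints the installed Triton version and how much GPU memory is actually free:

import torch
import triton

# MInference's sparse-attention kernels are written in Triton, so an outdated
# Triton build can surface as opaque CUDA errors.
print("triton:", triton.__version__)
print("torch:", torch.__version__, "| cuda:", torch.version.cuda)

# Memory that an earlier job has not released yet shows up as a low "free" value.
free_b, total_b = torch.cuda.mem_get_info()
print(f"free: {free_b / 2**30:.1f} GiB / total: {total_b / 2**30:.1f} GiB")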
lepangdan commented 3 days ago

Hi @iofu728 ,

The error persists after running the command you suggested. Any further ideas?

Additionally, could you please confirm how many A100s and how much total GPU memory you used for the needle experiment?

iofu728 commented 2 days ago

Hi @lepangdan,

For the NIAH experiments, we used a single A100 GPU with 216GB CPU memory for inputs up to 800K tokens, while 900K and 1M tokens were tested on a setup with a single A100 GPU and 1TB CPU memory.
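
For a rough sense of why --kv_cache_cpu needs this much host memory, the sketch below estimates the KV-cache size alone, assuming a Llama-3-8B-style GQA layout (32 layers, 8 KV heads, head_dim 128) and fp16 entries; these numbers are an illustrative assumption, not measurements from the repo.

# Back-of-the-envelope KV-cache size for long-context prefill with CPU offload.
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2  # assumed Llama-3-8B GQA config, fp16

def kv_cache_gib(num_tokens: int) -> float:
    # 2x for keys and values
    return 2 * layers * kv_heads * head_dim * dtype_bytes * num_tokens / 2**30

for n in (800_000, 1_000_000):
    print(f"{n:>9,} tokens -> ~{kv_cache_gib(n):.0f} GiB of KV cache")
# ~98 GiB at 800K and ~122 GiB at 1M, before model weights and activations.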

Could you try setting specific job ranges like “5-6” or “6-7”? Let me know if you encounter any issues!
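
One way to try that without editing the shell script is to launch each job index in its own process, reusing the flags from the command above (a sketch, not a script that ships with the repo):

import subprocess

# Run one job index at a time so a single failing context length can be isolated;
# the flags mirror the needle_test.py command quoted earlier in this thread.
for job in range(4, 15):
    subprocess.run([
        "python", "experiments/needle_in_a_haystack/needle_test.py",
        "--model_name", "gradientai/Llama-3-8B-Instruct-Gradient-1048k",
        "--max_length", "1000000",
        "--min_length", "1000",
        "--rounds", "5",
        "--attn_type", "minference",
        "--kv_cache_cpu",
        "--output_path", "./needle",
        "--run_name", "minference_LLaMA_1M",
        "--jobs", f"{job}-{job + 1}",
    ], check=False)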