microsoft / MInference

To speed up long-context LLMs' inference, MInference computes attention with approximate and dynamic sparse methods, which reduces pre-filling inference latency by up to 10x on an A100 while maintaining accuracy.
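For context, the patching pattern from the project README looks roughly like the sketch below; the model name, prompt, and generation settings are placeholders.

```python
# Minimal sketch of the MInference usage pattern; model name and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # placeholder long-context model
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Patch the model's pre-filling attention with MInference's dynamic sparse kernels.
minference_patch = MInference("minference", model_name)
model = minference_patch(model)

inputs = tokenizer("A very long context ...", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```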
https://aka.ms/MInference
MIT License

[Question]: Errors when I reproduce results in Table 5 (MInference + SnapKV) & poor results with attn_type=minference_with_dense #55

Closed HayeonLee closed 1 month ago

HayeonLee commented 1 month ago

Describe the issue

Hi, thank you for sharing the awesome work!

I am trying to reproduce the results in Table 5 and have run into two problems.

1) MInference + SnapKV: I encountered 'RuntimeError: CUDA error: an illegal memory access was encountered.', so I am having difficulty reproducing the results.

(screenshot of the CUDA illegal memory access traceback)

2) An additional question regarding attn_type: minference_with_dense: I thought it would be equivalent to full attention with Flash Attention, but it shows poor results compared with full attention or attn_type: minference.

I used a single A100.

Could you please help me fix these situations? Any comments are welcome. Thank you!

iofu728 commented 1 month ago

Hi @HayeonLee,

Thanks for your support with MInference.

  1. I did some preliminary testing, and it appears that this issue is caused by updates in Transformers' past_key_value management mechanism. For now, you can resolve it with pip install transformers==4.41.2 (see the sketch after this list). We will look into how we can fix this in our package in the future.
  2. Have you installed Flash Attention locally? If so, there should be no precision differences. If not, the precision differences come from variations between the CUDA version of Flash Attention and the Triton version of Flash Attention.
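To tie these together, here is a rough sketch (not an official recipe) of applying both points; the model name is a placeholder, and the attn_type values are the ones discussed in this thread.

```python
# Sketch of the two workarounds above; not an official recipe.
# Step 1: pin the Transformers version first:
#   pip install transformers==4.41.2
import importlib.util

from transformers import AutoModelForCausalLM
from minference import MInference

# Step 2: check whether the CUDA build of Flash Attention is installed locally.
# If it is missing, the Triton implementation of Flash Attention is used instead,
# which can introduce small precision differences.
has_flash_attn = importlib.util.find_spec("flash_attn") is not None
print("flash_attn installed:", has_flash_attn)

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # placeholder long-context model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2" if has_flash_attn else "sdpa",
)

# Patch with the attn_type in question; swap in "minference" for the dynamic
# sparse pre-filling path.
minference_patch = MInference("minference_with_dense", model_name)
model = minference_patch(model)
```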
HayeonLee commented 1 month ago

Hi @iofu728, those fixes worked for me! Thank you for the valuable comments :D