huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

The xformers result cannot match the normal attention result #24929

Closed. guozhiyao closed this issue 1 year ago.

guozhiyao commented 1 year ago

System Info

Collecting environment information...
PyTorch version: 1.13.0
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Alibaba Group Enterprise Linux Server 7.2 (Paladin) (x86_64)
GCC version: (GCC) 7.5.0
Clang version: Could not collect
CMake version: version 3.22.0
Libc version: glibc-2.32

Python version: 3.8.13 (default, Oct 21 2022, 23:50:54) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.10.112-005.ali5000.alios7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.3.58
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version: 470.154
cuDNN version: Probably one of the following:
  /usr/lib64/libcudnn.so.8.4.0
  /usr/lib64/libcudnn_adv_infer.so.8.4.0
  /usr/lib64/libcudnn_adv_train.so.8.4.0
  /usr/lib64/libcudnn_cnn_infer.so.8.4.0
  /usr/lib64/libcudnn_cnn_train.so.8.4.0
  /usr/lib64/libcudnn_ops_infer.so.8.4.0
  /usr/lib64/libcudnn_ops_train.so.8.4.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.4
[pip3] torch==1.13.0+cu111
[pip3] torchaudio==0.11.0
[pip3] torchvision==0.14.0
[conda] No relevant packages

A matching Triton is not available, some optimizations will not be enabled. Error caught was: module 'triton.language' has no attribute 'constexpr'
xFormers 0.0.15.dev+103e863.d20221125
memory_efficient_attention.flshatt: available - requires GPU with compute capability 7.5+
memory_efficient_attention.cutlass: available
memory_efficient_attention.small_k: available
swiglu.fused.p.cpp: available
is_triton_available: False
is_functorch_available: False
pytorch.version: 1.13.0
pytorch.cuda: available
gpu.compute_capability: 8.0
gpu.name: NVIDIA A100-SXM4-80GB

Who can help?

No response

Information

Tasks

Reproduction

I am using the GPT-NeoX model for inference and tried to modify _attn to use xformers to speed it up, but the generated output is wrong with use_cache=True while it is correct with use_cache=False. Starting from #24653, I replaced the _attn function of GPTNeoXAttention with the code below:

    def _xformers_attn(self, query, key, value, **kwargs):
        # q, k, v: [bs, num_attention_heads, seq_len, attn_head_size]
        # xformers expects input tensors in [B, M, H, K] format, where B is the batch size,
        # M the sequence length, H the number of heads, and K the embedding size per head.

        # [bs, num_attention_heads, seq_len, attn_head_size] -> [bs, seq_len, num_attention_heads, attn_head_size]
        query = query.transpose(1, 2).to(value.dtype)
        key = key.transpose(1, 2).to(value.dtype)
        value = value.transpose(1, 2)

        # xformers returns the multi-head attention output with shape [B, Mq, H, Kv];
        # the causal structure is enforced via attn_bias=LowerTriangularMask().
        output = xops.memory_efficient_attention(
            query, key, value, op=xops.MemoryEfficientAttentionFlashAttentionOp,
            attn_bias=xops.LowerTriangularMask(),
            p=self.config.attention_dropout if self.training else 0.0
        )
        # [b, sq, np, hn] -> [b, np, sq, hn], back to the layout _attn returns
        matmul_result = output.transpose(1, 2)

        # Return (attn_output, attn_weights); attention weights are not materialized here.
        return matmul_result.to(query.dtype), None

The generated output is correct with use_cache=False but wrong with use_cache=True (the first token is right, but the later ones are wrong). Here is the generated output with use_cache=True:

[screenshot of the generated output]
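
For reference, a small sketch of why use_cache=True changes what the attention function sees during generation (illustrative shapes only, not code from the model): with caching enabled, only the newly generated token is passed as the query while the cached keys and values keep growing, so the query length and key length stop matching after the first step.

    import torch

    bs, nh, hd = 1, 8, 64  # batch size, num_attention_heads, attn_head_size (illustrative)

    # use_cache=False: every step re-encodes the whole sequence, so query, key and value
    # all have the same length and a lower-triangular mask lines up exactly.
    seq_len = 10
    q_full = torch.randn(bs, nh, seq_len, hd)
    k_full = torch.randn(bs, nh, seq_len, hd)
    print(q_full.shape[2], k_full.shape[2])  # 10 10

    # use_cache=True: past keys/values are concatenated with the new step, but the query
    # contains only the new token, so the query length (1) no longer equals the key length.
    q_step = torch.randn(bs, nh, 1, hd)             # only the newly generated token
    k_step = torch.randn(bs, nh, seq_len + 1, hd)   # cached keys + the new token
    print(q_step.shape[2], k_step.shape[2])  # 1 11

If the causal bias does not account for the query being shorter than the key in this cached-decoding case, the attended positions can end up misaligned, which would only show up from the second generated token onward.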

I have also tested the outputs of _attn and _xformers_attn in https://github.com/facebookresearch/xformers/issues/798 , and they match.
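
For completeness, here is a minimal sketch of the kind of standalone comparison referenced above (my own reconstruction, not the exact script from the linked xformers issue): reference softmax attention with a causal mask versus xops.memory_efficient_attention on identical random inputs, i.e. the full-sequence / use_cache=False case.

    import torch
    import xformers.ops as xops

    bs, seq_len, nh, hd = 2, 16, 8, 64
    q = torch.randn(bs, seq_len, nh, hd, device="cuda", dtype=torch.float16)
    k = torch.randn(bs, seq_len, nh, hd, device="cuda", dtype=torch.float16)
    v = torch.randn(bs, seq_len, nh, hd, device="cuda", dtype=torch.float16)

    # Reference: softmax(QK^T / sqrt(hd)) with a lower-triangular (causal) mask,
    # computed in float32 in the [bs, nh, seq_len, hd] layout used by _attn.
    qh, kh, vh = (t.transpose(1, 2).float() for t in (q, k, v))
    scores = qh @ kh.transpose(-1, -2) / hd ** 0.5
    causal = torch.full((seq_len, seq_len), float("-inf"), device=scores.device).triu(1)
    ref = (scores + causal).softmax(-1) @ vh                 # [bs, nh, seq_len, hd]

    # xformers: inputs in [B, M, H, K] layout with a causal attention bias.
    out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())
    out = out.transpose(1, 2).float()                        # back to [bs, nh, seq_len, hd]

    print(torch.allclose(ref, out, atol=1e-2, rtol=1e-2))    # expected: True (fp16 tolerance)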

Expected behavior

I want to speed up the attention with xformers.
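
For context, a rough timing sketch of the kind of speedup being targeted (illustrative sizes and loop counts, not a rigorous benchmark): plain softmax attention versus xformers memory_efficient_attention on the same inputs.

    import time
    import torch
    import xformers.ops as xops

    bs, seq_len, nh, hd = 8, 1024, 16, 64
    q = torch.randn(bs, seq_len, nh, hd, device="cuda", dtype=torch.float16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    def plain_attention(q, k, v):
        # Naive causal attention in the [bs, nh, seq_len, hd] layout.
        qh, kh, vh = (t.transpose(1, 2) for t in (q, k, v))
        scores = qh @ kh.transpose(-1, -2) / hd ** 0.5
        mask = torch.full((seq_len, seq_len), float("-inf"), device=q.device, dtype=q.dtype).triu(1)
        return ((scores + mask).softmax(-1) @ vh).transpose(1, 2)

    def xformers_attention(q, k, v):
        return xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())

    for name, fn in [("plain", plain_attention), ("xformers", xformers_attention)]:
        fn(q, k, v)                      # warm-up
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(10):
            fn(q, k, v)
        torch.cuda.synchronize()
        print(name, f"{(time.time() - start) / 10 * 1000:.2f} ms/iter")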

sgugger commented 1 year ago

You should use the forums to help debug your code. It's not really Transformers' fault if you can't match the results of a model after changing its implementation 😅