sdan / selfextend

An implementation of Self-Extend, which expands the context window via grouped attention
https://arxiv.org/pdf/2401.01325.pdf
Apache License 2.0

Attention implementation through torch.nn.functional.scaled_dot_product_attention not supported #5

Open eightBEC opened 6 months ago

eightBEC commented 6 months ago

I followed the steps in the README and copied the three modeling files modeling_mistral.py, modeling_utils.py, and configuration_mistral.py into my transformers folders:

Target folders for the changed files:

- lib/python3.11/site-packages/transformers/
- /lib64/python3.11/site-packages/transformers
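(A sanity check I added myself, not part of the README: printing where the installed transformers package lives, and hence where its models/mistral subfolder is, shows exactly where the patched files should land.)

```python
# Sanity check (mine, not from the README): print where the installed
# transformers package lives, and hence where models/mistral is, so the
# patched files can be copied into the copy Python actually uses.
import os
import transformers

pkg_dir = os.path.dirname(transformers.__file__)
print(pkg_dir)                                     # .../site-packages/transformers
print(os.path.join(pkg_dir, "models", "mistral"))  # destination for the modeling files
```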

Clone the repository to your local machine and copy the modeling files into transformers/src/transformers/models/mistral

When initializing the weights, specify the self_extend attention mechanism like so:

```python
model = MistralForCausalLM.from_pretrained("hf_mistral-7B-v0.1", attn_implementation="self_extend")
```
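While debugging, I wrap the call above in a temporary fallback so the model stays loadable; this is just a sketch of my own, not a fix, since it gives up Self-Extend whenever the registration fails:

```python
from transformers import MistralForCausalLM

# Temporary fallback sketch (mine, not from the README): if the
# self_extend implementation is not registered, load with the stock
# default attention instead, which forgoes Self-Extend for that run.
try:
    model = MistralForCausalLM.from_pretrained(
        "hf_mistral-7B-v0.1", attn_implementation="self_extend"
    )
except ValueError:
    model = MistralForCausalLM.from_pretrained("hf_mistral-7B-v0.1")
```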

Running the model results in the following error:

```
lib64/python3.11/site-packages/transformers/modeling_utils.py", line 1491, in _check_and_enable_sdpa
    raise ValueError(
ValueError: MistralForCausalLM does not support an attention implementation through torch.nn.functional.scaled_dot_product_attention yet. Please open an issue on GitHub to request support for this architecture: https://github.com/huggingface/transformers/issues/new
```
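The check that raises here is _check_and_enable_sdpa in the stock modeling_utils.py, so a first thing to verify (my own diagnostic, not from the README) is whether Python is actually importing the patched copies of the modeling files:

```python
# Diagnostic (mine, not from the README): confirm which copies of the
# modeling files Python actually imports. If these paths point at
# unpatched files, the self_extend implementation was never registered.
import transformers.modeling_utils as mu
import transformers.models.mistral.modeling_mistral as mm

print(mu.__file__)  # should be the patched modeling_utils.py
print(mm.__file__)  # should be the patched modeling_mistral.py
```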

Versions: