Closed: Jaykumaran closed this issue 10 months ago.
Hi @Jaykumaran, thanks for raising this issue!
Could you run the following in the command line to check the version of flash-attn in your Python environment?
python -c "import flash_attn; from transformers.utils.import_utils import is_flash_attn_greater_or_equal_2_10; print(flash_attn.__version__); print(is_flash_attn_greater_or_equal_2_10())"
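For anyone diagnosing this locally, here is a slightly longer sketch of the same check that also prints the installed transformers version (it only uses the attributes and helper already shown above):

import flash_attn
import transformers
from transformers.utils.import_utils import is_flash_attn_greater_or_equal_2_10

# Print both package versions so a mismatch is easy to spot
print("flash_attn:", flash_attn.__version__)
print("transformers:", transformers.__version__)

# True means the installed flash-attn is at least version 2.1.0
print("is_flash_attn_greater_or_equal_2_10:", is_flash_attn_greater_or_equal_2_10())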
I also have this issue. When I run
python -c "import flash_attn; from transformers.utils.import_utils import is_flash_attn_greater_or_equal_2_10; print(flash_attn.__version__); print(is_flash_attn_greater_or_equal_2_10())"
the results are:
2.5.8
True
Hi @manliu1225, could you run
transformers-cli env
in the terminal and copy-paste the output?
Having the same issue. It happens at the second line. I'm running in the Google Colab environment.
model_name = "microsoft/Phi-3-vision-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
Error:
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
[<ipython-input-31-675821a17052>](https://localhost:8080/#) in <cell line: 5>()
3
4 # Load base model
----> 5 model = AutoModelForCausalLM.from_pretrained(model_name)
10 frames
[~/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-vision-128k-instruct/7b92b8c62807f5a98a9fa47cdfd4144f11fbd112/modeling_phi3_v.py](https://localhost:8080/#) in <module>
37 )
38 from transformers.modeling_utils import PreTrainedModel
---> 39 from transformers.utils import (
40 add_code_sample_docstrings,
41 add_start_docstrings,
ImportError: cannot import name 'is_flash_attn_greater_or_equal_2_10' from 'transformers.utils' (/usr/local/lib/python3.10/dist-packages/transformers/utils/__init__.py)
Hi @Srikor, could you share your running environment (run transformers-cli env in the terminal and copy-paste the output)? I'm unable to replicate this, with or without flash attention installed in my environment, when running on the development branch.
Hello @amyeroberts. I'm running this in the Google Colab free version, so I couldn't execute the command you provided in a terminal. I tried a fresh notebook, ran the minimal code below after installing the transformers and datasets packages from pip, and ran into another error related to flash attention.
Code:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    pipeline,
    logging,
)
model_name = "microsoft/Phi-3-vision-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
Error:
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
[<ipython-input-3-0986d235c6b3>](https://localhost:8080/#) in <cell line: 2>()
1 model_name = "microsoft/Phi-3-vision-128k-instruct"
----> 2 model = AutoModelForCausalLM.from_pretrained(model_name)
3 frames
[/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py](https://localhost:8080/#) in _check_and_enable_flash_attn_2(cls, config, torch_dtype, device_map, check_device_map, hard_check_only)
1569
1570 if importlib.util.find_spec("flash_attn") is None:
-> 1571 raise ImportError(f"{preface} the package flash_attn seems to be not installed. {install_message}")
1572
1573 flash_attention_version = version.parse(importlib.metadata.version("flash_attn"))
ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package flash_attn seems to be not installed. Please refer to the documentation of https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2.
---------------------------------------------------------------------------
Just curious: the documentation at https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 indicates I need to pass the attn_implementation parameter to enable flash attention, but the error says it has already been enabled.
@Srikor Thanks for your reply
I'm running this in Google Colab free version and hence couldn't execute the command you provided in a terminal.
It should still be possible, even if the Colab is free. To run a CLI command in the notebook, you need to prefix it with a !, i.e.
! transformers-cli env
but the error indicates it has been enabled.
It's been enabled because the model implementation on the Hub has a flash attention class implemented. In this case, it will automatically be selected (which is admittedly not ideal, as it can lead to unexpected behaviour).
You can select the attention implementation when instantiating the model, or in its config, by setting e.g. attn_implementation="eager" (see the sketch below).
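As a minimal sketch of that suggestion, assuming the same Phi-3-vision checkpoint used above, forcing the eager implementation would look something like this:

from transformers import AutoModelForCausalLM

model_name = "microsoft/Phi-3-vision-128k-instruct"

# Explicitly request the eager attention implementation so the remote code
# does not try to use flash attention when flash-attn is not installed
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="eager",
    trust_remote_code=True,
)

In general, attn_implementation also accepts "sdpa" and "flash_attention_2", although a remote-code model only supports the attention classes its own modeling file implements.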
Hello @amyeroberts. PFB the requested details.
transformers version: 4.41.2
Hello @amyeroberts. Really sorry, I just noticed that I have been running the model on CPU instead of GPU. I switched to GPU, enabled flash attention, and it's working fine now.
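For anyone landing here with the same symptom, a minimal sketch of the working setup being described, assuming a CUDA GPU runtime and flash-attn installed via !pip install flash-attn --no-build-isolation, might look like:

import torch
from transformers import AutoModelForCausalLM

model_name = "microsoft/Phi-3-vision-128k-instruct"

# Flash attention requires a CUDA GPU and a half-precision dtype
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)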
System Info
!pip install trl transformers==4.35.2 accelerate peft==0.6.2 -Uqqq
!pip install trl transformers accelerate peft==0.6.2 -Uqqq
!pip install datasets bitsandbytes einops wandb -Uqqq
!pip install flash-attn --no-build-isolation -Uqq
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
!pip install trl transformers==4.35.2 accelerate peft==0.6.2 -Uqqq
!pip install trl transformers accelerate peft==0.6.2 -Uqqq
!pip install datasets bitsandbytes einops wandb -Uqqq
!pip install flash-attn --no-build-isolation -Uqq
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "HuggingFaceH4/zephyr-7b-beta"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load model in 4-bit precision
    bnb_4bit_quant_type="nf4",              # pre-trained model should be quantized in 4-bit NF format
    bnb_4bit_use_double_quant=True,         # use double quantization as mentioned in the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # during computation, the pre-trained model should be loaded in BF16 format
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map=0,
    use_cache=True,
    trust_remote_code=True,
    use_flash_attention_2=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
Expected behavior
When trying to load the model, it results in the following error:
RuntimeError: Failed to import transformers.models.mistral.modeling_mistral because of the following error (look up to see its traceback): cannot import name 'is_flash_attn_greater_or_equal_2_10' from 'transformers.utils' (/usr/local/lib/python3.10/dist-packages/transformers/utils/__init__.py)
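A minimal diagnostic sketch, assuming the environment produced by the pip commands above, is to check whether the installed transformers build actually exposes the helper the traceback complains about:

import transformers

print("transformers:", transformers.__version__)

try:
    # This is the import that modeling_mistral.py fails on in the traceback above
    from transformers.utils import is_flash_attn_greater_or_equal_2_10
    print("helper found:", is_flash_attn_greater_or_equal_2_10())
except ImportError:
    print("is_flash_attn_greater_or_equal_2_10 is not exposed by this transformers version; "
          "a newer transformers release (without the ==4.35.2 pin) should provide it")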