This is a Phi-3 book for getting started with Phi-3, a family of open AI models developed by Microsoft. Phi-3 models are the most capable and cost-effective small language models (SLMs) available, outperforming models of the same size and the next size up across a variety of language, reasoning, coding, and math benchmarks.
Flash Attention supports only fp16 and bf16 data type for Phi-3-small-128K fine-tuning using QLoRA #127

Closed ArpitSharma7 closed 1 month ago

ArpitSharma7 commented 1 month ago

Please provide us with the following information:

This issue is for a: (mark with an x)

- [x] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

import os
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from transformers import Trainer
from peft import LoraConfig, PeftModel
from datasets import Dataset
import pandas as pd
from peft import prepare_model_for_kbit_training

import datasets
from datasets import Dataset
from datasets import load_dataset, concatenate_datasets
import numpy as np
import ast
import sys

import warnings
warnings.filterwarnings("ignore")

model_name = "microsoft/Phi-3-small-128k-instruct"

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# BitsAndBytesConfig int-4 config

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load model and tokenizer

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    cache_dir="/data",
    trust_remote_code=True,
)
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    cache_dir="/data",
    trust_remote_code=True,
    use_fast=True,
)
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'right'

df_raw = pd.read_csv("train.csv")
df = df_raw[['Question', 'Query', 'Schema']]

df_sorted = df.sort_values(by='Query', key=lambda x: x.str.len())

dataset = Dataset.from_pandas(df_sorted[['Question', 'Query', 'Schema']])

def prepare_dialogue_mistral(example):

    question = example["Question"]
    response = example["Query"]
    context = example['Schema']

    prompt_file = ""

    with open(prompt_file, "r") as f:
        prompt =

    prompt = prompt.format(
        user_question=question, table_metadata_string=context, sql_query=response

    return example

dataset_formatted =, num_proc=4, remove_columns=[ 'Question', 'Query','Schema'])

test_df = pd.read_csv("val.csv")
test_df = test_df[['Question', 'Query', 'Schema']]

test_dataset = Dataset.from_pandas(test_df[['Question', 'Query', 'Schema']])
test_dataset_formatted =, num_proc=4, remove_columns=[ 'Question', 'Query','Schema'])

from datasets import Dataset, DatasetDict

dataset_split = DatasetDict({"train": dataset_formatted, "test": test_dataset_formatted})

from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

peft_config = LoraConfig(
    target_modules=["query_key_value", "dense", "down_proj", "up_proj", "lm_head"],
    lora_alpha=128,
    lora_dropout=0.05,
    r=128,
    bias="none",
    task_type="CAUSAL_LM",
)

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="phi3-small_res",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    gradient_checkpointing=False,
    optim="adamw_torch",
    learning_rate=5e-06,
    bf16=True,
    max_grad_norm=0.3,
    logging_steps=560,
    evaluation_strategy="steps",
    save_strategy='epoch',
    warmup_ratio=0.01,
    lr_scheduler_type="cosine",
)

from trl import SFTTrainer

max_seq_length = 4096

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset_split['train'],
    eval_dataset=dataset_split['test'],
    data_collator=collator,
    dataset_text_field="text",
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=False,
    args=args,
)


Any log messages given by the failure

RuntimeError Traceback (most recent call last) Cell In[2], line 159
[... traceback details omitted ...]

RuntimeError: FlashAttention only support fp16 and bf16 data type
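
The message says some tensor reaching FlashAttention is neither fp16 nor bf16. One way to start narrowing this down is to count the dtypes of the loaded model's parameters. The helper below is a minimal sketch, not part of the original report. The name `summarize_dtypes` is hypothetical, and it only assumes each parameter exposes a `.dtype` attribute, as PyTorch parameters do.

```python
from collections import Counter


def summarize_dtypes(named_params):
    """Count parameters per dtype.

    `named_params` is any iterable of (name, param) pairs where each
    param has a `.dtype` attribute, e.g. `model.named_parameters()`.
    """
    counts = Counter()
    for _name, param in named_params:
        counts[str(param.dtype)] += 1
    return dict(counts)


# Usage with an already loaded model (needs torch and transformers):
# print(summarize_dtypes(model.named_parameters()))
```

Running it once right after `from_pretrained` and once after `prepare_model_for_kbit_training` would show whether any parameters changed dtype between the two steps.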

Expected/desired behavior

Training currently doesn't start; ideally it should.

OS and Version?

Linux 6.5.0-45-generic

azd version?

None, not using azd


numpy==1.26.4
torch==2.3.0
ninja==
transformers==4.41.1
bitsandbytes==0.41.3.post1
tiktoken==0.6.0
triton==2.3.0
flash-attn==2.5.8

Mention any other details that might be useful

Thanks! We'll be in touch soon.

leestott commented 1 month ago

@ArpitSharma7 can you confirm which sample from the cookbook you were using? If this is a general issue, please log it on the Hugging Face discussions.

ArpitSharma7 commented 1 month ago

@leestott It's from this notebook: Finetuning/Phi-3-finetune-qlora-python.ipynb. However, this notebook is based on Phi-3-mini; I just replaced the model with phi-3-small-128k-instruct.

skytin1004 commented 1 month ago

Hi @ArpitSharma7,

The error might be due to compatibility issues with the torch version. This problem is discussed in this issue. Have you checked if the torch version you are using is 2.2 or below?

ArpitSharma7 commented 1 month ago

@skytin1004 I have torch 2.3.0 installed because the Hugging Face model page for phi-3-small-128K lists triton 2.3.0 as a requirement, and that is compatible only with PyTorch 2.3.0. In the link you mentioned, they were using a nightly build of 2.3.0; I have the stable version. Not sure if that is the problem. Also, if you are able to run the QLoRA fine-tuning script with the Phi3-small-128K model, let me know the library versions you have, so I can confirm whether it is an environment issue.

skytin1004 commented 1 month ago

@ArpitSharma7 I've recently reconfigured my environment to use torch version 2.3.1 and confirmed that flash-attention2 works well with the following setup. If flash-attention2 still does not work properly in your environment, I recommend using eager instead.

Environment Setup:

!pip install torch==2.3.1
!pip install bitsandbytes==0.43.1
!pip install transformers==4.4.1
!pip install peft==0.12.0
!pip install accelerate==0.33.0
!pip install datasets==2.19.1
!pip install trl==0.8.6
!pip install flash_attn==2.6.3

GPU: A100

CUDA Version: 12.1.105
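
To compare a local environment against the list above, the installed versions can be printed with the standard library alone. This is a minimal sketch: the function name `installed_versions` is hypothetical, and the package list simply mirrors the pins in this comment.

```python
from importlib import metadata

# Distribution names as published on PyPI; mirrors the pins listed above.
PACKAGES = [
    "torch", "bitsandbytes", "transformers", "peft",
    "accelerate", "datasets", "trl", "flash-attn",
]


def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```

Posting this output alongside the GPU model and CUDA version makes it easier to tell an environment mismatch from a model-specific problem.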

If flash-attention2 does not work:

Use eager instead of flash-attention2 by replacing:

if torch.cuda.is_bf16_supported():
  compute_dtype = torch.bfloat16
  attn_implementation = 'flash_attention_2'

with:

if torch.cuda.is_bf16_supported():
  compute_dtype = torch.bfloat16
  attn_implementation = 'eager'

superctj commented 5 days ago
model_id = "microsoft/Phi-3-small-8k-instruct"
model = AutoModelForCausalLM.from_pretrained(

I am able to get around the error by passing in torch_dtype and attn_implementation when initiating the model (assuming using the transformers library).
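
The snippet in this comment is cut off after `from_pretrained(`, so the exact arguments used are not known. The following is only a sketch of what passing both arguments explicitly could look like. The choice of `torch.bfloat16`, the default `"flash_attention_2"`, and the wrapper name `load_model` are assumptions for illustration.

```python
def load_model(
    model_id="microsoft/Phi-3-small-8k-instruct",
    attn_implementation="flash_attention_2",
):
    """Load a causal LM with dtype and attention implementation set explicitly."""
    # Imported here so the sketch can be read or imported without a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,               # assumed: a dtype FlashAttention accepts
        attn_implementation=attn_implementation,  # assumed: requires flash-attn installed
        trust_remote_code=True,
    )
```

Calling `load_model(attn_implementation="eager")` would correspond to the fallback suggested earlier in the thread.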

skytin1004 commented 3 days ago
model_id = "microsoft/Phi-3-small-8k-instruct"
model = AutoModelForCausalLM.from_pretrained(

I am able to get around the error by passing in torch_dtype and attn_implementation when initiating the model (assuming using the transformers library).

Hi @superctj,

Thank you for sharing your solution. Currently, the guide uses the following logic to determine which type and attention implementation to use:

if torch.cuda.is_bf16_supported():
    compute_dtype = torch.bfloat16
    attn_implementation = 'flash_attention_2'
else:
    compute_dtype = torch.float16
    attn_implementation = 'sdpa'

This logic checks if the GPU supports bfloat16 and sets the attention implementation to flash_attention_2 accordingly. If bfloat16 is not supported, it falls back to using float16 and sdpa.
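
One possible refinement, offered as a sketch and not as the guide's actual code, is to also check whether the `flash_attn` package is importable before selecting it. The function name `choose_attention` is hypothetical. It returns plain strings so that it can be exercised without a GPU, and it does not check whether a given model's remote code accepts each attention value.

```python
import importlib.util


def choose_attention(bf16_supported, flash_attn_installed=None):
    """Return (dtype_name, attn_implementation) as strings.

    bf16_supported: e.g. the result of torch.cuda.is_bf16_supported().
    flash_attn_installed: detected automatically when left as None.
    """
    if flash_attn_installed is None:
        flash_attn_installed = importlib.util.find_spec("flash_attn") is not None

    dtype_name = "bfloat16" if bf16_supported else "float16"
    if bf16_supported and flash_attn_installed:
        return dtype_name, "flash_attention_2"
    # Same fallback as the guide's logic: use sdpa when flash attention is not chosen.
    return dtype_name, "sdpa"


# Usage (needs torch):
# dtype_name, attn_implementation = choose_attention(torch.cuda.is_bf16_supported())
# compute_dtype = getattr(torch, dtype_name)
```

The only difference from the logic above is the extra case of a bf16-capable GPU without `flash_attn` installed, which falls back to `sdpa`.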

It seems that passing torch_dtype and attn_implementation directly when initializing the model works well in your case. Do you have any recommendations for improving the current logic, especially for handling different GPU configurations?

superctj commented 2 days ago

Hi @skytin1004, I had problems running Phi-3-small-8k-instruct and Phi-3.5-mini-instruct from the transformers library on an A40 GPU (see a similar issue here). After fixing this error, I saw warnings about numeric differences when not using flash attention, so I followed the instructions in the Hugging Face documentation to enable flash attention.