huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

PhiForCausalLM does not support Flash Attention 2.0 #28381

Closed gmittal closed 9 months ago

gmittal commented 9 months ago
import torch
from transformers import AutoModelForCausalLM, AutoModel

model = AutoModelForCausalLM.from_pretrained(
    'microsoft/phi-2',
    use_flash_attention_2=True,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

Throws:

ValueError: PhiForCausalLM does not support Flash Attention 2.0 yet. Please open an issue on GitHub to request support for this architecture: https://github.com/huggingface/transformers/issues/new
rootonchair commented 9 months ago

Hi, I would like to work on this issue

NielsRogge commented 9 months ago

Support for Phi-2 is still WIP, you can follow the progress here: https://github.com/huggingface/transformers/pull/28163

susnato commented 9 months ago

Hi @gmittal, Flash Attention is already implemented for Phi (see the linked PR).

It seems that you are using the hub version of phi-2. Please use it from the library to properly enable Flash Attention. For now, microsoft/phi-2 does not have the weights in the correct order for the library model, so please use susnato/phi-2 instead.

First update to the latest transformers version -

pip install -U transformers

then run -

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("susnato/phi-2", 
    use_flash_attention_2=True, 
    torch_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained("susnato/phi-2")

inputs = tokenizer('''def print_prime(n):
   """
   Print all primes between 1 and n
   """''', return_tensors="pt", return_attention_mask=False)

outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print(text)
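
If you want to verify that the Flash Attention 2 path was actually picked up, a quick sanity check (a minimal sketch, assuming a transformers version recent enough to record the chosen attention backend on the model config) is:

print(model.config._attn_implementation)  # expected to print "flash_attention_2"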

Let me know if this works or not.

nakranivaibhav commented 9 months ago

I would like to work on this issue

NicolasMejiaPetit commented 9 months ago

Using the HF alignment notebook, the DPO script gives me this error regardless of the transformers version (I already force-updated with pip). When I remove flash attention from the YAML, it works (after a bit of code adjustment). The strange part is that I am able to fine-tune with one of my SFT scripts that uses flash attention.

gugarosa commented 9 months ago

Hello everyone!

This should be fixed in transformers 4.37.0.dev. If not using that version, please make sure that trust_remote_code=True when loading the model and it should work out-of-the-box with flash-attention 2.
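
For reference, a minimal sketch of the loading call described above (attn_implementation is the newer API; older releases used the use_flash_attention_2=True flag instead):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # requires the flash-attn package to be installed
)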

NielsRogge commented 9 months ago

Thanks! Closing as this was fixed in https://github.com/huggingface/transformers/pull/28163

NicolasMejiaPetit commented 9 months ago

I installed from source, so I am now on transformers 4.37.0.dev0, and I am still getting the incompatibility error, even with trust_remote_code set to True.

C:\Users\PC\Documents\Code-Trainer\FineTune>py FINETUNERphiFP16.py --model_name_or_path C:\Users\PC\Documents\NEWGEN\text-generation-webui-main\models\dolphin-2_6-phi-2 --data_path MiniCoderW.json --output_dir C:\Users\PC\Documents\NEWGEN\text-generation-webui-main\models\TrainedPhi --num_train_epochs 3 --model_max_length 1024 --per_device_train_batch_size 1 --evaluation_strategy "no" --save_strategy "steps" --save_steps 1000 --save_total_limit 10 --learning_rate 2e-5 --warmup_steps 10 --logging_steps 10 --lr_scheduler_type "cosine" --report_to "tensorboard" --bf16 False --dataloader_num_workers 12 --optim paged_adamw_8bit

WARNING:tensorflow:From C:\Python311\Lib\site-packages\keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.

==================================================================================================== TrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=False, bf16_full_eval=False, cache_dir=None, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=12, dataloader_persistent_workers=False, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=None, disable_tqdm=False, dispatch_batches=None, do_eval=False, do_predict=False, do_train=False, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=IntervalStrategy.NO, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=1, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=HubStrategy.EVERY_SAVE, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=2e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=C:\Users\PC\Documents\NEWGEN\text-generation-webui-main\models\TrainedPhi\runs\Jan12_23-36-31_Nicolas, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=10, logging_strategy=IntervalStrategy.STEPS, lr_scheduler_kwargs={}, lr_scheduler_type=SchedulerType.COSINE, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, model_max_length=1024, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=3.0, optim=OptimizerNames.PAGED_ADAMW_8BIT, optim_args=None, output_dir=C:\Users\PC\Documents\NEWGEN\text-generation-webui-main\models\TrainedPhi, overwrite_output_dir=False, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=1, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], resume_from_checkpoint=None, run_name=C:\Users\PC\Documents\NEWGEN\text-generation-webui-main\models\TrainedPhi, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=1000, save_strategy=IntervalStrategy.STEPS, save_total_limit=10, seed=42, skip_memory_metrics=True, split_batches=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.0, warmup_steps=10, weight_decay=0.0, ) Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. PAD Token: <|endoftext|> 50256 BOS Token <|endoftext|> 50256 EOS Token <|im_end|> 50295 Load tokenizer from C:\Users\PC\Documents\NEWGEN\text-generation-webui-main\models\dolphin-2_6-phi-2 over. 
Traceback (most recent call last):
  File "C:\Users\PC\Documents\Code-Trainer\FineTune\FINETUNERphiFP16.py", line 192, in <module>
    train()
  File "C:\Users\PC\Documents\Code-Trainer\FineTune\FINETUNERphiFP16.py", line 145, in train
    model = transformers.AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\transformers\models\auto\auto_factory.py", line 561, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\transformers\modeling_utils.py", line 3497, in from_pretrained
    config = cls._autoset_attn_implementation(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\transformers\modeling_utils.py", line 1340, in _autoset_attn_implementation
    cls._check_and_enable_flash_attn_2(
  File "C:\Python311\Lib\site-packages\transformers\modeling_utils.py", line 1420, in _check_and_enable_flash_attn_2
    raise ValueError(
ValueError: PhiForCausalLM does not support Flash Attention 2.0 yet. Please request to add support where the model is hosted, on its model hub page: https://huggingface.co/C:\Users\PC\Documents\NEWGEN\text-generation-webui-main\models\dolphin-2_6-phi-2/discussions/new or in the Transformers GitHub repo: https://github.com/huggingface/transformers/issues/new

Here is the script I am using:

import copy
import random
from dataclasses import dataclass, field
from typing import Optional, Dict, Sequence

import torch
import transformers
from transformers import Trainer
from datasets import load_dataset

IGNORE_INDEX = -100
EOT_TOKEN = "<|EOT|>"

def build_instruction_prompt(instruction: str):
    return '''
You are an AI programming assistant, utilizing the DeepSeek Coder model, developed by DeepSeek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.
### Instruction:
{}
### Response:
'''.format(instruction.strip()).lstrip()

@dataclass
class ModelArguments:
    model_name_or_path: Optional[str] = field(default="deepseek-ai/deepseek-coder-6.7b-instruct")

@dataclass
class DataArguments:
    data_path: str = field(default=None, metadata={"help": "Path to the training data."})

@dataclass
class TrainingArguments(transformers.TrainingArguments):
    cache_dir: Optional[str] = field(default=None)
    optim: str = field(default="adamw_torch")
    model_max_length: int = field(
        default=512,
        metadata={"help": "Maximum sequence length. Sequences will be right padded (and possibly truncated)."},
    )

def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: str):
    """Collects the state dict and dump to disk."""
    state_dict = trainer.model.state_dict()
    if trainer.args.should_save:
        cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
        del state_dict
        trainer._save(output_dir, state_dict=cpu_state_dict)  # noqa

def _tokenize_fn(strings: Sequence[str], tokenizer: transformers.PreTrainedTokenizer) -> Dict:
    """Tokenize a list of strings."""
    tokenized_list = [
        tokenizer(
            text,
            return_tensors="pt",
            padding="longest",
            max_length=tokenizer.model_max_length,
            truncation=True,
        )
        for text in strings
    ]

    input_ids = labels = [tokenized.input_ids[0] for tokenized in tokenized_list]
    input_ids_lens = labels_lens = [
        tokenized.input_ids.ne(tokenizer.pad_token_id).sum().item() for tokenized in tokenized_list
    ]

    return dict(
        input_ids=input_ids,
        labels=labels,
        input_ids_lens=input_ids_lens,
        labels_lens=labels_lens,
    )

def preprocess(
    sources: Sequence[str],
    targets: Sequence[str],
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    """Preprocess the data by tokenizing."""
    examples = [s + t for s, t in zip(sources, targets)]
    examples_tokenized, sources_tokenized = [_tokenize_fn(strings, tokenizer) for strings in (examples, sources)]
    input_ids = examples_tokenized["input_ids"]

    labels = copy.deepcopy(input_ids)
    for label, source_len in zip(labels, sources_tokenized["input_ids_lens"]):
        label[:source_len] = IGNORE_INDEX
    return dict(input_ids=input_ids, labels=labels)

@dataclass
class DataCollatorForSupervisedDataset(object):
    """Collate examples for supervised fine-tuning."""
    tokenizer: transformers.PreTrainedTokenizer

    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        input_ids, labels = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels"))
        input_ids = [torch.tensor(x) for x in input_ids]
        input_ids = torch.nn.utils.rnn.pad_sequence(
            input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id
        )
        labels = [torch.tensor(x) for x in labels]
        labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX)

        return dict(
            input_ids=input_ids,
            labels=labels,
            attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
        )

def train_tokenize_function(examples, tokenizer):
    sources = [
        build_instruction_prompt(instruction)
        for instruction in examples['instruction']
    ]
    targets = [f"{output}\n{EOT_TOKEN}" for output in examples['output']]
    data_dict = preprocess(sources, targets, tokenizer)
    return data_dict

def train():
    parser = transformers.HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()

    if training_args.local_rank == 0:
        print('='*100)
        print(training_args)

    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_args.model_name_or_path,
        model_max_length=training_args.model_max_length,
        padding_side="right",
        use_fast=True,
        trust_remote_code=True
    )
    if tokenizer.pad_token is None:
        tokenizer.add_special_tokens({'pad_token': '[PAD]'})

    print("PAD Token:", tokenizer.pad_token, tokenizer.pad_token_id)
    print("BOS Token", tokenizer.bos_token, tokenizer.bos_token_id)
    print("EOS Token", tokenizer.eos_token, tokenizer.eos_token_id)

    if training_args.local_rank == 0:
        print("Load tokenizer from {} over.".format(model_args.model_name_or_path))

    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        attn_implementation="flash_attention_2",
    )

    if training_args.local_rank == 0:
        print("Load model from {} over.".format(model_args.model_name_or_path))

    raw_train_datasets = load_dataset(
        'json',
        data_files=data_args.data_path,
        split="train",
        cache_dir=training_args.cache_dir
    )

    train_dataset = raw_train_datasets.map(
        train_tokenize_function,
        batched=True,
        batch_size=3000,
        num_proc=32,
        remove_columns=raw_train_datasets.column_names,
        load_from_cache_file=True,  # not args.overwrite_cache
        desc="Running Encoding",
        fn_kwargs={"tokenizer": tokenizer}
    )

    if training_args.local_rank == 0:
        print("Training dataset samples:", len(train_dataset))
        for index in random.sample(range(len(train_dataset)), 3):
            print(f"Sample {index} of the training set: {train_dataset[index]['input_ids']}, {train_dataset[index]['labels']}.")
            print(f"Sample {index} of the training set: {tokenizer.decode(list(train_dataset[index]['input_ids']))}.")

    data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)
    data_module = dict(train_dataset=train_dataset, eval_dataset=None, data_collator=data_collator)

    trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)

    trainer.train()
    trainer.save_state()
    safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)

if __name__ == "__main__":
    train()

NielsRogge commented 9 months ago

Hi @NickWithBotronics, if you set trust_remote_code=True, then the code from the hub is used (in the case of microsoft/phi-2, that's defined here), rather than the modeling_phi.py defined natively in the Transformers library.

Hence it's recommended to convert the weights from the microsoft/phi-2 repo to a native one, which will work with Flash Attention 2. One can leverage the conversion script for that.

@ArthurZucker should we host the converted phi-2 weights as part of the Microsoft organization? Because currently one will get a lot of mismatched keys when doing the following:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'microsoft/phi-2',
    use_flash_attention_2=True,
    torch_dtype=torch.bfloat16,
)

due to the model in Transformers using a single matrix for queries, keys and values, whereas the code on the hub uses separate matrices.
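
To illustrate why the keys mismatch, here is a schematic sketch (not the actual conversion script; the tensor names, shapes, and chunking order are only illustrative): one layout stores a single fused projection for queries, keys and values, while the other stores three separate projections, so the weights have to be re-sliced and renamed before one state dict can populate the other model:

import torch

hidden_size = 2560  # phi-2 hidden size
# Hypothetical fused layout: Q, K and V stacked along the output dimension.
fused_qkv_weight = torch.randn(3 * hidden_size, hidden_size)
# Re-slice into the three separate matrices a split layout expects.
q_weight, k_weight, v_weight = fused_qkv_weight.chunk(3, dim=0)
print(q_weight.shape, k_weight.shape, v_weight.shape)  # each (2560, 2560)

A weight-conversion step has to perform this kind of remapping (plus key renaming), which is why loading the unconverted checkpoint produces mismatched keys.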

NicolasMejiaPetit commented 9 months ago

Thank you <3 !!!! That fixed that error (using the new modeling.py and converted HF format). Now onto a new error, which I think is due to my script. :(

C:\Python311\Lib\site-packages\torch\_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
Traceback (most recent call last):
  File "C:\Users\PC\Documents\Code-Trainer\FineTune\FINETUNERphiFP16.py", line 192, in <module>
    train()
  File "C:\Users\PC\Documents\Code-Trainer\FineTune\FINETUNERphiFP16.py", line 145, in train
    model = transformers.AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\transformers\models\auto\auto_factory.py", line 561, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\transformers\modeling_utils.py", line 3503, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\PC\.cache\huggingface\modules\transformers_modules\MiniPhi\modeling_phi.py", line 967, in __init__
    self.model = PhiModel(config)
                 ^^^^^^^^^^^^^^^^
  File "C:\Users\PC\.cache\huggingface\modules\transformers_modules\MiniPhi\modeling_phi.py", line 821, in __init__
    [PhiDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
  File "C:\Users\PC\.cache\huggingface\modules\transformers_modules\MiniPhi\modeling_phi.py", line 821, in <listcomp>
    [PhiDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\PC\.cache\huggingface\modules\transformers_modules\MiniPhi\modeling_phi.py", line 629, in __init__
    self.self_attn = PHI_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx=layer_idx)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\PC\.cache\huggingface\modules\transformers_modules\MiniPhi\modeling_phi.py", line 412, in __init__
    super().__init__(*args, **kwargs)
  File "C:\Users\PC\.cache\huggingface\modules\transformers_modules\MiniPhi\modeling_phi.py", line 245, in __init__
    self.attention_dropout = config.attention_dropout
                             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\transformers\configuration_utils.py", line 265, in __getattribute__
    return super().__getattribute__(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'PhiConfig' object has no attribute 'attention_dropout'

C:\Users\PC\Documents\Code-Trainer\FineTune>

Edit: fixed it by downloading the latest generation_config.json, config.json, configuration_phi.py, and modeling_phi.py.

NicolasMejiaPetit commented 9 months ago

While I got it working, the training loss was way off. It started at 6 and went down to 2 (after 3 epochs), but when I used the old config without flash attention it went from 0.6 to ~0.29 (also 3 epochs) with the same dataset, same setup, and same model. The only differences were the config files and flash attention. I saw someone else report the same thing on Twitter.

ArthurZucker commented 9 months ago

Can you open a separate issue for this, with a reproducible snippet?
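
For reference, a minimal sketch of what such a snippet could look like (the checkpoint name and prompt are placeholders): it compares the forward loss of the same checkpoint under eager attention and Flash Attention 2 on identical inputs.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "microsoft/phi-2"  # placeholder; use the checkpoint that shows the issue
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer("def add(a, b):\n    return a + b", return_tensors="pt").to("cuda")

for impl in ("eager", "flash_attention_2"):
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        torch_dtype=torch.bfloat16,
        attn_implementation=impl,
    ).to("cuda")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    print(impl, loss.item())  # the two losses should roughly match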

NicolasMejiaPetit commented 9 months ago

Gotcha, I’ll move to this ticket #28488