huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Load Phi3 small 8k Instruct without Flash attention #32365

Closed BigDataMLexplorer closed 1 month ago

BigDataMLexplorer commented 3 months ago

System Info

Phi3 small, flash attention, GPU

Who can help?

@ArthurZucker @muellerzr @stevhliu

Reproduction

Hi, I'm trying to load the Phi-3 small 8k Instruct model. Link: https://huggingface.co/microsoft/Phi-3-small-8k-instruct

I want to use it further for fine-tuning, but I can't load it with flash attention because I have V100 graphics cards and those are not supported by flash attention. So I am trying to load it without flash attention using attn_implementation="eager":

model = AutoModelForSequenceClassification.from_pretrained("path", num_labels=num_labels, attn_implementation="eager")

But I still get this error:

AssertionError: Flash Attention is not available, but is needed for dense attention

Is there any way to load the model without flash attention?

I am using the latest version of transformers, 4.43.3.
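
As a quick sanity check of the V100 limitation: FlashAttention 2 only supports Ampere or newer GPUs (compute capability 8.0+), and a V100 reports 7.0. A minimal sketch, assuming PyTorch with CUDA is installed:

import torch

# V100 reports compute capability 7.0; FlashAttention 2 needs 8.0 (Ampere) or newer.
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability {major}.{minor}, flash-attn 2 usable: {major >= 8}")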

zucchini-nlp commented 3 months ago

Hey! Looks like you have trust_remote_code=True when loading the model and are hitting the error here. Can you try loading with:

model = AutoModelForSequenceClassification.from_pretrained("path", num_labels=num_labels, attn_implementation="eager", trust_remote_code=False)

BigDataMLexplorer commented 3 months ago

Hey! Looks like you have trust_remote_code=True when loading the model and are hitting the error here. Can you try loading with:

model = AutoModelForSequenceClassification.from_pretrained("path", num_labels=num_labels, attn_implementation="eager", trust_remote_code=False)

Thanks for the reply, but that doesn't help. I have the code stored in my repository. I work on a server without internet access and I have to load the model locally. I keep running into the same error:

model = AutoModelForSequenceClassification.from_pretrained("path", num_labels=num_labels, attn_implementation="eager", trust_remote_code=True)

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[4], line 2
      1 num_labels = 11  
----> 2 model = AutoModelForSequenceClassification.from_pretrained("Phi3small8k",num_labels=num_labels,attn_implementation= "eager", trust_remote_code=True)#,device_map="auto")

File ~/env/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py:559, in _BaseAutoModelClass.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    557     else:
    558         cls.register(config.__class__, model_class, exist_ok=True)
--> 559     return model_class.from_pretrained(
    560         pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
    561     )
    562 elif type(config) in cls._model_mapping.keys():
    563     model_class = _get_model_class(config, cls._model_mapping)

File ~/env/lib/python3.9/site-packages/transformers/modeling_utils.py:3788, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
   3782 config = cls._autoset_attn_implementation(
   3783     config, use_flash_attention_2=use_flash_attention_2, torch_dtype=torch_dtype, device_map=device_map
   3784 )
   3786 with ContextManagers(init_contexts):
   3787     # Let's make sure we don't run the init function of buffer modules
-> 3788     model = cls(config, *model_args, **model_kwargs)
   3790 # make sure we use the model's config since the __init__ call might have copied it
   3791 config = model.config

File ~/.cache/huggingface/modules/transformers_modules/Phi3small8k/modeling_phi3_small.py:1045, in Phi3SmallForSequenceClassification.__init__(self, config)
   1043 super().__init__(config)
   1044 self.num_labels = config.num_labels
-> 1045 self.model = Phi3SmallModel(config)
   1046 self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
   1048 # Initialize weights and apply final processing

File ~/.cache/huggingface/modules/transformers_modules/Phi3small8k/modeling_phi3_small.py:745, in Phi3SmallModel.__init__(self, config)
    742 # MuP Embedding scaling
    743 self.mup_embedding_multiplier = config.mup_embedding_multiplier
--> 745 self.layers = nn.ModuleList([Phi3SmallDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)])
    747 self.final_layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_epsilon)
    749 self.gradient_checkpointing = False

File ~/.cache/huggingface/modules/transformers_modules/Phi3small8k/modeling_phi3_small.py:745, in <listcomp>(.0)
    742 # MuP Embedding scaling
    743 self.mup_embedding_multiplier = config.mup_embedding_multiplier
--> 745 self.layers = nn.ModuleList([Phi3SmallDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)])
    747 self.final_layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_epsilon)
    749 self.gradient_checkpointing = False

File ~/.cache/huggingface/modules/transformers_modules/Phi3small8k/modeling_phi3_small.py:651, in Phi3SmallDecoderLayer.__init__(self, config, layer_idx)
    649 super().__init__()
    650 self.hidden_size = config.hidden_size
--> 651 self.self_attn = Phi3SmallSelfAttention(config, layer_idx)
    652 self.mlp = Phi3SmallMLP(config)
    654 self.input_layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_epsilon)

File ~/.cache/huggingface/modules/transformers_modules/Phi3small8k/modeling_phi3_small.py:218, in Phi3SmallSelfAttention.__init__(self, config, layer_idx)
    213 if self.config.dense_attention_every_n_layers and ((self.layer_idx + 1) % self.config.dense_attention_every_n_layers == 0):
    214     logger.info(
    215         f"Layer {layer_idx + 1} is using dense attention since it is divisible by "
    216         f"{self.config.dense_attention_every_n_layers}"
    217     )
--> 218     assert is_flash_attention_available, "Flash Attention is not available, but is needed for dense attention"
    219 else:
    220     # BlockSparse related Parameters
    221     self.blocksparse_params = BlockSparseParams.from_config(config)

AssertionError: Flash Attention is not available, but is needed for dense attention
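
For context on where this assertion comes from: the custom Phi-3-small modeling code in the traceback uses blocksparse attention for most layers but requires FlashAttention for its periodic dense-attention layers, so it fails even with attn_implementation="eager". A minimal way to check whether the package it looks for is importable (assuming the flash_attn package name, which is what transformers itself probes for):

import importlib.util

# The remote Phi-3-small code asserts on FlashAttention availability for its dense layers,
# independent of the attn_implementation argument passed to from_pretrained.
print("flash_attn importable:", importlib.util.find_spec("flash_attn") is not None)
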
amyeroberts commented 3 months ago

cc @Rocketknight1 As you were looking into the issue of conditional imports and the Hub - looks like it's starting to block quite a few users.

Rocketknight1 commented 3 months ago

Yeah, understood - I'll try to get to it as soon as I can finally finish the tool use project!

BigDataMLexplorer commented 3 months ago

@amyeroberts @Rocketknight1 Does this mean that there is currently some bug on the transformers side and this code should work under normal conditions? If so, when can I expect it to be fixed? Thank you

BigDataMLexplorer commented 3 months ago

Yeah, understood - I'll try to get to it as soon as I can finally finish the tool use project!

@Rocketknight1 Could you please describe this problem in more detail? Is the problem on the transformers side, and should the model load without flash attention? Does this mean there is currently a bug in transformers and this code should work under normal conditions? If so, when can I expect it to be fixed? Thank you

zucchini-nlp commented 3 months ago

@BigDataMLexplorer I did some exploration into why we can't load this with transformers' own code and have to use trust_remote_code=True. See https://github.com/huggingface/transformers/issues/32243#issuecomment-2268285247.

TL;DR: Phi3 small is a bit different from the mini/medium models and thus is not yet ported to transformers. That's why the above code loads modeling code from the Hub, which is not maintained by us. We'll consider shipping Phi-3 small to transformers.
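
One way to see this locally is to look at the checkpoint's config.json: an auto_map entry means the model class lives in custom code shipped with the checkpoint (the modeling_phi3_small module from the traceback) rather than inside transformers. A small sketch, assuming the local folder name Phi3small8k used above:

import json

# An "auto_map" entry in the config points at custom code bundled with the checkpoint,
# which is why loading it requires trust_remote_code=True.
with open("Phi3small8k/config.json") as f:
    cfg = json.load(f)
print(cfg.get("auto_map"))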

BigDataMLexplorer commented 3 months ago

@BigDataMLexplorer I did some exploration into why we can't load this with transformers' own code and have to use trust_remote_code=True. See #32243 (comment).

TL;DR: Phi3 small is a bit different from the mini/medium models and thus is not yet ported to transformers. That's why the above code loads modeling code from the Hub, which is not maintained by us. We'll consider shipping Phi-3 small to transformers.

@zucchini-nlp Does that mean that if I download the Phi3 14b medium model locally, I shouldn't have the same problem as with the small model?

zucchini-nlp commented 3 months ago

@BigDataMLexplorer For Phi-3 medium you should be able to switch attention to "sdpa" or "eager", as it is supported natively by transformers. Just make sure to set trust_remote_code=False.
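
For example, something along these lines should work (a sketch; the checkpoint name microsoft/Phi-3-medium-4k-instruct is an assumption, and num_labels=11 is taken from the traceback above - substitute your local path when working offline):

from transformers import AutoModelForSequenceClassification

# Phi-3 medium uses the Phi3 architecture that ships with transformers,
# so no remote code is needed and the attention backend can be switched.
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/Phi-3-medium-4k-instruct",  # assumed checkpoint id; use a local path offline
    num_labels=11,
    attn_implementation="eager",  # or "sdpa"
    trust_remote_code=False,
)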

BigDataMLexplorer commented 3 months ago

Thank you, the medium version works fine :)

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.