HarikrishnanK9 closed this issue 1 month ago.
Hi @HarikrishnanK9, Thank you for bringing up this issue. The warning message indicates that you are not running the flash-attention implementation, which may result in numerical differences. However, I want to assure you that this does not affect the actual fine-tuning process. Using flash-attention can provide certain performance benefits, but it is not essential for fine-tuning.
Some tutorials use other methods, such as eager attention instead of flash-attention, which can trigger the warning you mentioned. Again, this warning does not affect the fine-tuning process itself.
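For example, here is a minimal sketch of how the warning arises (assuming the standard transformers loader; "eager" just makes the default attention path explicit):

import torch
from transformers import AutoModelForCausalLM

# Loading without flash-attn falls back to eager attention, which is what
# triggers the "not running the flash-attention implementation" warning.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="eager",
)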
@HarikrishnanK9 If you wish to use "flash_attention_2," you can install the flash-attn package by running the following command (if the build fails, the flash-attn README recommends adding --no-build-isolation):

pip install flash-attn
Then, update the model configuration as shown below:
import torch

model_kwargs = {
    "use_cache": False,
    "trust_remote_code": True,
    "torch_dtype": torch.bfloat16,
    "device_map": None,
    "attn_implementation": "flash_attention_2",
}
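For reference, a minimal sketch of how these kwargs are typically passed to the loader (assuming you load the model with transformers' AutoModelForCausalLM; the model ID matches the Phi-3 checkpoint from this thread):

from transformers import AutoModelForCausalLM

# Unpack the kwargs dict directly into from_pretrained at load time.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    **model_kwargs,
)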
Please note that "flash_attention_2" is only available on certain GPUs: FlashAttention-2 requires an Ampere, Ada, or Hopper NVIDIA GPU (compute capability 8.0 or higher), such as the A100 or H100.
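If you are unsure whether your GPU qualifies, a quick check (assuming PyTorch with CUDA available):

import torch

# FlashAttention-2 needs compute capability 8.0+ (Ampere, Ada, or Hopper,
# e.g. A100, RTX 4090, H100).
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability {major}.{minor}:",
      "supported" if major >= 8 else "not supported")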
For more information, you may find these documents helpful as they describe the fine-tuning process using flash_attention:
Thank you @skytin1004, the issue is resolved. Setting "attn_implementation": "flash_attention_2" in model_kwargs worked for me.
I just got the error:

The following `model_kwargs` are not used by the model: ['attn_implementation'] (note: typos in the generate arguments will also show up in this list)
@skytin1004 Any idea why? I am running the inference API on an NVIDIA A100.
WARNING:transformers_modules.microsoft.Phi-3-mini-4k-instruct.c1358f8a35e6d2af81890deffbbfa575b978c62f.modeling_phi3:You are not running the flash-attention implementation, expect numerical differences.
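In case it helps others hitting the same error: this message typically appears when attn_implementation is forwarded to generate() or a pipeline as a generation kwarg instead of being given to from_pretrained() at load time. A minimal sketch of passing it at load time (assuming the standard transformers API; the prompt and max_new_tokens are placeholder examples):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"

# attn_implementation belongs to the model loader; generate() rejects
# kwargs the model does not consume, producing the error above.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))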