devang-choudhary opened this issue 1 month ago
Probably similar to (the same error regarding Flash Attention).
@ArthurZucker @zucchini-nlp @fxmarty @amyeroberts, can you please look into this issue and share your comments?
Hey @devang-choudhary !
A similar issue was reported at https://github.com/huggingface/transformers/issues/32365. I dug in a bit: Phi3-small models are not natively supported in Transformers because their implementation is slightly different from the mini/medium series. So the error you're seeing comes from code on the Hub, and it seems you are unable to import flash_attn. FA2 requires CUDA and particular hardware to be installed/run properly (see here and here).
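Since the flash_attn package is CUDA-only, one way around this is to pick the attention backend at load time. A minimal sketch (pick_attn_implementation is a hypothetical helper; attn_implementation is the real from_pretrained keyword Transformers accepts):

```python
import importlib.util

def pick_attn_implementation(flash_attn_available: bool) -> str:
    """Choose an attn_implementation string for from_pretrained().

    "flash_attention_2" needs the CUDA-only flash_attn package;
    fall back to "eager" everywhere else (CPU, unsupported GPUs).
    """
    return "flash_attention_2" if flash_attn_available else "eager"

# Detect whether the flash_attn package can be imported at all.
has_flash_attn = importlib.util.find_spec("flash_attn") is not None
impl = pick_attn_implementation(has_flash_attn)
print(impl)

# Usage sketch (not run here):
# model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation=impl)
```

Note this only helps for models whose Hub code honours the flag; Phi3-small's remote code imports flash_attn unconditionally, which is exactly the problem here.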
Also, imo we should add native support for Phi3-small, but I'm not sure if anyone is already working on it. It would be nice if we could make it work without relying on FA2. cc @ArthurZucker for that
I think it is supported; it's just that their checkpoints are not in the correct format. I don't remember the details because they are a bit messy, and they don't seem willing to integrate it natively 😓
For Phi3-small it is also the code: they interleave two kinds of dense/sparse attention and change the activation function. Sad to hear they don't want to contribute; should we work on the integration ourselves then?
It seems to be asked about a bit, but I am not entirely sure; we can open an issue for a community contribution!
Agreed, it is too much for the community. I meant we can work on it if it's asked for a lot, but we will be slow.
1) microsoft/Phi-3-small-128k-instruct is not running on CPU (Ice Lake or Graviton). The script I was using:
Error I got on Graviton:
Error on Ice Lake:
2) microsoft/Phi-3-mini-128k-instruct and the other mini models do run on CPU, but they show the warning "You are not running the flash-attention implementation, expect numerical differences."
However, a flash-attention implementation for CPU is available in PyTorch's native ATen code. Is there a flag I need to enable in order to use flash attention on CPU?
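For what it's worth, the flash_attn pip package (what the Hub code tries to import) is a separate CUDA-only project, distinct from PyTorch's built-in scaled_dot_product_attention kernels, which do run on CPU. Transformers routes through those when you pass attn_implementation="sdpa" to from_pretrained, so you likely don't need a special flag. A small sketch showing the SDPA call itself works on CPU (the tensor shapes are made up for illustration):

```python
import torch
import torch.nn.functional as F

# Toy shapes: (batch, num_heads, seq_len, head_dim)
q = torch.randn(1, 4, 16, 32)
k = torch.randn(1, 4, 16, 32)
v = torch.randn(1, 4, 16, 32)

# Runs on plain CPU tensors; PyTorch dispatches to its native ATen
# attention kernels under the hood, no flash_attn package required.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(tuple(out.shape))  # (1, 4, 16, 32)
```

The warning from the mini models is then expected and benign: it only says the flash_attn CUDA path is not in use, not that attention is broken.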