devang-choudhary opened this issue 1 month ago
Probably similar to (the same error regarding Flash Attention).
@ArthurZucker @zucchini-nlp @fxmarty @amyeroberts, can you please look into this issue and share your comments?
Hey @devang-choudhary !
A similar issue was reported at https://github.com/huggingface/transformers/issues/32365. I dug in a bit: Phi3-small models are not natively supported in Transformers because their implementation is slightly different from the mini/medium series. So the error you're seeing comes from code on the Hub, and it seems you are unable to import flash_attn. FA2 requires CUDA and particular hardware to be installed/run properly (see here and here).
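Since the flash_attn package is CUDA-only, one way around this is to pick the attention backend at load time. A minimal sketch (pick_attn_implementation is a hypothetical helper; attn_implementation is the real from_pretrained keyword Transformers accepts):

```python
import importlib.util

def pick_attn_implementation(flash_attn_available: bool) -> str:
    """Choose an attn_implementation string for from_pretrained().

    "flash_attention_2" needs the CUDA-only flash_attn package;
    fall back to "eager" everywhere else (CPU, unsupported GPUs).
    """
    return "flash_attention_2" if flash_attn_available else "eager"

# Detect whether the flash_attn package can be imported at all.
has_flash_attn = importlib.util.find_spec("flash_attn") is not None
impl = pick_attn_implementation(has_flash_attn)
print(impl)

# Usage sketch (not run here):
# model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation=impl)
```

Note this only helps for models whose Hub code honours the flag; Phi3-small's remote code imports flash_attn unconditionally, which is exactly the problem here.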
Also, imo we should add native support for Phi3-small, but I'm not sure if anyone is already working on it. It would be nice if we could make it work without relying on FA2. cc @ArthurZucker for that
I think it is supported; it's just that their checkpoints are not in the correct format. I don't remember the details because they are a bit messy, and they don't seem willing to integrate it natively 😓
For Phi3-small it is also the code: they interleave two kinds of dense/sparse attention and change the activation function. Sad to hear they don't want to contribute; should we work on the integration ourselves then?
It seems to be asked about a bit, but I am not entirely sure; we can open an issue for a community contribution!
Agreed, it is too much for the community. I meant we can work on it if it's asked for a lot, but we will be slow.
1) microsoft/Phi-3-small-128k-instruct is not running on CPU (Ice Lake or Graviton). The script I was using:
Error I got on Graviton:
Error on Ice Lake:
2) microsoft/Phi-3-mini-128k-instruct and the other mini models do run on CPU, but they show the warning "You are not running the flash-attention implementation, expect numerical differences."
However, a flash-attention implementation for CPU is available in PyTorch's native ATen code. Is there a flag I need to enable in order to use flash attention on CPU?
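For what it's worth, the flash_attn pip package (what the Hub code tries to import) is a separate CUDA-only project, distinct from PyTorch's built-in scaled_dot_product_attention kernels, which do run on CPU. Transformers routes through those when you pass attn_implementation="sdpa" to from_pretrained, so you likely don't need a special flag. A small sketch showing the SDPA call itself works on CPU (the tensor shapes are made up for illustration):

```python
import torch
import torch.nn.functional as F

# Toy shapes: (batch, num_heads, seq_len, head_dim)
q = torch.randn(1, 4, 16, 32)
k = torch.randn(1, 4, 16, 32)
v = torch.randn(1, 4, 16, 32)

# Runs on plain CPU tensors; PyTorch dispatches to its native ATen
# attention kernels under the hood, no flash_attn package required.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(tuple(out.shape))  # (1, 4, 16, 32)
```

The warning from the mini models is then expected and benign: it only says the flash_attn CUDA path is not in use, not that attention is broken.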