Closed: 6sixteen closed this issue 3 months ago
cc @ArthurZucker @Rocketknight1
Yeah, the PR to integrate Phi-3 with transformers has already been merged here. There hasn't been a stable release that includes it yet, which is why the pip version behaves differently from an install directly from source. So you have to do the latter for now.
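For reference, installing from source is just a pip install pointed at the GitHub repo (this assumes you want the current main branch):
!pip install git+https://github.com/huggingface/transformers.git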
That being said, I also see the message "You are not running the flash-attention implementation, expect numerical differences."
pop up, and it seems to come from the first line of the forward function, although I don't understand what this message means myself.
Maybe @gugarosa can help, since this was his PR.
Hi! This might be off topic, but I'm also trying to install the flash-attn package to run the Phi-3 model. I've run into an issue trying to install flash-attn from its GitHub repo with "python setup.py install".
I get a lot of warnings like the ones below; it does not crash, but it does not seem to ever finish either. Has anyone run into something similar and managed to solve it?
(I'm using CUDA 12.1 and PyTorch 2.3.)
"/usr/local/cuda-12/include/cusparse.h:254:20: note: declared here
254 | struct pruneInfo pruneInfo_t CUSPARSE_DEPRECATED_TYPE;
| ^~~
/usr/local/cuda-12/include/cusparse.h:4868:366: warning: 'pruneInfo_t' is deprecated: The type will be removed in the next major release [-Wdeprecated-declarations]
4868 | cusparseSpruneCsr2csrByPercentage(cusparseHandle_t handle,
| ^
/usr/local/cuda-12/include/cusparse.h:254:20: note: declared here
254 | struct pruneInfo* pruneInfo_t CUSPARSE_DEPRECATED_TYPE;
| ^~~
/usr/local/cuda-12/include/cusparse.h:4886:368: warning: 'pruneInfo_t' is deprecated: The type will be removed in the next major release [-Wdeprecated-declarations]
4886 | cusparseDpruneCsr2csrByPercentage(cusparseHandle_t handle,
| ^
/usr/local/cuda-12/include/cusparse.h:254:20: note: declared here
254 | struct pruneInfo* pruneInfo_t CUSPARSE_DEPRECATED_TYPE;
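One thing worth trying (this mirrors the flash-attn README rather than anything in this thread): install ninja so the build isn't compiled single-threaded, and install through pip with a cap on parallel compile jobs so the build doesn't exhaust RAM. The MAX_JOBS value is just an example; tune it to your machine.
!pip install ninja
!MAX_JOBS=4 pip install flash-attn --no-build-isolation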
I am trying to use the Phi-3-128k model and I get this problem:
.modeling_phi3:You are not running the flash-attention implementation, expect numerical differences. Killed
Has anyone faced this error? How do I solve it?
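A "Killed" at that point is usually the operating system's OOM killer terminating the process while the weights are loaded into host RAM, not a Transformers error. A minimal sketch of one thing to try, assuming the problem is host memory, is loading in half precision with the low-memory loading path:
import torch
from transformers import AutoModelForCausalLM

# Load shards one at a time instead of materializing the full model in RAM first
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    torch_dtype=torch.bfloat16,  # half precision roughly halves the host RAM needed
    low_cpu_mem_usage=True,
    device_map="auto",
    trust_remote_code=True,
)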
It works fine after telling the model which attention implementation to use. I don't think there is a problem with the Transformers lib:
!pip install transformers
!pip install flash-attn --no-build-isolation

from transformers import AutoModelForCausalLM
from transformers.utils import is_flash_attn_2_available

# Check that the flash-attn kernels are actually importable
is_flash_attn_2_available()
# True

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # avoids the "not running the flash-attention" warning
)
I still sometimes get the same message, and I also run into CUDA out-of-memory errors.
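If the GPU keeps running out of memory even with flash attention enabled, one option is 4-bit quantization through bitsandbytes. This is only a rough sketch under the assumption that bitsandbytes is installed (pip install bitsandbytes); the config values are common defaults, not something from this thread:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize weights to 4 bits at load time, cutting GPU memory to roughly a quarter of fp16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
)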
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
transformers version: 4.40.1

Who can help?
@Narsil

Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Expected behavior
How do I run Phi-3 with flash-attention? Out of curiosity: is there another reason that could keep flash-attention from being used, besides is_flash_attn_2_available()? Also, I can't find the message "You are not running the flash-attention implementation, expect numerical differences." in my install. Which file contains this message? And how can I build the latest transformers?
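For reference, the message comes from the Phi-3 modeling file, so one way to locate it is the sketch below. This assumes a transformers build recent enough to ship Phi-3 (a source install, per the earlier comment); with trust_remote_code=True the file may instead live in the Hugging Face modules cache rather than inside the transformers package.
# Print the path of the Phi-3 modeling file that logs the flash-attention warning
from transformers.models.phi3 import modeling_phi3
print(modeling_phi3.__file__)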