superchar opened this issue 4 days ago
@superchar hi. I guess you can disable flash attention by setting the dtype to `float32` instead of `float16` in general. However, AFAIK there is currently only a Flash GTE implementation, which doesn't support CPU or GPUs without flash attention. Maybe we could implement one.
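For reference, a minimal sketch of forcing `float32` through the launcher (image tag and model ID are taken from the report below; the exact flags are an assumption, so check the launcher's `--help` on your image):

```shell
# Sketch: run the Turing image with the dtype forced to float32 so the
# float16 path is not selected. Flag names assume the standard
# text-embeddings-inference launcher arguments.
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:turing-1.5 \
  --model-id Alibaba-NLP/gte-multilingual-base \
  --dtype float32
```

As noted above, this may still not help for GTE specifically, since the only GTE implementation at the moment is the flash-attention one.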
System Info
Model - Alibaba-NLP/gte-multilingual-base
Image - text-embeddings-inference:turing-1.5
Azure VM - Standard_NC4as_T4_v3
GPU - Nvidia Tesla T4
AKS version - 1.28.14
OS - Ubuntu 22.04
Command -
Reproduction
When executing the following request the first time:
The response is the following:
However, when repeating the same request a second time, I am getting:
I tried setting `USE_FLASH_ATTENTION=False`; however, it seems that this env variable is ignored for GTE models. I understand that Turing support is marked as experimental, but is there any way to run this on a T4, with or without Flash Attention v1?

Expected behavior
Do not get `null` values instead of the embedding vector.
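For illustration, a request of the shape described in the reproduction against TEI's `/embed` route (the input string here is hypothetical, not the original payload):

```shell
# Illustrative only: hypothetical input, not the payload from the report.
# Typical shape of an embedding request against a running TEI instance.
curl 127.0.0.1:8080/embed \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is Deep Learning?"}'
```

On the failing second call, the array comes back with `null` entries; a likely explanation is NaN values in the computed embedding, which serialize to `null` in JSON.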