netw0rkf10w opened this issue 3 months ago
Can you try with `-e USE_FLASH_ATTENTION=True`?
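That is, passing it as a Docker environment variable at launch. A minimal sketch, where the image tag and model ID are assumptions taken from the rest of this thread:

```shell
docker run --gpus all -p 8080:80 \
    -e USE_FLASH_ATTENTION=True \
    ghcr.io/huggingface/text-embeddings-inference:turing-1.5 \
    --model-id Alibaba-NLP/gte-large-en-v1.5
```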
@OlivierDehaene Thanks. I have tried with `-e USE_FLASH_ATTENTION=True`, but the results are the same. For your information, I don't have this issue on CPU (using the Docker image `ghcr.io/huggingface/text-embeddings-inference:cpu-1.5`).
And can you also try with `-e DTYPE=float32`?
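Same idea as the sketch above, swapping in the dtype override to force full-precision weights (all flags other than `-e DTYPE=float32` are assumptions carried over):

```shell
docker run --gpus all -p 8080:80 \
    -e DTYPE=float32 \
    ghcr.io/huggingface/text-embeddings-inference:turing-1.5 \
    --model-id Alibaba-NLP/gte-large-en-v1.5
```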
I can confirm I get the same issue with this model, on T4 and L4 GPUs, with TEI 1.5.0 and 1.4.0. I tried setting `DTYPE=float32` as an environment variable in the Hugging Face Inference Endpoint and I still see the issue. I tried `USE_FLASH_ATTENTION` too.

I might be totally wrong, but it might be related to a caching mechanism, because if I repeat the query, sometimes I get `None` and sometimes I get the proper vector.
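If it helps anyone reproduce the flakiness, here is a quick loop (assuming the server listens on port 8080, as in the sketches above) that fires the same query repeatedly and counts how many responses come back with nulls:

```shell
# Send the same single-sentence query 20 times and count
# how many responses contain null instead of real embeddings.
for i in $(seq 1 20); do
    curl -s 127.0.0.1:8080/embed \
        -X POST \
        -H 'Content-Type: application/json' \
        -d '{"inputs": ["Hello?"]}'
    echo
done | grep -c null
```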
For the L4 GPUs, which image of TEI did you use? Did you use `ghcr.io/huggingface/text-embeddings-inference:89-1.5` or the default `ghcr.io/huggingface/text-embeddings-inference:1.5` image?
The default, I think; this is what HF sets up automatically. But I can double-check and try `89-1.5` as well.
OK, so it seems that my container was still the Turing one when I switched from T4 to L4. I changed the container to the default `1.5` image and I don't get `None` anymore. So the issue is definitely limited to the Turing 1.5.0 build.
@qherreros By "switching from T4 to L4", did you mean switching the GPU type on your machine or switching the Docker image?
I tried the default image `ghcr.io/huggingface/text-embeddings-inference:1.5` but got an incompatibility error:

```
Error: Could not create backend

Caused by:
    Could not start backend: Runtime compute cap 75 is not compatible with compile time compute cap 80
```
So what I did was go back to the `turing-1.5` image.
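(For context: compute capability 75 is Turing, e.g. the T4, while 80 is Ampere, e.g. the A100, so the default image cannot run on a T4 at all. If you want to check what your card reports, recent `nvidia-smi` versions expose it; the `compute_cap` query field below may not exist on older drivers:)

```shell
# Print the GPU's compute capability (e.g. 7.5 for a T4, 8.9 for an L4)
nvidia-smi --query-gpu=name,compute_cap --format=csv
```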
@OlivierDehaene Could you let us know whether you plan to fix this, or whether we should give up on `gte-large-en-v1.5` for T4? Thank you in advance for your reply.
I don't think this is necessarily a bug per se. The TEI Docker images are built to be GPU-architecture specific: there are separate image tags, each built with a different architecture in mind. This reduces bloat and means each image ships only the build files that are absolutely essential for its target hardware.
The numerical instability issues with T4 GPUs are documented here (https://github.com/huggingface/text-embeddings-inference/issues/53), which is also why Flash Attention is marked as experimental for Turing GPUs.

To solve this for now, I would recommend switching to an Ampere or Lovelace series GPU and using the corresponding image tag (https://github.com/huggingface/text-embeddings-inference?tab=readme-ov-file#docker-images) so that there are no compatibility issues!
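For quick reference, the architecture-to-tag mapping from that README looks roughly like this at the time of writing (double-check the link above, since tags change across releases):

```shell
# GPU architecture -> TEI 1.5 image tag (from the README's Docker Images table)
#   Turing (T4, RTX 2000 series)       -> ghcr.io/huggingface/text-embeddings-inference:turing-1.5  (experimental)
#   Ampere 80 (A100, A30)              -> ghcr.io/huggingface/text-embeddings-inference:1.5
#   Ampere 86 (A10, A40)               -> ghcr.io/huggingface/text-embeddings-inference:86-1.5
#   Ada Lovelace (L4, RTX 4000 series) -> ghcr.io/huggingface/text-embeddings-inference:89-1.5
#   Hopper (H100)                      -> ghcr.io/huggingface/text-embeddings-inference:hopper-1.5  (experimental)
docker pull ghcr.io/huggingface/text-embeddings-inference:89-1.5  # e.g. for an L4
```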
Disclaimer: I'm not a maintainer. So I'm not sure if @OlivierDehaene would have a different take on this.
System Info

I'm using the current Docker image `ghcr.io/huggingface/text-embeddings-inference:turing-1.5` on Debian 11 with CUDA driver 12.2 and an Nvidia T4 GPU.
Reproduction
Launch the server:
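A minimal launch command consistent with the System Info above; the port mapping and exact model ID are assumptions:

```shell
docker run --gpus all -p 8080:80 \
    ghcr.io/huggingface/text-embeddings-inference:turing-1.5 \
    --model-id Alibaba-NLP/gte-large-en-v1.5
```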
Then make a request:
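For example, against the `/embed` route (port as assumed above):

```shell
curl 127.0.0.1:8080/embed \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": ["Hello?"]}'
```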
When the input is a single short sentence, for example `{"inputs": ["Hello?"]}` or `{"inputs": ["What is Deep Learning?"]}`, I obtain all-null results. But two short sentences with different lengths work. Some examples:

- `{"inputs": ["Hello!"]}`: NULL
- `{"inputs": ["What is Deep Learning?"]}`: NULL
- `{"inputs": ["Hello!", "Hello!"]}`: NULL
- `{"inputs": ["What is Deep Learning?", "What is Deep Learning?"]}`: correct results
- `{"inputs": ["Hello!", "What is Deep Learning?"]}`: correct results
- `{"inputs": ["Today is a very beautiful day."]}`: NULL
- `{"inputs": ["Today is a very beautiful day. What do you think?"]}`: correct results

This does not happen with `all-MiniLM-L6-v2`, for example.

Expected behavior

There should be no nulls in the output.