netw0rkf10w opened this issue 3 months ago
Can you try with `-e USE_FLASH_ATTENTION=True`?
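That is, passing it as a Docker environment variable at launch. A minimal sketch, where the image tag and model ID are assumptions taken from the rest of this thread:

```shell
docker run --gpus all -p 8080:80 \
    -e USE_FLASH_ATTENTION=True \
    ghcr.io/huggingface/text-embeddings-inference:turing-1.5 \
    --model-id Alibaba-NLP/gte-large-en-v1.5
```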
@OlivierDehaene Thanks. I have tried with `-e USE_FLASH_ATTENTION=True`, but the results are the same. For your information, I don't have this issue on CPU (using the Docker image `ghcr.io/huggingface/text-embeddings-inference:cpu-1.5`).
And can you also try with `-e DTYPE=float32`?
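Same idea as the sketch above, swapping in the dtype override to force full-precision weights (all flags other than `-e DTYPE=float32` are assumptions carried over):

```shell
docker run --gpus all -p 8080:80 \
    -e DTYPE=float32 \
    ghcr.io/huggingface/text-embeddings-inference:turing-1.5 \
    --model-id Alibaba-NLP/gte-large-en-v1.5
```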
I can confirm I get the same issue with this model, on T4 and L4 GPUs, with TEI 1.5.0 and 1.4.0. I tried setting `DTYPE=float32` as an environment variable in the Hugging Face Inference Endpoint and I still see the issue. I tried `USE_FLASH_ATTENTION` too.

I might be totally wrong, but it might be related to a caching mechanism, because if I repeat the query, sometimes I get `None` and sometimes I get the proper vector.
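If it helps anyone reproduce the flakiness, here is a quick loop (assuming the server listens on port 8080, as in the sketches above) that fires the same query repeatedly and counts how many responses come back with nulls:

```shell
# Send the same single-sentence query 20 times and count
# how many responses contain null instead of real embeddings.
for i in $(seq 1 20); do
    curl -s 127.0.0.1:8080/embed \
        -X POST \
        -H 'Content-Type: application/json' \
        -d '{"inputs": ["Hello?"]}'
    echo
done | grep -c null
```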
For the L4 GPUs, which image of TEI did you use? Did you use `ghcr.io/huggingface/text-embeddings-inference:89-1.5` or the default `ghcr.io/huggingface/text-embeddings-inference:1.5` image?
The default, I think; this is what HF sets up automatically. But I can double-check and try `89-1.5` as well.
OK, so it seems that my container was still the Turing one when I switched from T4 to L4. I changed the container to the default `1.5` image and I don't get `None` anymore. So the issue is definitely limited to the Turing 1.5.0 build.
@qherreros By "switching from T4 to L4", did you mean switching the GPU type on your machine or switching the Docker image?
I tried the default image `ghcr.io/huggingface/text-embeddings-inference:1.5` but got an incompatibility error:

```
Error: Could not create backend

Caused by:
    Could not start backend: Runtime compute cap 75 is not compatible with compile time compute cap 80
```
So what I did was go back to the `turing-1.5` image.
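(For context: compute capability 75 is Turing, e.g. the T4, while 80 is Ampere, e.g. the A100, so the default image cannot run on a T4 at all. If you want to check what your card reports, recent `nvidia-smi` versions expose it; the `compute_cap` query field below may not exist on older drivers:)

```shell
# Print the GPU's compute capability (e.g. 7.5 for a T4, 8.9 for an L4)
nvidia-smi --query-gpu=name,compute_cap --format=csv
```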
@OlivierDehaene Could you let us know whether you plan to fix this, or whether we should give up on `gte-large-en-v1.5` for T4? Thank you in advance for your reply.
I don't think this is necessarily a bug per se. The TEI Docker images are built to be GPU-architecture specific: there are separate image tags, each built with a different architecture in mind. This reduces bloat and means each image ships only the build files that are absolutely essential for its target hardware.
The numerical instability issues with T4 GPUs are documented here (https://github.com/huggingface/text-embeddings-inference/issues/53), which is also why Flash Attention is marked as experimental for Turing GPUs.

To solve this for now, I would recommend switching to an Ampere or Lovelace series GPU and using the corresponding image tag (https://github.com/huggingface/text-embeddings-inference?tab=readme-ov-file#docker-images) so that there are no compatibility issues!
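For quick reference, the architecture-to-tag mapping from that README looks roughly like this at the time of writing (double-check the link above, since tags change across releases):

```shell
# GPU architecture -> TEI 1.5 image tag (from the README's Docker Images table)
#   Turing (T4, RTX 2000 series)       -> ghcr.io/huggingface/text-embeddings-inference:turing-1.5  (experimental)
#   Ampere 80 (A100, A30)              -> ghcr.io/huggingface/text-embeddings-inference:1.5
#   Ampere 86 (A10, A40)               -> ghcr.io/huggingface/text-embeddings-inference:86-1.5
#   Ada Lovelace (L4, RTX 4000 series) -> ghcr.io/huggingface/text-embeddings-inference:89-1.5
#   Hopper (H100)                      -> ghcr.io/huggingface/text-embeddings-inference:hopper-1.5  (experimental)
docker pull ghcr.io/huggingface/text-embeddings-inference:89-1.5  # e.g. for an L4
```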
Disclaimer: I'm not a maintainer. So I'm not sure if @OlivierDehaene would have a different take on this.
System Info

I'm using the current Docker image `ghcr.io/huggingface/text-embeddings-inference:turing-1.5` on Debian 11 with CUDA driver 12.2 and an Nvidia T4 GPU.
Reproduction
Launch the server:
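A minimal launch command consistent with the System Info above; the port mapping and exact model ID are assumptions:

```shell
docker run --gpus all -p 8080:80 \
    ghcr.io/huggingface/text-embeddings-inference:turing-1.5 \
    --model-id Alibaba-NLP/gte-large-en-v1.5
```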
Then make a request:
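For example, against the `/embed` route (port as assumed above):

```shell
curl 127.0.0.1:8080/embed \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": ["Hello?"]}'
```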
When the input is a single short sentence, for example `{"inputs": ["Hello?"]}` or `{"inputs": ["What is Deep Learning?"]}`, I obtain all-null results. But two short sentences with different lengths work. Some examples:

- `{"inputs": ["Hello!"]}`: NULL
- `{"inputs": ["What is Deep Learning?"]}`: NULL
- `{"inputs": ["Hello!", "Hello!"]}`: NULL
- `{"inputs": ["What is Deep Learning?", "What is Deep Learning?"]}`: correct results
- `{"inputs": ["Hello!", "What is Deep Learning?"]}`: correct results
- `{"inputs": ["Today is a very beautiful day."]}`: NULL
- `{"inputs": ["Today is a very beautiful day. What do you think?"]}`: correct results

This does not happen with `all-MiniLM-L6-v2`, for example.

Expected behavior

There should be no nulls in the output.