huggingface / text-embeddings-inference

A blazing fast inference solution for text embeddings models
https://huggingface.co/docs/text-embeddings-inference/quick_tour
Apache License 2.0

Different behavior between SentenceTransformer and TEI when using gte-large-en-v1.5 #358

Closed Smityz closed 1 hour ago

Smityz commented 1 month ago

System Info

$ text-embeddings-router --version
text-embeddings-router 1.5.0


Reproduction

# using TEI
model=Alibaba-NLP/gte-large-en-v1.5
text-embeddings-router --model-id $model --port 8080
curl -X POST "http://localhost:8080/embeddings" \
     -H "Content-Type: application/json" \
     -d '{"input":["Dimension table for main account?"]}'

<Response [200]>
Alibaba-NLP/gte-large-en-v1.5
[-0.0006371783,-0.03931647,-0.010235489,-0.019322978,-0.014273809,0.022573953]
# using SentenceTransformer
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True)
embeddings = model.encode(['Dimension table for main account?'])
print(list(embeddings[0][:6]))

[-0.015188057, -0.9458093, -0.24485634, -0.4617836, -0.3435278, 0.53972]

When using SentenceTransformer, it downloads new code files from Alibaba-NLP/new-impl, while TEI may still be using the original model implementation.

/home/smilencer/miniconda3/envs/ml/lib/python3.12/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
configuration.py: 7.13kB [00:00, 25.2MB/s]
A new version of the following files was downloaded from https://huggingface.co/Alibaba-NLP/new-impl:
- configuration.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
modeling.py: 59.0kB [00:00, 350kB/s]
A new version of the following files was downloaded from https://huggingface.co/Alibaba-NLP/new-impl:
- modeling.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
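
The warning above suggests pinning a revision so the remote code files are not silently re-downloaded. A minimal sketch of what that could look like on the transformers side (the revision values below are placeholders, not actual commits):

# Hedged sketch: pin the weights revision and, if your transformers version
# supports it, the remote-code revision, so configuration.py / modeling.py
# are not updated without notice. The values below are placeholders.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-large-en-v1.5")
model = AutoModel.from_pretrained(
    "Alibaba-NLP/gte-large-en-v1.5",
    trust_remote_code=True,
    revision="main",        # weights/config revision (placeholder)
    code_revision="main",   # remote-code revision (placeholder)
)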

Is there any way to make TEI use Alibaba-NLP/new-impl? I tried modifying the repo files per https://huggingface.co/Alibaba-NLP/new-impl/discussions/2, but it did not work.

Expected behavior

The embedding results should be the same.

satishlokkoju commented 1 day ago

We face the same issue with the "Alibaba-NLP/gte-Qwen2-1.5B-instruct" model, using a local endpoint with Docker:

model=Alibaba-NLP/gte-Qwen2-1.5B-instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.5 --model-id $model

Call the endpoint

curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":"search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten"}' \
    -H 'Content-Type: application/json'

Output[:6]:

[-0.02018845,0.011100909,-0.016257592,-0.028433666,0.008306849,0.011052972,0.043034]

Using the Sentence Transformers / Transformers library


import torch
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct", trust_remote_code=False)

model.max_seq_length = 8192

queries = [
    "search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten",
]

query_embeddings = model.encode(queries)

Output[:6]

[-0.01031005, -0.0228409 , -0.00298887, -0.01025489, -0.00989748, 0.01034117]

Using transformers

import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel

print("CUDA available:", torch.cuda.is_available())

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)
torch.set_default_device(device) 
def last_token_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'

queries = [
    'search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten'
]

tokenizer = AutoTokenizer.from_pretrained('Alibaba-NLP/gte-Qwen2-1.5B-instruct', trust_remote_code=False)
model = AutoModel.from_pretrained('Alibaba-NLP/gte-Qwen2-1.5B-instruct', trust_remote_code=False)
model = model.to(device)
max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(queries, max_length=max_length, padding=True, truncation=True, return_tensors='pt')

# Move tensors to the correct device, but keep them as integers
batch_dict = {k: v.to(device) for k, v in batch_dict.items()}

print("Model device:", next(model.parameters()).device)
print("Input device:", batch_dict['input_ids'].device)
print("Input dtype:", batch_dict['input_ids'].dtype)

# Convert the model's output to float16 if necessary
with torch.cuda.amp.autocast(dtype=torch.float16):
    outputs = model(**batch_dict)

embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)

Output:

[-0.0103, -0.0229, -0.0030, -0.0102, -0.0098,  0.0105]

Similar to Sentence Transformers, but completely different from the output generated by TEI in Docker.

satishlokkoju commented 1 day ago

Also, we observed that flash attention is used by default for Qwen2 in TEI. We tried enabling flash attention in the Sentence Transformers library but observed a similar variation as above.
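
For reference, this is roughly how flash attention can be requested through Sentence Transformers (a sketch only; it assumes the flash-attn package is installed and that model_kwargs are forwarded to transformers):

import torch
from sentence_transformers import SentenceTransformer

# Sketch: request flash attention via model_kwargs, which Sentence
# Transformers forwards to AutoModel.from_pretrained. Assumes flash-attn
# is installed and a supported GPU is available.
model = SentenceTransformer(
    "Alibaba-NLP/gte-Qwen2-1.5B-instruct",
    model_kwargs={
        "attn_implementation": "flash_attention_2",
        "torch_dtype": torch.float16,
    },
)
embeddings = model.encode(
    ["search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten"]
)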

OlivierDehaene commented 3 hours ago

@Smityz,

It is just that normalisation is enabled by default on the /embeddings route.

curl -X POST "http://localhost:8080/embeddings" \
     -H "Content-Type: application/json" \
     -d '{"input":["Dimension table for main account?"]}'
# [[-0.0006319029,-0.039351847,-0.010187608,-0.019213213,-0.014293012,0.022455906, ...]

curl -X POST "http://localhost:8080" \
     -H "Content-Type: application/json" \
     -d '{"inputs":["Dimension table for main account?"], "normalize": true}'
# [[-0.0006319029,-0.039351847,-0.010187608,-0.019213213,-0.014293012,0.022455906, ...]

curl -X POST "http://localhost:8080" \
     -H "Content-Type: application/json" \
     -d '{"inputs":["Dimension table for main account?"], "normalize": false}'
# [[-0.015187595,-0.9458097,-0.24485607,-0.46178377,-0.3435282,0.5397209,...]
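
Equivalently, normalising the Sentence Transformers output reproduces the /embeddings result. A minimal sketch (normalize_embeddings is the Sentence Transformers flag for this):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True)

# L2-normalise the embeddings, matching what the /embeddings route
# (or "normalize": true on the default route) returns.
embeddings = model.encode(
    ["Dimension table for main account?"],
    normalize_embeddings=True,
)
print(list(embeddings[0][:6]))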
OlivierDehaene commented 1 hour ago

@satishlokkoju,

Regarding Alibaba-NLP/gte-Qwen2-1.5B-instruct, there are multiple factors at play:

  1. For this model, Alibaba made some modifications to the Qwen2 architecture, so you need to set trust_remote_code=True; otherwise the embeddings are not correct.

  2. There is a bug in Alibaba's code where a default value is not set correctly. See this thread where I pointed out the bug to their team. Unfortunately they failed to patch it correctly.

TEI re-implements the model entirely and does not suffer from either issue.

If we properly set all of these arguments, we get the following code:

import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel

print("CUDA available:", torch.cuda.is_available())

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)
torch.set_default_device(device) 
def last_token_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'

queries = [
    'search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten'
]

# We need `trust_remote_code`
tokenizer = AutoTokenizer.from_pretrained('Alibaba-NLP/gte-Qwen2-1.5B-instruct', trust_remote_code=True)
model = AutoModel.from_pretrained('Alibaba-NLP/gte-Qwen2-1.5B-instruct', trust_remote_code=True)
model = model.to(device)
max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(queries, return_tensors='pt')

# Move tensors to the correct device, but keep them as integers
batch_dict = {k: v.to(device) for k, v in batch_dict.items()}

print("Model device:", next(model.parameters()).device)
print("Input device:", batch_dict['input_ids'].device)
print("Input dtype:", batch_dict['input_ids'].dtype)

# Convert the model's output to float16 if necessary
with torch.cuda.amp.autocast(dtype=torch.float16):
    # We need to set `is_causal=False`
    outputs = model(**batch_dict, is_causal=False)

embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)

Output:

tensor([[-0.0202,  0.0110, -0.0163,  ..., -0.0360,  0.0125, -0.0195]],
       device='cuda:0', grad_fn=<DivBackward0>)

which is the output from TEI for this query.
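
One way to double-check is to query the running TEI container and compare against the local embedding (a sketch that continues from the code above; it assumes the Docker endpoint from the earlier comment is serving on port 8080):

import requests
import torch
import torch.nn.functional as F

# Sketch: compare the locally computed embedding with TEI's output.
# Assumes the TEI container from the earlier comment is running on port 8080
# and that `queries` and `embeddings` come from the snippet above.
resp = requests.post(
    "http://127.0.0.1:8080/embed",
    json={"inputs": queries[0]},
    headers={"Content-Type": "application/json"},
)
tei_embedding = torch.tensor(
    resp.json()[0], device=embeddings.device, dtype=embeddings.dtype
)

# Cosine similarity should be ~1.0 if both pipelines agree.
print(F.cosine_similarity(embeddings[0], tei_embedding, dim=0))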

satishlokkoju commented 31 minutes ago

Thanks for quickly figuring out the root cause. Also, thanks for open-sourcing this project.

Quick question: what was the reason for choosing Rust, other than the performance gains?