We face the same issue with the "Alibaba-NLP/gte-Qwen2-1.5B-instruct" model. Local endpoint using Docker:
model=Alibaba-NLP/gte-Qwen2-1.5B-instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.5 --model-id $model
Call the endpoint:
curl 127.0.0.1:8080/embed \
-X POST \
-d '{"inputs":"search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten"}' \
-H 'Content-Type: application/json'
Output[:7]:
[-0.02018845,0.011100909,-0.016257592,-0.028433666,0.008306849,0.011052972,0.043034]
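The same call can be made from Python; a minimal sketch, assuming the container above is still listening on 127.0.0.1:8080 (the requests usage is illustrative, not part of TEI):
import requests

# POST to TEI's /embed route with the same payload as the curl call above.
response = requests.post(
    "http://127.0.0.1:8080/embed",
    json={"inputs": "search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten"},
)
response.raise_for_status()
# /embed returns one embedding vector per input.
tei_embedding = response.json()[0]
print(tei_embedding[:6])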
Using the Sentence Transformers library
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct", trust_remote_code=False)
model.max_seq_length = 8192

queries = [
    "search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten",
]
query_embeddings = model.encode(queries)
Output[:6]:
[-0.01031005, -0.0228409 , -0.00298887, -0.01025489, -0.00989748, 0.01034117]
Using transformers
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

print("CUDA available:", torch.cuda.is_available())
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)
torch.set_default_device(device)

def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'

queries = [
    'search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten'
]

tokenizer = AutoTokenizer.from_pretrained('Alibaba-NLP/gte-Qwen2-1.5B-instruct', trust_remote_code=False)
model = AutoModel.from_pretrained('Alibaba-NLP/gte-Qwen2-1.5B-instruct', trust_remote_code=False)
model = model.to(device)
max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(queries, max_length=max_length, padding=True, truncation=True, return_tensors='pt')
# Move tensors to the correct device, but keep them as integers
batch_dict = {k: v.to(device) for k, v in batch_dict.items()}
print("Model device:", next(model.parameters()).device)
print("Input device:", batch_dict['input_ids'].device)
print("Input dtype:", batch_dict['input_ids'].dtype)

# Run the forward pass under float16 autocast
with torch.cuda.amp.autocast(dtype=torch.float16):
    outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
Output[:6]:
[-0.0103, -0.0229, -0.0030, -0.0102, -0.0098, 0.0105]
This matches the Sentence Transformers output but is completely different from the TEI output generated with Docker.
Also, we observed that Flash Attention is used by default for Qwen2 in TEI. We tried enabling Flash Attention in the Sentence Transformers library (see the sketch below) but observed a similar discrepancy as above.
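For reference, this is roughly how Flash Attention can be requested through Sentence Transformers; a minimal sketch, assuming the flash-attn package is installed and a compatible GPU is available:
import torch
from sentence_transformers import SentenceTransformer

# model_kwargs is forwarded to transformers' from_pretrained, so
# Flash Attention 2 can be requested this way; it requires the
# flash-attn package and half-precision weights.
model = SentenceTransformer(
    "Alibaba-NLP/gte-Qwen2-1.5B-instruct",
    model_kwargs={
        "attn_implementation": "flash_attention_2",
        "torch_dtype": torch.float16,
    },
)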
@Smityz,
It is just that for /embeddings, normalisation is set:
curl -X POST "http://localhost:8080/embeddings" \
-H "Content-Type: application/json" \
-d '{"input":["Dimension table for main account?"]}'
# [[-0.0006319029,-0.039351847,-0.010187608,-0.019213213,-0.014293012,0.022455906, ...]
curl -X POST "http://localhost:8080" \
-H "Content-Type: application/json" \
-d '{"inputs":["Dimension table for main account?"], "normalize": true}'
# [[-0.0006319029,-0.039351847,-0.010187608,-0.019213213,-0.014293012,0.022455906, ...]
curl -X POST "http://localhost:8080" \
-H "Content-Type: application/json" \
-d '{"inputs":["Dimension table for main account?"], "normalize": false}'
# [[-0.015187595,-0.9458097,-0.24485607,-0.46178377,-0.3435282,0.5397209,...]
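To make the relationship explicit: the normalized and unnormalized outputs differ only by a constant scale factor, the L2 norm of the full embedding. A minimal sketch over the truncated values above (numpy used here purely for illustration):
import numpy as np

# First six values of the normalize:false and normalize:true responses above.
raw = np.array([-0.015187595, -0.9458097, -0.24485607, -0.46178377, -0.3435282, 0.5397209])
normalized = np.array([-0.0006319029, -0.039351847, -0.010187608, -0.019213213, -0.014293012, 0.022455906])

# The element-wise ratio is (approximately) one constant: the L2 norm of
# the full vector, which cannot be recomputed from a truncated slice.
print(raw / normalized)  # ~24.03 in every position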
@satishlokkoju,
Regarding Alibaba-NLP/gte-Qwen2-1.5B-instruct, there are multiple factors at play:
1. For this model, Alibaba made some modifications to the Qwen2 architecture, and you need to set trust_remote_code=True, otherwise the embeddings are not correct.
2. There is a bug in Alibaba's code where a default value is not set correctly. See this thread where I pointed out the bug to their team. Unfortunately, they failed to patch it correctly.
TEI re-implements the model entirely and does not suffer from either issue.
If we properly set all these arguments, we end up with the following code:
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

print("CUDA available:", torch.cuda.is_available())
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)
torch.set_default_device(device)

def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'

queries = [
    'search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten'
]

# We need `trust_remote_code=True` to load Alibaba's modified architecture
tokenizer = AutoTokenizer.from_pretrained('Alibaba-NLP/gte-Qwen2-1.5B-instruct', trust_remote_code=True)
model = AutoModel.from_pretrained('Alibaba-NLP/gte-Qwen2-1.5B-instruct', trust_remote_code=True)
model = model.to(device)

# Tokenize the input texts
batch_dict = tokenizer(queries, return_tensors='pt')
# Move tensors to the correct device, but keep them as integers
batch_dict = {k: v.to(device) for k, v in batch_dict.items()}
print("Model device:", next(model.parameters()).device)
print("Input device:", batch_dict['input_ids'].device)
print("Input dtype:", batch_dict['input_ids'].dtype)

# Run the forward pass under float16 autocast
with torch.cuda.amp.autocast(dtype=torch.float16):
    # We need to set `is_causal=False` to work around the incorrect default
    outputs = model(**batch_dict, is_causal=False)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
Output:
tensor([[-0.0202,  0.0110, -0.0163,  ..., -0.0360,  0.0125, -0.0195]],
       device='cuda:0', grad_fn=<DivBackward0>)
which matches the output from TEI for this query.
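To verify this end to end, the two backends can be compared numerically; a minimal sketch, assuming the TEI container from the first comment is still running on 127.0.0.1:8080 and that queries, embeddings, device, torch, and F come from the code above (the requests usage is illustrative):
import requests

# Fetch the TEI embedding for the same query.
resp = requests.post(
    "http://127.0.0.1:8080/embed",
    json={"inputs": queries[0]},
)
resp.raise_for_status()
tei_embedding = torch.tensor(resp.json()[0], device=device)

# Cosine similarity should now be ~1.0, since both paths agree.
similarity = F.cosine_similarity(embeddings[0].float(), tei_embedding.float(), dim=0)
print("Cosine similarity:", similarity.item())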
Thanks for quickly figuring out the root cause, and thanks for open-sourcing this project.
Quick question: what was the reason for choosing Rust, other than the performance gains?
Reproduction
When using SentenceTransformer, it will download a new model named Alibaba-NLP/new-impl, but TEI may use the original model. Is there any way to make TEI use Alibaba-NLP/new-impl? I tried to modify the repo files (ref https://huggingface.co/Alibaba-NLP/new-impl/discussions/2), but it's not working.
Expected behavior
The embedding results are the same.