huggingface / text-embeddings-inference

A blazing fast inference solution for text embeddings models
https://huggingface.co/docs/text-embeddings-inference/quick_tour

Support tokenized input #273

Open slyt opened 3 months ago

slyt commented 3 months ago

Feature request

The OpenAI API /embeddings endpoint accepts input both as text (a string or list of strings) and as tokenized input (a list of token IDs, or a list of lists of token IDs). text-embeddings-inference should also support lists of integers (token IDs) as input for creating embeddings.
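For reference, here is a sketch of the two input shapes with the OpenAI Python client; the base_url, port, and token IDs are placeholders for a local TEI instance, not values from this thread:

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-xxx")

# 1) Plain-text input: a string or a list of strings.
resp = client.embeddings.create(model="jina-v2-base", input=["This is a test query."])
print(len(resp.data[0].embedding))

# 2) Pre-tokenized input: a list of token-ID lists (the IDs here are illustrative).
resp = client.embeddings.create(model="jina-v2-base", input=[[101, 2023, 2003, 1037, 3231, 102]])
print(len(resp.data[0].embedding))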

Motivation

Without this feature, tools like langchain-openai do not work out of the box: the OpenAIEmbeddings class tokenizes text before sending it to the embedding endpoint, so the following code throws errors:

from langchain_openai import OpenAIEmbeddings

base_url = "http://127.0.0.1:8080/v1"  # OpenAI-compatible TEI route (host/port may differ)
api_key = "sk-xxx"  # placeholder; TEI does not check the key
embeddings = OpenAIEmbeddings(  # the stock class, which tokenizes input client-side
    model="jina-v2-base",
    base_url=base_url,
    openai_api_key=api_key,
)
text = "This is a test query."
query_result = embeddings.embed_query(text)  # this call errors out against TEI
print(query_result)

A workaround is to force LangChain not to tokenize, although it would be cleaner if TEI supported tokens as input:

from typing import Iterable, List, Tuple, Union

from langchain_openai import OpenAIEmbeddings

class CustomOpenAIEmbeddings(OpenAIEmbeddings):
    """OpenAIEmbeddings variant that skips client-side tokenization."""

    def _tokenize(
        self, texts: List[str], chunk_size: int
    ) -> Tuple[Iterable[int], List[Union[List[int], str]], List[int]]:
        # Pass the raw strings through instead of token IDs, so TEI
        # tokenizes server-side with the model's own tokenizer.
        _iter = range(0, len(texts), chunk_size)
        tokens = texts
        indices = list(range(len(texts)))
        return _iter, tokens, indices

base_url = "http://127.0.0.1:8080/v1"  # OpenAI-compatible TEI route (host/port may differ)
api_key = "sk-xxx"  # placeholder; TEI does not check the key
embeddings = CustomOpenAIEmbeddings(  # use the custom class that doesn't tokenize the input
    model="jina-v2-base",
    base_url=base_url,
    openai_api_key=api_key,
)
text = "This is a test query."
query_result = embeddings.embed_query(text)
print(query_result)
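With _tokenize bypassed like this, the request body sent to TEI contains plain strings, so tokenization happens server-side with the served model's own vocabulary.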

Your contribution

I'm not familiar with Rust, so I don't think I'm comfortable implementing this myself, but I can help test and document the new functionality.

OlivierDehaene commented 2 months ago

We already accept token IDs on the OpenAI route. What errors do you see?
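For reference, a minimal way to check this directly (the port and token IDs below are placeholders, not values from this thread):

import requests

# POST pre-tokenized input straight to the OpenAI-compatible route.
resp = requests.post(
    "http://127.0.0.1:8080/v1/embeddings",  # assumed local TEI endpoint
    json={"model": "jina-v2-base", "input": [[101, 7592, 2088, 102]]},  # illustrative IDs
)
print(resp.status_code)  # 200 means the token-ID input was accepted
print(len(resp.json()["data"][0]["embedding"]))  # embedding dimension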

codybum commented 1 month ago

It is accepted, but it does not seem to work correctly in the context of LangChain.

I ran the following example with TEI as the backend and both a tokenized and a non-tokenized client. Example code: https://python.langchain.com/v0.1/docs/modules/data_connection/vectorstores/

The pre-tokenized client provides incorrect results.

Custom client (no client-side tokenization):
text: What did the president say about Ketanji Brown Jackson
Similarity search: [ And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. ]
Similarity search by vector: [ And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. ]

OpenAIEmbeddings client (pre-tokenized):
text: What did the president say about Ketanji Brown Jackson
Similarity search: [ It won’t look like much, but if you stop and look closely, you’ll see a “Field of dreams,” the ground on which America’s future will be built. ]
Similarity search by vector: [ It won’t look like much, but if you stop and look closely, you’ll see a “Field of dreams,” the ground on which America’s future will be built. ]
OpenAIEmbeddings Client text: What did the president say about Ketanji Brown Jackson Similarity search:[ It won’t look like much, but if you stop and look closely, you’ll see a “Field of dreams,” the ground on which America’s future will be built. ] Similarity search by vector:[ It won’t look like much, but if you stop and look closely, you’ll see a “Field of dreams,” the ground on which America’s future will be built. ]