langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License
89.3k stars · 14.08k forks

OpenAIEmbeddings model argument does not work #17409

Closed nyck33 closed 5 months ago

nyck33 commented 5 months ago

Checked other resources

Example Code

import os

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import SupabaseVectorStore
from langchain.retrievers.self_query.base import SelfQueryRetriever

embedding_model = os.environ.get("EMBEDDING_MODEL")
print(embedding_model)
embedding_dimension = os.environ.get("EMBEDDING_DIMENSION")
print(embedding_dimension)

# the langchain way
embeddings_model_lg = OpenAIEmbeddings(
    api_key=OPENAI_API_KEY,
    model=embedding_model,
    deployment=embedding_model,
    dimensions=int(embedding_dimension),
)

vectorstore = SupabaseVectorStore(
    client=supabase,
    embedding=embeddings_model_lg,
    table_name="documents",
    query_name="match_documents",
)

retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    verbose=True
)

# %%
# specify a relevant query
query = "How does tree help the boy make the crown? return results with relevance scores"
embedded_query = embeddings_model_lg.embed_query(query)
response = retriever.get_relevant_documents(query)

and in my .env

EMBEDDING_DIMENSION=256 
# edit this based on your model preference, e.g. text-embedding-3-small, text-embedding-ada-002
EMBEDDING_MODEL=text-embedding-3-large
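For reference, a minimal sketch of how these variables get from the `.env` file into the code above (the `setdefault` calls below only stand in for loading the `.env` file, e.g. via `python-dotenv`):

```python
import os

# Stand-in for loading the .env file; these are the values from the report
os.environ.setdefault("EMBEDDING_MODEL", "text-embedding-3-large")
os.environ.setdefault("EMBEDDING_DIMENSION", "256")

embedding_model = os.environ.get("EMBEDDING_MODEL")
# Environment variables are always strings, so the dimension must be cast
# to int before passing it to OpenAIEmbeddings(dimensions=...)
embedding_dimension = int(os.environ.get("EMBEDDING_DIMENSION"))
```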

Error Message and Stack Trace (if applicable)

2024-02-12 21:49:08,618:WARNING - Warning: model not found. Using cl100k_base encoding.
2024-02-12 21:49:09,055:INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-02-12 21:49:10,285:INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-02-12 21:49:10,295:INFO - Generated Query: query='tree help boy crown' filter=None limit=None
2024-02-12 21:49:10,296:WARNING - Warning: model not found. Using cl100k_base encoding.
2024-02-12 21:49:10,584:INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-02-12 21:49:11,104:INFO - HTTP Request: POST https://lhbeoisvtsilsquybifs.supabase.co/rest/v1/rpc/match_documents?limit=4 "HTTP/1.1 200 OK"

Note that these are only warnings; the HTTP requests themselves all return 200 OK.

Description

I want it to use the model I designated. Can I change the default in base.py?

    ...
    client: Any = Field(default=None, exclude=True)  #: :meta private:
    async_client: Any = Field(default=None, exclude=True)  #: :meta private:
    model: str = "text-embedding-ada-002"
    dimensions: Optional[int] = None
    """The number of dimensions the resulting o...

I can't believe the results are actually correct but this is a tiny tiny children's book so it could have been a fluke.

[Document(page_content='Once there was a tree.... and she loved a little boy. And everyday the boy would come and he would gather her leaves and make them into crowns and play king of the forest. He would climb up her trunk and swing from her branches and eat apples. And they would play hide-and-go-seek.'), Document(page_content='And the tree was happy. But time went by. And the boy grew older. And the tree was often alone. Then one day the boy came to the tree and the tree said, "Come, Boy, come and climb up my trunk and swing from my branches and eat apples and play in my shade and be happy.'), ...

System Info

(langchain) nyck33@nyck33-lenovo:/media/nyck33/65DA61B605B0A8C1/projects/langchain-deeplearning-ai-tutorial$ pip freeze | grep langchain
langchain==0.1.5
langchain-community==0.0.19
langchain-core==0.1.21
langchain-openai==0.0.5
dosubot[bot] commented 5 months ago

🤖

Hey there, @nyck33! Great to see you diving into the depths of LangChain once again. 🚀

Based on the information you've provided, it seems like the OpenAIEmbeddings class is not recognizing the model you've specified in your environment variables. This is likely due to the way the tiktoken library is used in the OpenAIEmbeddings class. If the model name is not found in the tiktoken library, it defaults to using the "cl100k_base" encoding.

This is evident in the following code snippet from the OpenAIEmbeddings class:

try:
    encoding = tiktoken.encoding_for_model(model_name)
except KeyError:
    logger.warning("Warning: model not found. Using cl100k_base encoding.")
    model = "cl100k_base"
    encoding = tiktoken.get_encoding(model)

This means that if the model you specified in your environment variables is not recognized by tiktoken, it will default to "cl100k_base".

To resolve this issue, you can either use a model name that is recognized by tiktoken, or modify the code to handle unrecognized models in a way that suits your needs. For example, you could raise an error instead of defaulting to "cl100k_base", or you could add a mechanism to add new models to tiktoken.
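A hedged sketch of the "raise instead of silently defaulting" option (the function name and the stand-in registry dict are hypothetical illustrations, not LangChain or tiktoken API):

```python
# Hypothetical strict lookup: a plain dict plays the role of
# tiktoken's model-to-encoding registry, for illustration only.
KNOWN_ENCODINGS = {
    "text-embedding-ada-002": "cl100k_base",
}

def strict_encoding_for_model(model_name: str) -> str:
    """Return the encoding name, raising instead of falling back."""
    try:
        return KNOWN_ENCODINGS[model_name]
    except KeyError:
        raise ValueError(
            f"Model {model_name!r} not registered; refusing to fall back "
            "to cl100k_base silently."
        )
```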

Regarding the tiktoken library, it is used in the OpenAIEmbeddings class in LangChain to handle tokenization of the input text when generating embeddings. This is particularly important when dealing with long inputs that exceed the maximum context length that the OpenAI API can handle.

In the _get_len_safe_embeddings method, if the tiktoken_enabled flag is set to True, the tiktoken library is used to tokenize the input text into chunks that respect the embedding_ctx_length (the maximum number of tokens to embed at once).

However, if the tiktoken_enabled flag is set to False, the HuggingFace AutoTokenizer is used for tokenization instead.
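Roughly, that chunking step can be sketched as follows (a simplification with made-up numbers, not the actual `_get_len_safe_embeddings` implementation):

```python
def chunk_tokens(tokens: list[int], ctx_length: int) -> list[list[int]]:
    """Split a token list into chunks of at most ctx_length tokens each."""
    return [tokens[i : i + ctx_length] for i in range(0, len(tokens), ctx_length)]

# e.g. 10 tokens with a context window of 4 -> chunks of sizes 4, 4, 2;
# each chunk is embedded separately and the results are combined
chunks = chunk_tokens(list(range(10)), 4)
```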

I hope this helps clarify the issue. If you have any further questions, feel free to ask.

Sources

#### About Dosu This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
nyck33 commented 5 months ago

So the model mentioned in your response has to do with tokenization and not the embeddings? I was concerned that it was not able to make the embedding vectors using the model and dimensions I chose:

```
EMBEDDING_DIMENSION=256
# edit this based on your model preference, e.g. text-embedding-3-small, text-embedding-ada-002
EMBEDDING_MODEL=text-embedding-3-large
```

but I guess it is doing that part okay. What are the differences between the tokenization models? You mention tiktoken, but what is that?

oliverwang15 commented 5 months ago

TL;DR:

Try:

pip install -U langchain tiktoken

Reasons

From my perspective, the problem may be that "text-embedding-3-small" and "text-embedding-3-large" were not yet registered in tiktoken, so the package couldn't find the encoding for these embedding models and used "cl100k_base" by default.

After updating, you may find the following in this file:

MODEL_TO_ENCODING: dict[str, str] = {
    # chat
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "gpt-3.5": "cl100k_base",  # Common shorthand
    "gpt-35-turbo": "cl100k_base",  # Azure deployment name
    # base
    "davinci-002": "cl100k_base",
    "babbage-002": "cl100k_base",
    # embeddings
    "text-embedding-ada-002": "cl100k_base",
    "text-embedding-3-small": "cl100k_base",
    "text-embedding-3-large": "cl100k_base",
    # DEPRECATED MODELS
    # text (DEPRECATED)
   ...
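Note from the table above that the new `-3` embedding models map to the very same `cl100k_base` encoding the fallback already picks, so before the upgrade the tokenization was identical and only the warning was spurious. A small sketch of the before/after behavior (plain dicts standing in for tiktoken's registry, not its actual API):

```python
# Stand-in registries: before and after the tiktoken upgrade.
OLD_REGISTRY = {"text-embedding-ada-002": "cl100k_base"}
NEW_REGISTRY = {
    "text-embedding-ada-002": "cl100k_base",
    "text-embedding-3-small": "cl100k_base",
    "text-embedding-3-large": "cl100k_base",
}

def encoding_with_fallback(registry: dict, model: str) -> tuple[str, bool]:
    """Mimic the lookup plus the warn-and-fall-back path; the bool
    records whether the model was actually found."""
    if model in registry:
        return registry[model], True   # found: no warning
    return "cl100k_base", False        # not found: warning, then fallback

# Before: falls back with a warning; after: resolves directly.
# The resulting encoding is the same either way.
before = encoding_with_fallback(OLD_REGISTRY, "text-embedding-3-large")
after = encoding_with_fallback(NEW_REGISTRY, "text-embedding-3-large")
```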
ARajgor commented 3 months ago

The issue still persists, @oliverwang15.

Lib versions ->

Langchain: 0.1.14
langchain-openai: 0.1.1
tiktoken: 0.6.0
openai: 1.16.1
pinecone: 3.2.2

my code

from langchain_openai import OpenAIEmbeddings
from pinecone import Pinecone

pc = Pinecone()

old_index = pc.Index(INDEX_NAME)

embed = OpenAIEmbeddings(model="text-embedding-3-small")
vectors = embed.embed_query("hello")
print(len(vectors))

old_index.upsert(vectors=[
    {"id": "A", "values": vectors},
  ]
)

printing: 1536

that's why my old_index shows a dimension mismatch:

{"code":3,"message":"Vector dimension 1536 does not match the dimension of the index 512",
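The mismatch above is expected: `text-embedding-3-small` returns 1536-dimensional vectors by default (as the `print` shows), while the Pinecone index was created with 512 dimensions. For the `-3` models the output size can be shrunk by passing `dimensions=512` to `OpenAIEmbeddings`. A hedged sketch of a pre-upsert guard (the function name is illustrative; 512 is taken from the error message):

```python
def check_dimension(vector: list[float], index_dimension: int) -> None:
    """Fail fast before upserting a vector whose size doesn't match the index."""
    if len(vector) != index_dimension:
        raise ValueError(
            f"Vector dimension {len(vector)} does not match "
            f"the dimension of the index {index_dimension}"
        )

# A 512-d vector passes against a 512-d index; a default 1536-d one would not.
check_dimension([0.0] * 512, 512)
```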