🤖
Hey there, @nyck33! Great to see you diving into the depths of LangChain once again. 🚀
Based on the information you've provided, it seems like the `OpenAIEmbeddings` class is not recognizing the model you've specified in your environment variables. This is likely due to the way the `tiktoken` library is used in the `OpenAIEmbeddings` class. If the model name is not found in the `tiktoken` library, it defaults to using the "cl100k_base" encoding.

This is evident in the following code snippet from the `OpenAIEmbeddings` class:
```python
try:
    encoding = tiktoken.encoding_for_model(model_name)
except KeyError:
    logger.warning("Warning: model not found. Using cl100k_base encoding.")
    model = "cl100k_base"
    encoding = tiktoken.get_encoding(model)
```
This means that if the model you specified in your environment variables is not recognized by `tiktoken`, it will default to "cl100k_base".

To resolve this issue, you can either use a model name that is recognized by `tiktoken`, or modify the code to handle unrecognized models in a way that suits your needs. For example, you could raise an error instead of defaulting to "cl100k_base", or you could register new models with `tiktoken`.
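To illustrate the fallback, here is a minimal standalone sketch of the lookup `tiktoken` performs (the model names are just examples; "text-embedding-3-small" only resolves on a sufficiently recent `tiktoken`):

```python
import tiktoken

# encoding_for_model raises KeyError for model names tiktoken does not know;
# that KeyError is what makes LangChain fall back to "cl100k_base".
for name in ("text-embedding-3-small", "not-a-real-model"):
    try:
        print(name, "->", tiktoken.encoding_for_model(name).name)
    except KeyError:
        print(name, "-> not registered, would fall back to cl100k_base")
```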
Regarding the `tiktoken` library, it is used in the `OpenAIEmbeddings` class in LangChain to handle tokenization of the input text when generating embeddings. This is particularly important when dealing with long inputs that exceed the maximum context length that the OpenAI API can handle.
In the `_get_len_safe_embeddings` method, if the `tiktoken_enabled` flag is set to `True`, the `tiktoken` library is used to tokenize the input text into chunks that respect the `embedding_ctx_length` (the maximum number of tokens to embed at once). However, if the `tiktoken_enabled` flag is set to `False`, the HuggingFace `AutoTokenizer` is used for tokenization instead.
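For intuition, here is a minimal sketch of that token-chunking idea (a simplification, not LangChain's exact implementation; the function name and defaults are hypothetical):

```python
import tiktoken

def chunk_tokens(text: str, embedding_ctx_length: int = 8191,
                 encoding_name: str = "cl100k_base") -> list[list[int]]:
    """Split text into token windows of at most embedding_ctx_length tokens."""
    tokens = tiktoken.get_encoding(encoding_name).encode(text)
    # Each window is embedded separately; the per-chunk vectors are then
    # combined (LangChain uses a token-count-weighted average) into one result.
    return [
        tokens[i : i + embedding_ctx_length]
        for i in range(0, len(tokens), embedding_ctx_length)
    ]
```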
I hope this helps clarify the issue. If you have any further questions, feel free to ask.
So the model being mentioned in your response has to do with tokenization and not the embeddings? I was concerned that it was not able to make the embedding vectors using the model and dimensions I chose:

```
EMBEDDING_DIMENSION=256
EMBEDDING_MODEL=text-embedding-3-large
```

but I guess it is doing that part okay. What are the differences between the tokenization models? You mention `tiktoken`, but what is that?
Try:

```
pip install -U langchain tiktoken
```

From my perspective, the problem may be related to the fact that "text-embedding-3-small" and "text-embedding-3-large" were not registered in `tiktoken`, so the package couldn't find the encoding for these embedding models and used "cl100k_base" by default.

After updating, you may find the following in this file:
```python
MODEL_TO_ENCODING: dict[str, str] = {
    # chat
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "gpt-3.5": "cl100k_base",  # Common shorthand
    "gpt-35-turbo": "cl100k_base",  # Azure deployment name
    # base
    "davinci-002": "cl100k_base",
    "babbage-002": "cl100k_base",
    # embeddings
    "text-embedding-ada-002": "cl100k_base",
    "text-embedding-3-small": "cl100k_base",
    "text-embedding-3-large": "cl100k_base",
    # DEPRECATED MODELS
    # text (DEPRECATED)
    ...
```
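If the warning persists after upgrading, it may be worth confirming which `tiktoken` version is actually installed and that the new names resolve (a quick check; the text-embedding-3 entries were added in a 2024 release of `tiktoken`):

```python
from importlib.metadata import version
import tiktoken

print(version("tiktoken"))
# With a recent tiktoken, the new names resolve without the warning:
print(tiktoken.encoding_for_model("text-embedding-3-large").name)  # cl100k_base
```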
The issue still persists @oliverwang15
Lib versions:
- langchain: 0.1.14
- langchain-openai: 0.1.1
- tiktoken: 0.6.0
- openai: 1.16.1
- pinecone: 3.2.2
My code:

```python
from langchain_openai import OpenAIEmbeddings
from pinecone import Pinecone

pc = Pinecone()
old_index = pc.Index(INDEX_NAME)  # INDEX_NAME defined elsewhere

embed = OpenAIEmbeddings(model="text-embedding-3-small")
vectors = embed.embed_query("hello")
print(len(vectors))

old_index.upsert(vectors=[
    {"id": "A", "values": vectors},
])
```
printing: 1536

That's why my old_index shows a mismatch in dimension:

```
{"code":3,"message":"Vector dimension 1536 does not match the dimension of the index 512",
```
Checked other resources

Example Code

and in my .env

Error Message and Stack Trace (if applicable)

it's a warning.

Description

I want it to use the model I designated. Can I change the default in base.py?

I can't believe the results are actually correct, but this is a tiny, tiny children's book, so it could have been a fluke.

System Info