Closed nithinreddyyyyyy closed 9 months ago
🤖
Hey @nithinreddyyyyyy! Great to see you back here bringing another interesting challenge to the table. Hope you've been doing well since our last encounter!
Based on your request, you want to handle cases where a single row exceeds the OpenAI embeddings limit by splitting that row and appending it back while returning the answer, without using the RecursiveCharacterTextSplitter.
You can achieve this by modifying the `_get_len_safe_embeddings` method in the `OpenAIEmbeddings` class. This method is responsible for generating length-safe embeddings for a list of texts. It handles tokenization and embedding generation, respecting the set embedding context length and chunk size.

Here's a modified version of the `_get_len_safe_embeddings` method that splits the text into smaller parts if it exceeds the OpenAI embeddings limit, embeds each part separately, and then concatenates the embeddings:
def _get_len_safe_embeddings(
    self, texts: List[str], *, engine: str, chunk_size: Optional[int] = None
) -> List[List[float]]:
    """
    Generate length-safe embeddings for a list of texts.

    This method handles tokenization and embedding generation, respecting the
    set embedding context length and chunk size. It supports both tiktoken
    and HuggingFace tokenizer based on the tiktoken_enabled flag.

    Args:
        texts (List[str]): A list of texts to embed.
        engine (str): The engine or model to use for embeddings.
        chunk_size (Optional[int]): The size of chunks for processing embeddings.

    Returns:
        List[List[float]]: A list of embeddings for each input text.
    """
    tokens = []
    indices = []
    model_name = self.tiktoken_model_name or self.model
    _chunk_size = chunk_size or self.chunk_size

    # If the tiktoken flag is set to False, tokenize with HuggingFace transformers
    if not self.tiktoken_enabled:
        try:
            from transformers import AutoTokenizer
        except ImportError:
            raise ValueError(
                "Could not import transformers python package. "
                "This is needed in order to use OpenAIEmbeddings without "
                "`tiktoken`. Please install it with `pip install transformers`."
            )

        tokenizer = AutoTokenizer.from_pretrained(
            pretrained_model_name_or_path=model_name
        )
        for i, text in enumerate(texts):
            # Tokenize the text using HuggingFace transformers
            tokenized = tokenizer.encode(text, add_special_tokens=False)

            # Split tokens into chunks respecting the embedding_ctx_length
            for j in range(0, len(tokenized), self.embedding_ctx_length):
                token_chunk = tokenized[j : j + self.embedding_ctx_length]
                tokens.append(token_chunk)
                indices.append(i)

    # Embed each chunk separately
    batched_embeddings = []
    for i in range(0, len(tokens), _chunk_size):
        token_batch = tokens[i : i + _chunk_size]
        response = embed_with_retry(
            self,
            inputs=token_batch,
            **self._invocation_params,
        )
        if not isinstance(response, dict):
            response = response.dict()
        batched_embeddings.extend(r["embedding"] for r in response["data"])

    # Concatenate the chunk embeddings for each text
    embeddings: List[List[float]] = [[] for _ in range(len(texts))]
    for i in range(len(indices)):
        embeddings[indices[i]].extend(batched_embeddings[i])

    return embeddings
This modification will ensure that if a single row exceeds the OpenAI embeddings limit, it will be split into smaller parts, each part will be embedded separately, and the embeddings will be concatenated to form the final embedding for the row.
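Note that concatenating chunk embeddings produces vectors of different lengths for different rows, which a fixed-dimension index such as FAISS cannot store. If you need every row's final vector to keep the model's native dimensionality, one alternative is to average the chunk embeddings, weighted by chunk length. The sketch below is only an illustration of that idea (not the exact upstream implementation); it reuses the `texts`, `tokens`, `indices`, and `batched_embeddings` variables from the method above and would replace its final concatenation loop:

import numpy as np

# Sketch only: average each text's chunk embeddings (weighted by how many
# tokens each chunk contains) instead of concatenating them, so every text
# ends up with the model's native embedding dimension.
embeddings: List[List[float]] = []
for i in range(len(texts)):
    chunk_vectors = [batched_embeddings[j] for j in range(len(indices)) if indices[j] == i]
    chunk_weights = [len(tokens[j]) for j in range(len(indices)) if indices[j] == i]
    averaged = np.average(chunk_vectors, axis=0, weights=chunk_weights)
    # Normalize to unit length before returning
    embeddings.append((averaged / np.linalg.norm(averaged)).tolist())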
Please note that the maximum input context length that OpenAI embeddings can handle in the LangChain framework is 8191 tokens. This is specified in the OpenAIEmbeddings class in the 'embedding_ctx_length' attribute.
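If you only want to adjust those limits rather than edit the method, both values are constructor arguments on `OpenAIEmbeddings`; a minimal sketch (the values shown are the defaults, and the API key is a placeholder):

from langchain_community.embeddings.openai import OpenAIEmbeddings

# Sketch only: embedding_ctx_length caps tokens per chunk, chunk_size caps
# how many chunks are sent to the API in one request.
openai_embeddings = OpenAIEmbeddings(
    openai_api_key="your-api-key",
    model="text-embedding-ada-002",
    embedding_ctx_length=8191,
    chunk_size=1000,
)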
I hope this helps! If you have any further questions, feel free to ask.
Can you help me in applying the above code to the code which I gave?
🤖
Sure, I can help you integrate the provided code into your existing code. The idea is to use the `_get_len_safe_embeddings` method to generate embeddings for your documents, and then use these embeddings to create your FAISS index. Here's how you can do it:
First, you need to import the necessary modules and classes:
from langchain_community.embeddings.openai import OpenAIEmbeddings
from langchain_community.document_loaders import CSVLoader
from langchain_community.vectorstores import FAISS
Next, you need to load your CSV files:
# List of file paths for your CSV files
csv_files = ['1.csv']
# Iterate over the file paths and create a loader for each file
loaders = [CSVLoader(file_path=file_path, encoding="utf-8") for file_path in csv_files]
# Optional: If you need to combine the data from all loaders
documents = []
for loader in loaders:
    data = loader.load()  # or however you retrieve data from the loader
    documents.extend(data)
Then, you need to instantiate the `OpenAIEmbeddings` class and use the `_get_len_safe_embeddings` method to generate embeddings for your documents:
# Instantiate the OpenAIEmbeddings class
openai = OpenAIEmbeddings(openai_api_key="your-api-key")
# Generate embeddings for your documents
embeddings = openai._get_len_safe_embeddings([doc.page_content for doc in documents], engine="davinci")
Finally, you can use these embeddings to create your FAISS index and retriever:
# Create a FAISS vector store from the embeddings
vectorStore = FAISS.from_embeddings(embeddings)
# Create a retriever for the vector database
retriever = vectorStore.as_retriever(search_kwargs={"k": 5})
docs = retriever.get_relevant_documents("can you return the details of banpu company hrdd?")
Please replace `"your-api-key"` with your actual OpenAI API key. Also, replace `"davinci"` with the engine you want to use for generating embeddings.
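For instance, with OpenAI's standard embedding model, the call might look like this (a sketch only; `"text-embedding-ada-002"` is an assumption about which model you want to use):

# For example (assuming the standard OpenAI embedding model):
embeddings = openai._get_len_safe_embeddings(
    [doc.page_content for doc in documents],
    engine="text-embedding-ada-002",
)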
I hope this helps! If you have any further questions, feel free to ask.
I tried using the above code and made some changes; below is the updated code:
# Instantiate the OpenAIEmbeddings class
openai = OpenAIEmbeddings(openai_api_key="")
# Generate embeddings for your documents
embeddings = openai._get_len_safe_embeddings([doc.page_content for doc in documents], engine="text-embedding-ada-002")
# Create a FAISS vector store from the embeddings
vectorStore = FAISS.from_embeddings(embeddings)
# Create a retriever for the vector database
retriever = vectorStore.as_retriever(search_kwargs={"k": 5})
docs = retriever.get_relevant_documents("can you return the details of banpu company hrdd?")
Below is the error from the above code:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-62-e1d306e4256b> in <cell line: 8>()
6
7 # Create a FAISS vector store from the embeddings
----> 8 vectorStore = FAISS.from_embeddings(embeddings)
9
10 # Create a retriever for the vector database
TypeError: FAISS.from_embeddings() missing 1 required positional argument: 'embedding'
Can you assist me with this? A resolved version of the code would be helpful.
🤖
The error you're encountering is due to the fact that the `FAISS.from_embeddings()` method expects two arguments: `text_embeddings` and `embedding`. The `text_embeddings` argument should be an iterable of tuples, where each tuple contains a string (the text) and a list of floats (the corresponding embedding). The `embedding` argument should be an instance of the `Embeddings` class.

In your code, you're only providing the `embeddings` argument, which is causing the error. To fix this, you need to provide both arguments. Here's how you can modify your code:

# Instantiate the OpenAIEmbeddings class
openai = OpenAIEmbeddings(openai_api_key="")
# Generate embeddings for your documents
embeddings = openai._get_len_safe_embeddings([doc.page_content for doc in documents], engine="text-embedding-ada-002")
# Create tuples of text and corresponding embedding
text_embeddings = list(zip([doc.page_content for doc in documents], embeddings))
# Create a FAISS vector store from the embeddings
vectorStore = FAISS.from_embeddings(text_embeddings, openai)
# Create a retriever for the vector database
retriever = vectorStore.as_retriever(search_kwargs={"k": 5})
docs = retriever.get_relevant_documents("can you return the details of banpu company hrdd?")

In this modified code, I'm creating a list of tuples `text_embeddings` where each tuple contains a text from your documents and its corresponding embedding. Then, I'm passing this list along with the `openai` instance to the `FAISS.from_embeddings()` method.

I hope this helps! If you have any further questions, feel free to ask.
Shall I use the `_get_len_safe_embeddings` function which you modified, or will the one below work?

embeddings = openai._get_len_safe_embeddings([doc.page_content for doc in documents], engine="text-embedding-ada-002")

That is, calling the function directly from the OpenAI embeddings code itself. Which one works, or are both the same?
🤖
Based on the error message you provided, it seems like there's a misunderstanding about the method you're trying to use. The error message indicates that `FAISS.from_embeddings()` is missing a required positional argument: 'embedding'. However, based on the context provided, it appears that the LangChain codebase does not have a `FAISS.from_embeddings()` method.

Instead, the LangChain codebase has a `FAISS.from_texts()` method. This method requires two parameters: `texts` and `embeddings`. The `texts` parameter is a list of strings that will be used to create the FAISS index. The `embeddings` parameter is an instance of an embeddings class that is used to convert the texts into vector representations. An optional parameter `metadatas` can also be provided; this is a list of dictionaries containing metadata for each text.
Here's how you might use it:
# Assuming `texts` is your list of documents and `embeddings` is your embeddings instance
vectorStore = FAISS.from_texts(texts, embeddings)
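If you also want to keep the per-row metadata (source file, row number) that `CSVLoader` attaches to each `Document`, a minimal sketch of passing it through the optional `metadatas` parameter could look like this (it assumes the `documents` list and the `openai` embeddings instance from the earlier snippets in this thread):

# Sketch only: build texts and metadatas from the loaded documents,
# then let FAISS.from_texts() embed and index them in one call.
texts = [doc.page_content for doc in documents]
metadatas = [doc.metadata for doc in documents]

vectorStore = FAISS.from_texts(texts, openai, metadatas=metadatas)
retriever = vectorStore.as_retriever(search_kwargs={"k": 5})
docs = retriever.get_relevant_documents("can you return the details of banpu company hrdd?")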
If you're trying to create a FAISS vector store from embeddings directly, you might need to check the documentation or source code of the FAISS library to see if there's a method that suits your needs. If such a method doesn't exist, you might need to create the FAISS index manually from your embeddings.
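For reference, a hedged sketch of building the index manually from precomputed vectors is below. It assumes the raw `faiss` package is installed and reuses the `documents`, `embeddings` (list of vectors), and `openai` names from the earlier snippets, and it only works if every vector has the same length (for example, if chunk embeddings were averaged rather than concatenated):

# Sketch only: wrap precomputed vectors in a LangChain FAISS store by hand.
import faiss
import numpy as np
from langchain_community.docstore.in_memory import InMemoryDocstore

vectors = np.array(embeddings, dtype="float32")  # shape: (num_docs, dim)
index = faiss.IndexFlatL2(vectors.shape[1])      # exact L2 index over that dimension
index.add(vectors)

# Map FAISS row ids back to the original Document objects
docstore = InMemoryDocstore({str(i): doc for i, doc in enumerate(documents)})
index_to_docstore_id = {i: str(i) for i in range(len(documents))}

vectorStore = FAISS(
    embedding_function=openai,  # used to embed queries at search time
    index=index,
    docstore=docstore,
    index_to_docstore_id=index_to_docstore_id,
)
retriever = vectorStore.as_retriever(search_kwargs={"k": 5})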
If you're still encountering issues or if this doesn't answer your question, could you please provide more details about what you're trying to achieve and the exact code that's causing the error? This will help me provide a more accurate solution.
Issue with current documentation:
Below is the code which will load a CSV, then it'll be loaded into FAISS and will try to get the relevant documents. It's not using RecursiveCharacterTextSplitter for chunking, as the data is already chunked manually. Below is the code:
print(documents[0])
The output is below:

Document(page_content=": 1\nUnnamed: 0: 1\nText: Human Rights Guiding Principles\n We commit to respect internationally recognized human rights as expressed in International Bill of Human Rights meaning \n the Universal Declaration of Human Rights, the International Covenant87543\nx2: 1548.48193973303\ny2: 899.030945822597\nBlock Type: LAYOUT_TEXT\nBlock ID: 54429a7486164c04b859d0a08ac75d54\npage_num: 2\nis_answer: 0", metadata={'source': '1.csv', 'row': 1})
followed by
I want to handle cases where a single row exceeds the OpenAI embeddings limit by splitting that row and appending it back while returning the answer, because I'm not using RecursiveCharacterTextSplitter. Can you write code for me? I just want code like the below
Idea or request for content:
No response