Closed nithinreddyyyyyy closed 9 months ago
🤖
Hey @nithinreddyyyyyy! Great to see you diving into another interesting challenge. How have you been?
Based on your requirements, you can use the `embed_documents` method from the `OpenAIEmbeddings` class to generate embeddings for your manually chunked data. This method handles tokenization and embedding generation, respecting the configured embedding context length and chunk size. It also supports both the tiktoken and HuggingFace tokenizers, depending on the `tiktoken_enabled` flag.
Here's how you can use it:

```python
from langchain_community.embeddings import OpenAIEmbeddings

# Initialize OpenAIEmbeddings
openai = OpenAIEmbeddings(openai_api_key="your-api-key")

# Assuming 'documents' is your list of manually chunked data
embeddings = openai.embed_documents(documents)
```
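If you need to control the chunking behaviour mentioned above, those settings are constructor parameters. An illustrative sketch — the values shown match the library defaults at the time of writing, not recommendations:

```python
from langchain_community.embeddings import OpenAIEmbeddings

openai = OpenAIEmbeddings(
    openai_api_key="your-api-key",
    embedding_ctx_length=8191,  # max tokens per chunk sent to the API
    chunk_size=1000,            # number of texts per batched API request
    tiktoken_enabled=True,      # set False to fall back to a HuggingFace tokenizer
)
```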
The `embed_documents` method internally calls the `_get_len_safe_embeddings` method, which handles cases where a single row exceeds the OpenAI embedding limit. It splits the tokens into chunks that respect `embedding_ctx_length` and processes each chunk separately. The embeddings for the chunks are then averaged to produce the final embedding for the text.
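For intuition, here is a minimal sketch of that chunk-and-average idea. This is not the library's actual implementation, and `embed_fn` is a hypothetical callable that embeds a single string:

```python
import numpy as np
import tiktoken

def len_safe_embedding_sketch(text, embed_fn, ctx_length=8191):
    """Embed a long text by windowing its tokens and averaging the results."""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text.replace("\n", " "))
    # Split the token sequence into windows of at most ctx_length tokens
    windows = [tokens[i:i + ctx_length] for i in range(0, len(tokens), ctx_length)]
    # Embed each window separately; embed_fn: str -> list[float] (hypothetical)
    chunk_embeddings = [embed_fn(encoding.decode(w)) for w in windows]
    # Weighted average by window length, then L2-normalize
    avg = np.average(chunk_embeddings, axis=0, weights=[len(w) for w in windows])
    return (avg / np.linalg.norm(avg)).tolist()
```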
Once you have the embeddings, you can use the `FAISS.from_vectors` method to create a FAISS instance:

```python
vectorStore = FAISS.from_vectors(embeddings, metadatas=[doc.metadata for doc in documents])
```
Here, `metadatas` is a list of metadata dicts, one per document. You can adjust this according to your needs.
Finally, you can create a retriever and get relevant documents as follows:

```python
# Create a retriever for the vector database
retriever = vectorStore.as_retriever(search_kwargs={"k": 5})
docs = retriever.get_relevant_documents("can you return the details of banpu company hrdd?")
```
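`get_relevant_documents` returns `Document` objects, so once the store is built you can inspect the hits and the metadata stored at index time — an illustrative snippet:

```python
# Each hit is a Document carrying the metadata it was indexed with
for doc in docs:
    print(doc.metadata, doc.page_content[:80])
```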
Please note that you need to replace `"your-api-key"` with your actual OpenAI API key. Also, ensure that the `documents` variable contains your manually chunked data.
For more details, you can refer to the source code of the OpenAIEmbeddings class and the FAISS class in the LangChain repository.
I hope this helps! If you have any further questions, feel free to ask.
Sources
- libs/community/langchain_community/embeddings/openai.py
- libs/community/tests/unit_tests/vectorstores/test_faiss.py
About Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
It has returned the below issue, can you please look into it?
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-791eb5000c7b> in <cell line: 7>()
      5
      6 # Assuming 'documents' is your list of manually chunked data
----> 7 embeddings = openai.embed_documents(documents)
      8
      9 vectorStore = FAISS.from_vectors(embeddings, metadatas=[doc.metadata for doc in documents])

2 frames
/usr/local/lib/python3.10/dist-packages/langchain_community/embeddings/openai.py in embed_documents(self, texts, chunk_size)
    666         # than the maximum context and use length-safe embedding function.
    667         engine = cast(str, self.deployment)
--> 668         return self._get_len_safe_embeddings(texts, engine=engine)
    669
    670     async def aembed_documents(

/usr/local/lib/python3.10/dist-packages/langchain_community/embeddings/openai.py in _get_len_safe_embeddings(self, texts, engine, chunk_size)
    469                 text = text.replace("\n", " ")
    470
--> 471                 token = encoding.encode(
    472                     text=text,
    473                     allowed_special=self.allowed_special,

/usr/local/lib/python3.10/dist-packages/tiktoken/core.py in encode(self, text, allowed_special, disallowed_special)
    114         if not isinstance(disallowed_special, frozenset):
    115             disallowed_special = frozenset(disallowed_special)
--> 116         if match := _special_token_regex(disallowed_special).search(text):
    117             raise_disallowed_special_token(match.group())
    118

TypeError: expected string or buffer
```
Below is the format of `documents[0]`:

```python
Document(page_content=': 0\nUnnamed: 0: 0\nText: AMD L\nx1: 1316.616076231\ny1: 113.813292980194\nx2: 1579.33993190527\ny2: 170.601672306657\nBlock Type: LAYOUT_FIGURE\nBlock ID: 2a2a4e4ccdb04e41988882bd6c4512c0\npage_num: 2\nis_answer: 0', metadata={'source': '1.csv', 'row': 0})
```
Can you please help me resolve this?
🤖
The error you're encountering is due to the fact that the `embed_documents` method of the `OpenAIEmbeddings` class expects a list of strings as input, but you're providing a list of `Document` objects. The `embed_documents` method tries to apply the `replace` method to each element in the list, which works for strings but not for `Document` objects, hence the `TypeError`.
To resolve this issue, you need to extract the `page_content` from each `Document` object and pass that to the `embed_documents` method. Here's how you can do it:
```python
# Extract page_content from each Document object
texts = [doc.page_content for doc in documents]

# Generate embeddings
embeddings = openai.embed_documents(texts)
```
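As a quick sanity check before indexing, each text should now map to exactly one fixed-length vector — a small illustrative check, assuming the variables above:

```python
# One embedding per chunk; 1536 dimensions for text-embedding-ada-002
assert len(embeddings) == len(texts)
print(len(embeddings), len(embeddings[0]))
```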
Then, you can proceed with creating the FAISS instance and the retriever as you've done in your code.
This solution is based on a similar issue found in the LangChain repository: *AttributeError: `Document` object has no attribute `replace`*. The user in that issue encountered a similar problem and resolved it by extracting the `page_content` from each `Document` object before passing the list to the `embed_documents` method.
I hope this helps! If you encounter any further issues, please let me know.
Below is the updated code:
```python
from langchain_community.embeddings import OpenAIEmbeddings

# Initialize OpenAIEmbeddings
openai = OpenAIEmbeddings(openai_api_key="sk-")

# Extract page_content from each Document object
texts = [doc.page_content for doc in documents]

# Generate embeddings
embeddings = openai.embed_documents(texts)

vectorStore = FAISS.from_vectors(embeddings, metadatas=[doc.metadata for doc in documents])

# Create a retriever for the vector database
retriever = vectorStore.as_retriever(search_kwargs={"k": 5})
docs = retriever.get_relevant_documents("data related to hrdd")
```
and the error is below:
```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-41-4245045a3e1c> in <cell line: 12>()
     10 embeddings = openai.embed_documents(texts)
     11
---> 12 vectorStore = FAISS.from_vectors(embeddings, metadatas=[doc.metadata for doc in documents])
     13
     14 # Create a retriever for the vector database

AttributeError: type object 'FAISS' has no attribute 'from_vectors'
```
Can you please help me resolve this error?
🤖
Based on the context provided, it seems like you're encountering two main issues. Let's address them one by one.
First, the `TypeError`: the `OpenAIEmbeddings` class expects a list of strings as input, but you're providing it with a list of `Document` objects. You can resolve this by extracting the `page_content` from each `Document` object and passing that to the `embed_documents` method. Here's how you can do it:

```python
# Extract page_content from each Document object
docs_as_str = [doc.page_content for doc in documents]

# Generate the embeddings
embeddings = OpenAIEmbeddings()
embedded_docs = embeddings.embed_documents(docs_as_str)
```
Second, the `AttributeError`: the `FAISS` class in LangChain does not have a `from_vectors` method. Instead, you should use the `from_documents` method to create a `FAISS` instance from a list of documents. Here's how you can do it:

```python
# Initialize FAISS; from_documents takes the Document objects themselves
# and embeds their page_content internally
db = FAISS.from_documents(documents, embeddings)
```
In this code, `documents` is your list of `Document` objects and `embeddings` is an instance of the `OpenAIEmbeddings` class; `from_documents` extracts each document's `page_content`, embeds it, and stores the metadata alongside the vectors.
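Note that `from_documents` re-embeds the texts itself, so the `embedded_docs` computed above would go unused. If you'd rather reuse those precomputed vectors, the `langchain_community` FAISS class also exposes a `from_embeddings` constructor that accepts `(text, vector)` pairs. A minimal sketch, assuming the `docs_as_str`, `embedded_docs`, `documents`, and `embeddings` variables from the snippets above:

```python
from langchain_community.vectorstores import FAISS

# Pair each chunk's text with its precomputed embedding vector
text_embedding_pairs = list(zip(docs_as_str, embedded_docs))

db = FAISS.from_embeddings(
    text_embeddings=text_embedding_pairs,
    embedding=embeddings,  # still used to embed queries at search time
    metadatas=[doc.metadata for doc in documents],
)

retriever = db.as_retriever(search_kwargs={"k": 5})
docs = retriever.get_relevant_documents("data related to hrdd")
```

This avoids a second round of embedding calls while keeping each row's metadata attached to its vector.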
I hope this helps! If you have any other questions or run into any more issues, feel free to ask.
For more information, you can refer to the following source:
- AttributeError: `Document` object has no attribute `replace`
Issue with current documentation:
Below is the code which loads a CSV file and creates a variable `documents`.
Now, for `documents[1]`, below is the output:

```python
Document(page_content=": 1\nUnnamed: 0: 1\nText: Human Rights Guiding Principles\n We commit to respect internationally recognized human rights as expressed in International Bill of Human Rights meaning \n the Universal Declaration of Human Rights, the International Covenant on Civil and Political Rights and the International \n Covenant on Economic, Social and Cultural Rights, and International\nx1: 149.214813858271\ny1: 209.333904087543\nx2: 1548.48193973303\ny2: 899.030945822597\nBlock Type: LAYOUT_TEXT\nBlock ID: 54429a7486164c04b859d0a08ac75d54\npage_num: 2\nis_answer: 0", metadata={'source': '1.csv', 'row': 1})
```
The normal method of chunking data and sending it to the index is below.
Now, how do I send the `documents` data to FAISS without splitting it again, since I've already chunked the data manually? I also want to handle cases where a single row exceeds the OpenAI embeddings limit, by splitting that row and appending it back when returning the answer. Can you write code for me? I just want code like the example below.
Idea or request for content:
No response