Closed: juliuskittler closed this issue 4 months ago
🤖
Hello,
Thank you for reaching out and providing a detailed description of the issue you're facing.
Based on the context provided, it seems there might be a misunderstanding about the usage of the `FAISS.from_documents(docs, embeddings)` and `Chroma.from_documents(docs, embeddings)` methods.
In the LangChain framework, the `FAISS` class does not have a `from_documents` method. Instead, it provides two similar methods: `FAISS.from_texts(texts, embeddings)` and `FAISS.from_embeddings(text_embeddings, embeddings)`.
The `FAISS.from_texts(texts, embeddings)` method creates a FAISS vectorstore from a list of texts, while the `FAISS.from_embeddings(text_embeddings, embeddings)` method creates a FAISS vectorstore from a list of pre-computed text-embedding pairs.
Here's an example of how to use the `FAISS.from_embeddings` method:
```python
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings()

# Assuming `texts` is a list of strings and `vectors` is a list of their
# corresponding pre-computed embedding vectors
text_embeddings = list(zip(texts, vectors))
faiss = FAISS.from_embeddings(text_embeddings, embeddings)
```
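For completeness, a minimal sketch of the `FAISS.from_texts` variant (the example strings below are placeholders, not taken from your report):

```python
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings()
texts = [
    "FAISS is a library for efficient similarity search.",  # placeholder text
    "LangChain wraps FAISS as a vectorstore.",  # placeholder text
]
faiss = FAISS.from_texts(texts, embeddings)
```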
For the `Chroma` class, it does have a `from_documents` method. This method creates a Chroma vectorstore from a list of documents. Here's an example of how to use it:
```python
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Assuming docs is a list of your documents
embeddings = OpenAIEmbeddings()
chroma = Chroma.from_documents(docs, embeddings)
```
Please replace `FAISS.from_documents(docs, embeddings)` and `Chroma.from_documents(docs, embeddings)` with the correct method based on your requirements.
If you continue to face issues, please provide more details and I'll be happy to assist you further.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
@dosu-bot
Regarding FAISS
You are saying that FAISS does not have a `from_documents` method. However, in the example it does have such a method. See https://python.langchain.com/docs/integrations/vectorstores/faiss, where there is the code snippet:

```python
db = FAISS.from_documents(docs, embeddings)

query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
```
Moreover, `FAISS.from_documents` results in the printout `<bound method VectorStore.from_documents of <class 'langchain.vectorstores.faiss.FAISS'>>`.
-> Your answer does not explain why I get the error `AttributeError: data` when calling `FAISS.from_documents`.
Regarding Chroma
You are saying that there is indeed a method `Chroma.from_documents`. I have used this method in my reproducible example.
-> Your answer does not explain why I get the error `AttributeError: data` when calling `Chroma.from_documents`.
🤖
Hello,
Thank you for reaching out and providing a detailed description of your issue. The `AttributeError: data` error you're encountering when trying to use the `FAISS.from_documents(docs, embeddings)` and `Chroma.from_documents(docs, embeddings)` methods in the LangChain Python framework is likely due to the fact that the `from_documents` method is expecting a list of `Document` objects and an `Embeddings` object as arguments.
The `from_documents` method is defined in the `Chroma` class as follows:
```python
@classmethod
def from_documents(
    cls: Type[Chroma],
    documents: List[Document],
    embedding: Optional[Embeddings] = None,
    ids: Optional[List[str]] = None,
    collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
    persist_directory: Optional[str] = None,
    client_settings: Optional[chromadb.config.Settings] = None,
    client: Optional[chromadb.Client] = None,
    collection_metadata: Optional[Dict] = None,
    **kwargs: Any,
) -> Chroma:
    """Create a Chroma vectorstore from a raw documents.

    If a persist_directory is specified, the collection will be persisted there.
    Otherwise, the data will be ephemeral in-memory.
    """
    texts = [doc.page_content for doc in documents]
    metadatas = [doc.metadata for doc in documents]
    return cls.from_texts(
        texts=texts,
        embedding=embedding,
        metadatas=metadatas,
        ids=ids,
        collection_name=collection_name,
        persist_directory=persist_directory,
        client_settings=client_settings,
        client=client,
        collection_metadata=collection_metadata,
        **kwargs,
    )
```
As you can see, it's extracting the `page_content` and `metadata` from each `Document` in the `documents` list and then calling the `from_texts` method with these extracted texts and the provided `Embeddings` object.
If you're encountering an `AttributeError: data` error, it's likely because either the `documents` argument you're passing to `from_documents` is not a list of `Document` objects, or the `embedding` argument is not an `Embeddings` object.
To fix this error, ensure that the `documents` argument is a list of `Document` objects and the `embedding` argument is an `Embeddings` object. If you're not sure how to create these objects, you may need to refer to the documentation or source code of the LangChain Python framework for more information.
In the context shared, the `Document` objects are created with `page_content` and `metadata` attributes, and the `Embeddings` object is created using the `OpenAIEmbeddings` class. The `from_documents` method is then called with these `Document` objects and the `Embeddings` object as arguments. If you're following a similar process and still encountering the error, there may be an issue with the data you're passing to the `Document` or `Embeddings` constructors.
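As a rough sketch of constructing the expected inputs (the example text and metadata below are placeholders, not taken from your report):

```python
from langchain.docstore.document import Document
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Placeholder documents; each entry must be a Document with page_content (str)
# and, optionally, a metadata dict
docs = [
    Document(page_content="The president spoke about the nomination.",
             metadata={"source": "state_of_the_union.txt"}),
]

# An Embeddings object, not a list of vectors
embeddings = OpenAIEmbeddings()

chroma = Chroma.from_documents(docs, embeddings)
```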
I hope this helps! If you have any other questions or need further clarification, please don't hesitate to ask.
Best, Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
@juliuskittler I was not able to reproduce your issue when using
langchain 0.0.286, faiss-cpu 1.7.4, chromadb 0.4.8
It worked smoothly for me; not sure if that helps, but maybe my versions will solve your issue.
Alright, thank you for getting back. I am actually getting the same error when using exactly the same versions as you (see below).
langchain 0.0.286, faiss-cpu 1.7.4, chromadb 0.4.8
I have also tried different openai versions (0.28.1, 0.28.0, 0.27.9, 0.27.8, 0.27.7) but the error persists.
I will update this in case I have time to investigate further. For now, `FAISS.from_embeddings` works for me at least (and I can use it instead of `FAISS.from_documents`).
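For reference, a minimal sketch of that `FAISS.from_embeddings` workaround (assuming `docs` and `embeddings` as in the reproduction; the explicit `embed_documents` call and the zip of texts with vectors are my own illustration, not code from the report):

```python
from langchain.vectorstores import FAISS

# Embed the document texts up front, then build the index from (text, vector) pairs
texts = [doc.page_content for doc in docs]
metadatas = [doc.metadata for doc in docs]
vectors = embeddings.embed_documents(texts)

db = FAISS.from_embeddings(list(zip(texts, vectors)), embeddings, metadatas=metadatas)
retriever = db.as_retriever()
```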
The `.from_documents` method works for me only if I explicitly set `chunk_size=1`. It does not work for any chunk size larger than 1, where it will always result in the `AttributeError: data` error.
In other words, I did the following:
```python
headers = {"x-api-key": os.environ["OPENAI_API_KEY"]}
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002", headers=headers, chunk_size=1)
retriever = FAISS.from_documents(docs, embeddings).as_retriever()
```
I'm using the Azure OpenAI API. Possibly, it's related to this.
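If it helps anyone narrow this down, a minimal debugging sketch (assuming `docs` and `embeddings` as above; the try/except wrapper is my own addition) that calls the embedding model directly, to check whether the error originates in the embedding request rather than in FAISS or Chroma:

```python
# Call the embedding model directly, bypassing the vectorstore, to see
# whether the AttributeError comes from the embedding request itself.
texts = [doc.page_content for doc in docs]
try:
    vectors = embeddings.embed_documents(texts)
    print(f"Embedded {len(vectors)} texts without error")
except AttributeError as exc:
    print(f"Embedding call failed: {exc}")
```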
@juliuskittler I got the exact same error as yours; the funny thing is that only some of the documents cause this error. The direct OpenAI API won't generate the error either.
@Franckegao did you try to set `chunk_size=1` (see my previous comment)?
@juliuskittler Yes, but it would make the process super slow
Hi,
In connection to this, I understand that `FAISS.from_documents` returns a list of `Document` objects. Sorry if it's a dumb question, but how could you possibly save this locally and be able to parse it back as such? I am experimenting with how to structure my data for embeddings, so I need to test each structure locally rather than calling the OpenAI API each time, which is expensive.
Thanks
@elenmari https://python.langchain.com/docs/integrations/vectorstores/faiss:

```python
db.save_local("faiss_index")
new_db = FAISS.load_local("faiss_index", embeddings)
```
Full example code:
```python
'''
Python 3.10
pip install langchain langchain_openai sentence-transformers langchain_community
CPU:
pip install faiss-cpu
GPU:
pip install faiss-gpu
'''
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings

model = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
embeddings = HuggingFaceEmbeddings(model_name=model)


def main():
    # https://python.langchain.com/docs/integrations/vectorstores/faiss
    # Uncomment the following line if you need to initialize FAISS with no AVX2 optimization
    # os.environ['FAISS_NO_AVX2'] = '1'
    loader = TextLoader("./state_of_the_union.txt", encoding='utf-8')
    documents = loader.load()
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)

    # embeddings = OpenAIEmbeddings()
    db = FAISS.from_documents(docs, embeddings)
    db.save_local("faiss_index")
    new_db = FAISS.load_local("faiss_index", embeddings)

    query = "What did the president say about Ketanji Brown Jackson"
    docs = new_db.similarity_search_with_score(query)
    print(docs)


# Press the green button in the gutter to run the script.
if __name__ == '__main__':
    main()
```
@Franck-Dernoncourt sweet! Thank you.
hey there, brilliantly!
@Franck-Dernoncourt thank you for your example. I am trying the same with PyPDFLoader and am facing a similar issue: "BadRequestError: Unsupported data type". My code snippet:

```python
# Use langchain PDF loader
loader = PyPDFLoader(fileName)
documents = loader.load_and_split()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Use LangChain to create the embeddings using text-embedding-ada-002
db = FAISS.from_documents(docs, embedding=embeddings)
```
System Info
Platform: local development on macOS Ventura
Python version: 3.10.12
langchain version: 0.0.288
faiss version: 1.7.4
chromadb version: 0.4.10
openai version: 0.28.0
Who can help?
@hwchase17
Information
Related Components
Reproduction
Reproducible example
I tried to reproduce an example from this page: https://python.langchain.com/docs/integrations/vectorstores/faiss
The reproducible example (with path to the file https://github.com/hwchase17/chat-your-data/blob/master/state_of_the_union.txt adjusted) can be found below.
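For reference, a sketch of that documentation example (the local file path below is a placeholder; the original report used an adjusted path):

```python
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

loader = TextLoader("./state_of_the_union.txt")  # placeholder path
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(docs, embeddings)  # this call raises AttributeError: data
```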
Error
The problem is that I get an `AttributeError: data` error for both `db = FAISS.from_documents(docs, embeddings)` and `db = Chroma.from_documents(docs, embeddings)`.
The traceback is as follows:
Expected behavior
The function should complete without an error.