FAISS.from_documents(docs, embeddings) and Chroma.from_documents(docs, embeddings) result in `AttributeError: data`.

juliuskittler commented 1 year ago

System Info

Platform: local development on macOS Ventura
Python version: 3.10.12

langchain.version: 0.0.288
faiss.version: 1.7.4
chromadb.version: 0.4.10
openai.version: 0.28.0

Who can help?

@hwchase17

Information

[X] The official example notebooks/scripts
[ ] My own modified scripts

Related Components

[ ] LLMs/Chat Models
[X] Embedding Models
[ ] Prompts / Prompt Templates / Prompt Selectors
[ ] Output Parsers
[ ] Document Loaders
[X] Vector Stores / Retrievers
[ ] Memory
[ ] Agents / Agent Executors
[ ] Tools / Toolkits
[ ] Chains
[ ] Callbacks/Tracing
[ ] Async

Reproduction

Reproducible example

I tried to reproduce an example from this page: https://python.langchain.com/docs/integrations/vectorstores/faiss

The reproducible example (with path to the file https://github.com/hwchase17/chat-your-data/blob/master/state_of_the_union.txt adjusted) can be found below.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader
import os

# Get documents
loader = TextLoader("../src/data/raw_files/state_of_the_union.txt") # path adjusted
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Prepare embedding function
headers = {"x-api-key": os.environ["OPENAI_API_KEY"]}
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002", headers=headers)

# Try to get vectordb with FAISS
db = FAISS.from_documents(docs, embeddings)

# Try to get vectordb with Chroma
db = Chroma.from_documents(docs, embeddings)

Error

The problem is that I get an AttributeError: data error for both db = FAISS.from_documents(docs, embeddings) and db = Chroma.from_documents(docs, embeddings).

The traceback is as follows:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/mambaforge/envs/streamlit-chatbot/lib/python3.10/site-packages/openai/openai_object.py:59, in OpenAIObject.__getattr__(self, k)
     58 try:
---> 59     return self[k]
     60 except KeyError as err:

KeyError: 'data'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
Cell In[14], line 1
----> 1 db = Chroma.from_documents(docs, embeddings)

File ~/mambaforge/envs/streamlit-chatbot/lib/python3.10/site-packages/langchain/vectorstores/chroma.py:637, in Chroma.from_documents(cls, documents, embedding, ids, collection_name, persist_directory, client_settings, client, collection_metadata, **kwargs)
    635 texts = [doc.page_content for doc in documents]
    636 metadatas = [doc.metadata for doc in documents]
--> 637 return cls.from_texts(
    638     texts=texts,
    639     embedding=embedding,
    640     metadatas=metadatas,
    641     ids=ids,
    642     collection_name=collection_name,
    643     persist_directory=persist_directory,
    644     client_settings=client_settings,
    645     client=client,
    646     collection_metadata=collection_metadata,
    647     **kwargs,
    648 )

File ~/mambaforge/envs/streamlit-chatbot/lib/python3.10/site-packages/langchain/vectorstores/chroma.py:601, in Chroma.from_texts(cls, texts, embedding, metadatas, ids, collection_name, persist_directory, client_settings, client, collection_metadata, **kwargs)
    573 """Create a Chroma vectorstore from a raw documents.
    574 
    575 If a persist_directory is specified, the collection will be persisted there.
   (...)
    590     Chroma: Chroma vectorstore.
    591 """
    592 chroma_collection = cls(
    593     collection_name=collection_name,
    594     embedding_function=embedding,
   (...)
    599     **kwargs,
    600 )
--> 601 chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
    602 return chroma_collection

File ~/mambaforge/envs/streamlit-chatbot/lib/python3.10/site-packages/langchain/vectorstores/chroma.py:188, in Chroma.add_texts(self, texts, metadatas, ids, **kwargs)
    186 texts = list(texts)
    187 if self._embedding_function is not None:
--> 188     embeddings = self._embedding_function.embed_documents(texts)
    189 if metadatas:
    190     # fill metadatas with empty dicts if somebody
    191     # did not specify metadata for all texts
    192     length_diff = len(texts) - len(metadatas)

File ~/mambaforge/envs/streamlit-chatbot/lib/python3.10/site-packages/langchain/embeddings/openai.py:483, in OpenAIEmbeddings.embed_documents(self, texts, chunk_size)
    471 """Call out to OpenAI's embedding endpoint for embedding search docs.
    472 
    473 Args:
   (...)
    479     List of embeddings, one for each text.
    480 """
    481 # NOTE: to keep things simple, we assume the list may contain texts longer
    482 #       than the maximum context and use length-safe embedding function.
--> 483 return self._get_len_safe_embeddings(texts, engine=self.deployment)

File ~/mambaforge/envs/streamlit-chatbot/lib/python3.10/site-packages/langchain/embeddings/openai.py:367, in OpenAIEmbeddings._get_len_safe_embeddings(self, texts, engine, chunk_size)
    364     _iter = range(0, len(tokens), _chunk_size)
    366 for i in _iter:
--> 367     response = embed_with_retry(
    368         self,
    369         input=tokens[i : i + _chunk_size],
    370         **self._invocation_params,
    371     )
    372     batched_embeddings.extend(r["embedding"] for r in response["data"])
    374 results: List[List[List[float]]] = [[] for _ in range(len(texts))]

File ~/mambaforge/envs/streamlit-chatbot/lib/python3.10/site-packages/langchain/embeddings/openai.py:107, in embed_with_retry(embeddings, **kwargs)
    104     response = embeddings.client.create(**kwargs)
    105     return _check_response(response, skip_empty=embeddings.skip_empty)
--> 107 return _embed_with_retry(**kwargs)

File ~/mambaforge/envs/streamlit-chatbot/lib/python3.10/site-packages/tenacity/__init__.py:289, in BaseRetrying.wraps.<locals>.wrapped_f(*args, **kw)
    287 @functools.wraps(f)
    288 def wrapped_f(*args: t.Any, **kw: t.Any) -> t.Any:
--> 289     return self(f, *args, **kw)

File ~/mambaforge/envs/streamlit-chatbot/lib/python3.10/site-packages/tenacity/__init__.py:379, in Retrying.__call__(self, fn, *args, **kwargs)
    377 retry_state = RetryCallState(retry_object=self, fn=fn, args=args, kwargs=kwargs)
    378 while True:
--> 379     do = self.iter(retry_state=retry_state)
    380     if isinstance(do, DoAttempt):
    381         try:

File ~/mambaforge/envs/streamlit-chatbot/lib/python3.10/site-packages/tenacity/__init__.py:314, in BaseRetrying.iter(self, retry_state)
    312 is_explicit_retry = fut.failed and isinstance(fut.exception(), TryAgain)
    313 if not (is_explicit_retry or self.retry(retry_state)):
--> 314     return fut.result()
    316 if self.after is not None:
    317     self.after(retry_state)

File ~/mambaforge/envs/streamlit-chatbot/lib/python3.10/concurrent/futures/_base.py:451, in Future.result(self, timeout)
    449     raise CancelledError()
    450 elif self._state == FINISHED:
--> 451     return self.__get_result()
    453 self._condition.wait(timeout)
    455 if self._state in [CANCELLED, CANCELLED_AND_NOTIFIED]:

File ~/mambaforge/envs/streamlit-chatbot/lib/python3.10/concurrent/futures/_base.py:403, in Future.__get_result(self)
    401 if self._exception:
    402     try:
--> 403         raise self._exception
    404     finally:
    405         # Break a reference cycle with the exception in self._exception
    406         self = None

File ~/mambaforge/envs/streamlit-chatbot/lib/python3.10/site-packages/tenacity/__init__.py:382, in Retrying.__call__(self, fn, *args, **kwargs)
    380 if isinstance(do, DoAttempt):
    381     try:
--> 382         result = fn(*args, **kwargs)
    383     except BaseException:  # noqa: B902
    384         retry_state.set_exception(sys.exc_info())  # type: ignore[arg-type]

File ~/mambaforge/envs/streamlit-chatbot/lib/python3.10/site-packages/langchain/embeddings/openai.py:104, in embed_with_retry.<locals>._embed_with_retry(**kwargs)
    102 @retry_decorator
    103 def _embed_with_retry(**kwargs: Any) -> Any:
--> 104     response = embeddings.client.create(**kwargs)
    105     return _check_response(response, skip_empty=embeddings.skip_empty)

File ~/mambaforge/envs/streamlit-chatbot/lib/python3.10/site-packages/openai/api_resources/embedding.py:38, in Embedding.create(cls, *args, **kwargs)
     35 # If a user specifies base64, we'll just return the encoded string.
     36 # This is only for the default case.
     37 if not user_provided_encoding_format:
---> 38     for data in response.data:
     39 
     40         # If an engine isn't using this optimization, don't do anything
     41         if type(data["embedding"]) == str:
     42             assert_has_numpy()

File ~/mambaforge/envs/streamlit-chatbot/lib/python3.10/site-packages/openai/openai_object.py:61, in OpenAIObject.__getattr__(self, k)
     59     return self[k]
     60 except KeyError as err:
---> 61     raise AttributeError(*err.args)

AttributeError: data
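
The last two frames show the mechanism: `OpenAIObject.__getattr__` looks the attribute up as a dictionary key and re-raises the `KeyError` as an `AttributeError`. So `AttributeError: data` suggests that the response object had no `data` key at all. A minimal sketch of that pattern follows. `ResponseObject` is a hypothetical stand-in, not the real `openai` class, and the error-shaped payload is an assumption made for illustration.

```python
class ResponseObject(dict):
    """Hypothetical stand-in for a dict-backed API response object."""

    def __getattr__(self, k):
        # Attribute access is forwarded to key lookup, as in the traceback.
        try:
            return self[k]
        except KeyError as err:
            raise AttributeError(*err.args)


# A successful embedding response carries a "data" list.
ok = ResponseObject(data=[{"embedding": [0.1, 0.2]}])
print(ok.data[0]["embedding"])

# Assumed shape of a payload without "data" (for example an error body).
bad = ResponseObject(error={"message": "example error"})
try:
    bad.data
except AttributeError as exc:
    print(f"AttributeError: {exc}")
```

Inspecting the raw response (rather than only the final exception) would be one way to see what the endpoint actually returned.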

Expected behavior

The function should complete without an error.

dosubot[bot] commented 1 year ago

🤖

Hello,

Thank you for reaching out and providing a detailed description of the issue you're facing.

Based on the context provided, it seems there might be a misunderstanding about the usage of the FAISS.from_documents(docs, embeddings) and Chroma.from_documents(docs, embeddings) methods.

In the LangChain framework, the FAISS class does not have a from_documents method. Instead, it provides two similar methods: FAISS.from_texts(texts, embeddings) and FAISS.from_embeddings(text_embeddings, embeddings).

The FAISS.from_texts(texts, embeddings) method creates a FAISS vectorstore from a list of documents, while the FAISS.from_embeddings(text_embeddings, embeddings) method creates a FAISS vectorstore from a list of pre-computed text-embedding pairs.

Here's an example of how to use the FAISS.from_embeddings method:

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Assuming docs is a list of Document objects
embeddings = OpenAIEmbeddings()

# Pre-compute one vector per text, then pair each text with its vector
texts = [doc.page_content for doc in docs]
text_embeddings = list(zip(texts, embeddings.embed_documents(texts)))

faiss = FAISS.from_embeddings(text_embeddings, embeddings)

For the Chroma class, it does have a from_documents method. This method creates a Chroma vectorstore from a list of documents. Here's an example of how to use it:

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Assuming docs is a list of your documents
embeddings = OpenAIEmbeddings()
chroma = Chroma.from_documents(docs, embeddings)

Please replace FAISS.from_documents(docs, embeddings) and Chroma.from_documents(docs, embeddings) with the correct method based on your requirements.

If you continue to face issues, please provide more details and I'll be happy to assist you further.

Sources

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

juliuskittler commented 1 year ago

@dosu-bot

Regarding FAISS

You are saying that FAISS does not have a from_documents method. However, in the example it does have such a method. See https://python.langchain.com/docs/integrations/vectorstores/faiss, where there is the code snippet:

db = FAISS.from_documents(docs, embeddings)

query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)

Moreover, FAISS.from_documents results in the printout <bound method VectorStore.from_documents of <class 'langchain.vectorstores.faiss.FAISS'>>.

-> Your answer does not explain why I get the error AttributeError: data when calling FAISS.from_documents.

Regarding Chroma

You are saying that there is indeed a method Chroma.from_documents. I have used this method in my reproducible example.

-> Your answer does not explain why I get the error AttributeError: data when calling Chroma.from_documents.

dosubot[bot] commented 1 year ago

🤖

Hello,

Thank you for reaching out and providing a detailed description of your issue. The AttributeError: data error you're encountering when trying to use the FAISS.from_documents(docs, embeddings) and Chroma.from_documents(docs, embeddings) methods in the LangChain Python framework is likely due to the fact that the from_documents method is expecting a list of Document objects and an Embeddings object as arguments.

The from_documents method is defined in the Chroma class as follows:

@classmethod
def from_documents(
    cls: Type[Chroma],
    documents: List[Document],
    embedding: Optional[Embeddings] = None,
    ids: Optional[List[str]] = None,
    collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
    persist_directory: Optional[str] = None,
    client_settings: Optional[chromadb.config.Settings] = None,
    client: Optional[chromadb.Client] = None,
    collection_metadata: Optional[Dict] = None,
    **kwargs: Any,
) -> Chroma:
    """Create a Chroma vectorstore from a raw documents.

    If a persist_directory is specified, the collection will be persisted there.
    Otherwise, the data will be ephemeral in-memory.
    """
    texts = [doc.page_content for doc in documents]
    metadatas = [doc.metadata for doc in documents]
    return cls.from_texts(
        texts=texts,
        embedding=embedding,
        metadatas=metadatas,
        ids=ids,
        collection_name=collection_name,
        persist_directory=persist_directory,
        client_settings=client_settings,
        client=client,
        collection_metadata=collection_metadata,
        **kwargs,
    )

As you can see, it's extracting the page_content and metadata from each Document in the documents list and then calling the from_texts method with these extracted texts and the provided Embeddings object.
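
The extraction step described above can be tried in isolation. This is only a sketch: `Document` here is a hypothetical stand-in dataclass, not LangChain's class, and `split_documents_for_from_texts` is a made-up helper name that repeats the two list comprehensions from the method body.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for LangChain's Document class."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def split_documents_for_from_texts(documents):
    # Same two comprehensions as in the from_documents body shown above.
    texts = [doc.page_content for doc in documents]
    metadatas = [doc.metadata for doc in documents]
    return texts, metadatas


docs = [
    Document("first chunk", {"source": "state_of_the_union.txt"}),
    Document("second chunk", {"source": "state_of_the_union.txt"}),
]
texts, metadatas = split_documents_for_from_texts(docs)
print(texts)
print(metadatas)
```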

If you're encountering an AttributeError: data error, it's likely because either the documents argument you're passing to from_documents is not a list of Document objects, or the embedding argument is not an Embeddings object.

To fix this error, ensure that the documents argument is a list of Document objects and the embedding argument is an Embeddings object. If you're not sure how to create these objects, you may need to refer to the documentation or source code of the LangChain Python framework for more information.

In the context shared, the Document objects are created with page_content and metadata attributes, and the Embeddings object is created using the OpenAIEmbeddings class. The from_documents method is then called with these Document objects and Embeddings object as arguments. If you're following a similar process and still encountering the error, there may be an issue with the data you're passing to the Document or Embeddings constructors.

I hope this helps! If you have any other questions or need further clarification, please don't hesitate to ask.

Best, Dosu


hchittilla commented 1 year ago

@juliuskittler I was not able to reproduce your issue when using

langchain 0.0.286
faiss-cpu 1.7.4
chromadb 0.4.8

It worked smoothly for me. Not sure if that helps, but maybe my versions will solve your issue.

juliuskittler commented 1 year ago

Alright, thank you for getting back. I am actually getting the same error when using exactly the same versions as you (see below).

langchain 0.0.286
faiss-cpu 1.7.4
chromadb 0.4.8

I have also tried different openai versions (0.28.1, 0.28.0, 0.27.9, 0.27.8, 0.27.7) but the error persists.

I will update this in case I have time to investigate further. For now, FAISS.from_embeddings works for me at least (and I can use it instead of FAISS.from_documents).

juliuskittler commented 1 year ago

The .from_documents method works for me only if I explicitly set chunk_size=1. It does not work for any chunk size larger than 1, which always results in the AttributeError: data error.

In other words, I did the following:

headers = {"x-api-key": os.environ["OPENAI_API_KEY"]}
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002", headers=headers, chunk_size=1)
retriever = FAISS.from_documents(docs, embeddings).as_retriever()

I'm using the Azure OpenAI API. Possibly, it's related to this.
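
For context, the traceback in the original report shows `_get_len_safe_embeddings` looping over `range(0, len(tokens), _chunk_size)` and sending one slice per request, so `chunk_size=1` means one input per API call. A minimal sketch of that slicing, using a hypothetical helper name `batches` and made-up inputs:

```python
def batches(items, chunk_size):
    """Yield consecutive slices of at most chunk_size items."""
    for i in range(0, len(items), chunk_size):
        yield items[i : i + chunk_size]


texts = ["a", "b", "c", "d", "e"]

# chunk_size=1: one request per text (many round trips).
print(list(batches(texts, 1)))

# chunk_size=2: several inputs per request.
print(list(batches(texts, 2)))
```

If the endpoint accepts only a limited number of inputs per request, a larger batch could plausibly come back without a `data` field, which would match the error. That is an assumption, not something confirmed in this thread.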

Franckegao commented 1 year ago

@juliuskittler I got the exact same error as you. The funny thing is that only some of the documents cause this error. The direct OpenAI API doesn't produce the error either.
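
One way to narrow down which documents trigger the error is to embed them one at a time and record the failures. A rough sketch, assuming a hypothetical callable `embed_one` (for example a small wrapper around `embeddings.embed_documents([text])`). The fake embedder and its failure condition are invented so the sketch runs on its own; they say nothing about the real cause.

```python
def find_failing_texts(texts, embed_one):
    """Return (index, error description) pairs for texts that fail to embed."""
    failures = []
    for i, text in enumerate(texts):
        try:
            embed_one(text)
        except Exception as exc:  # broad on purpose: we only want to locate failures
            failures.append((i, f"{type(exc).__name__}: {exc}"))
    return failures


def fake_embed_one(text):
    # Hypothetical failure condition, for demonstration only.
    if not text.strip():
        raise AttributeError("data")
    return [0.0]


print(find_failing_texts(["hello", "   ", "world"], fake_embed_one))
```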

juliuskittler commented 1 year ago

@Franckegao did you try to set chunk_size=1 (see my previous comment)?

Franckegao commented 1 year ago

@juliuskittler Yes, but it would make the process super slow

elenmari commented 10 months ago

Hi,

In connection with this, I understand that "FAISS.from_documents" returns a list of Document objects. Sorry if it's a dumb question, but how could you save this locally and parse it back as such? I am experimenting with how to structure my data for embeddings, so I need to test each structure locally rather than calling the OpenAI API each time, which is expensive.

Thanks

Franck-Dernoncourt commented 9 months ago

@elenmari https://python.langchain.com/docs/integrations/vectorstores/faiss:

db.save_local("faiss_index")
new_db = FAISS.load_local("faiss_index", embeddings)

Full example code:

'''
Python 3.10
pip install langchain langchain_openai sentence-transformers langchain_community

CPU:
pip install faiss-cpu

GPU:
pip install faiss-gpu
'''

from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings

model = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
embeddings = HuggingFaceEmbeddings(model_name = model)

def main():
    # https://python.langchain.com/docs/integrations/vectorstores/faiss

    # Uncomment the following lines if you need to initialize FAISS with no AVX2 optimization
    # import os
    # os.environ['FAISS_NO_AVX2'] = '1'

    loader = TextLoader("./state_of_the_union.txt", encoding='utf-8')
    documents = loader.load()
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)
    #embeddings = OpenAIEmbeddings()
    db = FAISS.from_documents(docs, embeddings)
    db.save_local("faiss_index")
    new_db = FAISS.load_local("faiss_index", embeddings)

    query = "What did the president say about Ketanji Brown Jackson"
    docs = new_db.similarity_search_with_score(query)
    print(docs)

if __name__ == '__main__':
    main()

elenmari commented 9 months ago

@Franck-Dernoncourt sweet! Thank you.

KholmogorovEA commented 9 months ago

hey there, brilliantly!

riokomoo12356 commented 8 months ago

@Franck-Dernoncourt thank you for your example. I am trying the same with PyPDFLoader and am facing a similar issue - "BadRequestError: Unsupported data type". My code snippet:

# Use the LangChain PDF loader
loader = PyPDFLoader(fileName)
documents = loader.load_and_split()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

docs = text_splitter.split_documents(documents)

# Use LangChain to create the embeddings using text-embedding-ada-002
db = FAISS.from_documents(docs, embedding=embeddings)
