langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License
91.25k stars 14.51k forks source link

Why doesn't the TileDB vector store implementation have a "from_documents" method when the instructions say it does... #22964

Open BBC-Esq opened 2 months ago

BBC-Esq commented 2 months ago

Checked other resources

Example Code

using the standard tiledb code located in tiledb.py

Error Message and Stack Trace (if applicable)

there's no specific error to provide

Description

tiledb source code doesn't have "from_documents" method despite the online instructions saying it does.

System Info

windows 10

keenborder786 commented 2 months ago

TileDB is the child class of VectorStore from Core. The VectorStore has the class method called from_document here

ashishpatel26 commented 2 months ago

Resolution:

Investigation and Resolution Steps:

  1. Verify Documentation:

    • Confirm if the TileDB vector store documentation indeed states that the from_documents method should exist.
    • If it does, there may be a discrepancy between the documentation and the actual implementation.
  2. Code Review:

    • Review the TileDB vector store implementation in tiledb.py to confirm the absence of the from_documents method.
    • Cross-reference with the base VectorStore class in Core to see if the method should be inherited or needs explicit implementation.
  3. Possible Fix:

    • If from_documents should be inherited, ensure TileDB properly inherits and exposes this method.
    • If TileDB needs an explicit implementation, add the method based on the base class definition and functionality.
  4. Update Documentation:

    • If the method should not exist, correct the documentation to avoid future confusion.

Proposed Code Fix: Add the from_documents method to the TileDB vector store implementation if it should be explicitly defined.

# File path: tiledb.py

from vector_store import VectorStore  # Assuming VectorStore is defined in vector_store module

class TileDB(VectorStore):
    @classmethod
    def from_documents(cls, documents, embedding_function):
        """
        Creates a TileDB vector store from a list of documents.

        Args:
            documents (list): List of documents to be stored.
            embedding_function (callable): Function to convert documents to embeddings.

        Returns:
            TileDB: An instance of the TileDB vector store.
        """
        # Convert documents to embeddings
        embeddings = [embedding_function(doc) for doc in documents]
        # Create an instance of TileDB and store embeddings
        instance = cls()
        instance.store_embeddings(embeddings)
        return instance

    def store_embeddings(self, embeddings):
        """
        Store embeddings in the TileDB storage.

        Args:
            embeddings (list): List of embeddings to be stored.
        """
        # Implement the logic to store embeddings in TileDB
        pass

Follow-up:


a. Implement the proposed code fix for the TileDB vector store. b. Update the LangChain documentation to accurately reflect the methods available in TileDB vector store.

BBC-Esq commented 2 months ago

When I try to use the from_documents method, however, it keeps giving me various errors and, moreover, will simply exit early without giving any error; for example:

AttributeError: 'int' object has no attribute '_length_function'
TypeError: 'str_iterator' object does not support the context manager protocol
TypeError: 'dict' object is not callable
TypeError: unsupported operand type(s) for +: 'int' and 'str'

After approximately 2 days of painstaking troubleshooting, I was able to hack a solution by modifying the sentence-transformers library itself; specifically the _text_length method within SentenceTransformers.py. This modification enables _text_length to accept a "single string" in addition to a "list of strings" like it currently does.

Here is the hack:

def _text_length(self, text: Union[str, List[str], List[int], List[List[int]]]):
    """
    Help function to get the length for the input text. Text can be either
    a list of ints (which means a single text as input), or a tuple of list of ints
    (representing several text inputs to the model).
    """
    if isinstance(text, str):
        return len(text)
    elif isinstance(text, dict):  # {key: value} case
        return len(next(iter(text.values())))
    elif not hasattr(text, "__len__"):  # Object has no len() method
        return 1
    elif len(text) == 0 or isinstance(text[0], int):  # Empty string or list of ints
        return len(text)
    else:
        return sum([len(t) for t in text])  # Sum of length of individual strings

Obviously this hack isn't ideal because it requires modifying the sentence-transformers source code and could lead to instabilities with other functionalities of the sentence-transformers library...

Another solution that AI suggested is to use the from_texts method fro TileDB instead. However, this would require decomposing all of the "document objects" into two lists...

1) A list of strings constituting the "chunks; and 2) A list of dictionaries constituting all of the metadata formerly contained within the document object.

I haven't tried this solution yet, but instead submitted this issue.

Here's how I'm using TileDB. Please note...in my program the "texts" variable is a list of "document objects" and the "embedding" parameter uses either HuggingFaceEmbeddings, HuggingFaceInstructEmbeddings or HuggingFaceBgeEmbeddings:

        db = TileDB.from_documents(
            documents=texts,
            embedding=embeddings,
            index_uri=str(self.PERSIST_DIRECTORY),
            allow_dangerous_deserialization=True,
            metric="euclidean",
            index_type="FLAT",
        )