phidatahq / phidata

Build AI Assistants with memory, knowledge and tools.
https://docs.phidata.com
Mozilla Public License 2.0
11.22k stars 1.66k forks source link

Cannot use non-vectorstore retrievers in `LangChainKnowledgeBase` #1096

Open Joelius300 opened 1 month ago

Joelius300 commented 1 month ago

LangChainKnowledgeBase works great when you already have a LangChain-based retrieval setup that uses vectorstores, but when you want something more basic, like a TFIDFRetriever, you cannot use it because of an instance check.

https://github.com/phidatahq/phidata/blob/3a39189d5a9af5194cce6dc594df823f7e8f9815/phi/knowledge/langchain.py#L37-L38

Looking at the source of VectorStoreRetriever, it doesn't override the invoke method, which is the only one the LangChainKnowledgeBase uses (here specifically).

After some manual testing for confirmation, I believe it would be best to relax the check to BaseRetriever instead, which defines/implements[^1] the required invoke method. What do you think? :)

[^1]: technically, it's already defined by Runnable, but I would argue it makes much more sense semantically to do an instance check on BaseRetriever.

Joelius300 commented 4 days ago

Motivation / Example

The following snippet does not work and throws the above mentioned ValueError. This is despite that fact that it would work perfectly fine and traditional retrieval methods that aren't based on vector stores are just as valid in that scenario.

Taken from https://github.com/Joelius300/cudeschin-rag/commit/73ae0992b9

loader = DirectoryLoader(cudeschin_path / "content/de", glob="**/*.md",
                         loader_cls=TextLoader,
                         loader_kwargs=dict(encoding="utf-8"))
splitter = MarkdownTextSplitter(is_separator_regex=True)
docs = splitter.split_documents(loader.load())

retriever = TFIDFRetriever.from_documents(docs, k=n_documents)
return LangChainKnowledgeBase(retriever=retriever, num_documents=n_documents)

To work around this issue, I created a custom knowledge base as follows and replaced LangChainKnowledgeBase above with it:

class LessPickyLangChainKnowledgeBase(LangChainKnowledgeBase):
    def search(self, query: str, num_documents: Optional[int] = None) -> List[LangChainDocument]:
        if self.retriever is None:
            raise ValueError("must provide retriever")
        if not isinstance(self.retriever, BaseRetriever):  # no reason for VectorStoreRetriever
            raise ValueError(f"Retriever is not of type BaseRetriever: {self.retriever}")
        _num_documents = num_documents or self.num_documents
        # logger.debug(f"Getting {_num_documents} relevant documents for query: {query}")
        lc_documents: List[LangChainDocument] = self.retriever.invoke(input=query)
        documents = []
        for lc_doc in lc_documents:
            documents.append(
                Document(
                    content=lc_doc.page_content,
                    meta_data=lc_doc.metadata,
                )
            )
        return documents