Open Joelius300 opened 1 month ago
The following snippet does not work and throws the above mentioned ValueError
. This is despite that fact that it would work perfectly fine and traditional retrieval methods that aren't based on vector stores are just as valid in that scenario.
Taken from https://github.com/Joelius300/cudeschin-rag/commit/73ae0992b9
loader = DirectoryLoader(cudeschin_path / "content/de", glob="**/*.md",
loader_cls=TextLoader,
loader_kwargs=dict(encoding="utf-8"))
splitter = MarkdownTextSplitter(is_separator_regex=True)
docs = splitter.split_documents(loader.load())
retriever = TFIDFRetriever.from_documents(docs, k=n_documents)
return LangChainKnowledgeBase(retriever=retriever, num_documents=n_documents)
To work around this issue, I created a custom knowledge base as follows and replaced LangChainKnowledgeBase
above with it:
class LessPickyLangChainKnowledgeBase(LangChainKnowledgeBase):
def search(self, query: str, num_documents: Optional[int] = None) -> List[LangChainDocument]:
if self.retriever is None:
raise ValueError("must provide retriever")
if not isinstance(self.retriever, BaseRetriever): # no reason for VectorStoreRetriever
raise ValueError(f"Retriever is not of type BaseRetriever: {self.retriever}")
_num_documents = num_documents or self.num_documents
# logger.debug(f"Getting {_num_documents} relevant documents for query: {query}")
lc_documents: List[LangChainDocument] = self.retriever.invoke(input=query)
documents = []
for lc_doc in lc_documents:
documents.append(
Document(
content=lc_doc.page_content,
meta_data=lc_doc.metadata,
)
)
return documents
LangChainKnowledgeBase
works great when you already have a LangChain-based retrieval setup that uses vectorstores, but when you want something more basic, like aTFIDFRetriever
, you cannot use it because of an instance check.https://github.com/phidatahq/phidata/blob/3a39189d5a9af5194cce6dc594df823f7e8f9815/phi/knowledge/langchain.py#L37-L38
Looking at the source of
VectorStoreRetriever
, it doesn't override theinvoke
method, which is the only one theLangChainKnowledgeBase
uses (here specifically).After some manual testing for confirmation, I believe it would be best to relax the check to
BaseRetriever
instead, which defines/implements[^1] the requiredinvoke
method. What do you think? :)[^1]: technically, it's already defined by
Runnable
, but I would argue it makes much more sense semantically to do an instance check onBaseRetriever
.