langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License
vectorstores/docarray: `_similarity_search_with_relevance_scores` raises a NotImplementedError #12843

Closed hudsonmendes closed 8 months ago

hudsonmendes commented 1 year ago

System Info

LangChain 0.0.329, Python 3.10.11, macOS 12.7.1 (21G920)

Reproduction


# Dependencies
import pathlib
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores.docarray import DocArrayInMemorySearch
from langchain.llms.huggingface_pipeline import HuggingFacePipeline
from langchain import hub
from langchain.chains import RetrievalQA

# Setup
dir_data = pathlib.Path("../data_sample")
document_loader = DirectoryLoader(dir_data, show_progress=True)
documents = document_loader.load()
document_chunker = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=5)
document_chunks = document_chunker.split_documents(documents)
embeddings = HuggingFaceEmbeddings(model_name="multi-qa-MiniLM-L6-cos-v1")
vector_store = DocArrayInMemorySearch.from_documents(document_chunks, embeddings)
llm = HuggingFacePipeline.from_model_id(
    task="text2text-generation",
    model_id="google/flan-t5-small",
    model_kwargs=dict(temperature=0.01, max_length=128, do_sample=True),
)
qa_rag_prompt = hub.pull("rlm/rag-prompt")
qa = RetrievalQA.from_chain_type(
    llm,
    retriever=vector_store.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.5}),
    chain_type_kwargs={"prompt": qa_rag_prompt},
    return_source_documents=True,
)

# OK: Supported by Vector Store (DocArrayInMemorySearch)
question = "What is the greatest ocean in the world?"
vector_store.similarity_search_with_score(question)

# NOK: NotImplementedError raised in `docarray/base.py`
answer = qa({"query": question})  # will fail

Error Message

    def _similarity_search_with_relevance_scores(
        self,
        query: str,
        k: int = 4,
        **kwargs: Any,
    ) -> List[Tuple[Document, float]]:
        """Return docs and relevance scores, normalized on a scale from 0 to 1.

        0 is dissimilar, 1 is most similar.
        """
>       raise NotImplementedError()
E       NotImplementedError

venv/lib/python3.10/site-packages/langchain/vectorstores/docarray/base.py:140: NotImplementedError

Expected behavior

The {context} variable provided to the prompt should be stuffed only with chunks retrieved with a relevance score above the threshold.
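
A minimal sketch of that expectation, assuming the retriever already holds (chunk, relevance score) pairs. `filter_by_threshold` and the sample data are hypothetical names for illustration, not LangChain API, and whether the comparison should be strict or inclusive is an assumption here:

```python
from typing import List, Tuple


def filter_by_threshold(
    scored_chunks: List[Tuple[str, float]], score_threshold: float
) -> List[str]:
    """Keep only the chunks whose relevance score meets the threshold."""
    return [chunk for chunk, score in scored_chunks if score >= score_threshold]


# Hypothetical retrieval results: (chunk text, relevance score in [0, 1]).
scored = [
    ("The Pacific is the largest ocean.", 0.82),
    ("Bread needs yeast.", 0.12),
]

# Only the surviving chunks would be joined into the {context} variable.
context = "\n\n".join(filter_by_threshold(scored, score_threshold=0.5))
```

With a threshold of 0.5, only the first chunk would reach the prompt.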

dosubot[bot] commented 1 year ago

🤖

Hello @hudsonmendes,

I'm Dosu, a bot developed to assist you with questions, bugs and even help you become a contributor to LangChain. While we wait for a human maintainer, I'm here to ensure you get the help you need promptly.

I'm currently looking into your issue and will get back to you with a comprehensive response shortly.

hudsonmendes commented 1 year ago

I might be able to help with this one. If I manage to take it over, I will assign the issue to myself and send a PR.

dosubot[bot] commented 1 year ago

🤖

Based on the information you've provided and the context from the LangChain repository, it seems like the _similarity_search_with_relevance_scores method in the vectorstores/docarray module is raising a NotImplementedError because it's an abstract method that is intended to be overridden by subclasses. This is a common practice in object-oriented programming when a base class wants to declare a method but provide no implementation, leaving it up to subclasses to provide a specific implementation.

In the provided code, the DocArrayIndex class does not provide an implementation for the _similarity_search_with_relevance_scores method. This method is expected to return documents and relevance scores, normalized on a scale from 0 to 1, for a given query. However, the method simply raises a NotImplementedError, indicating that it must be implemented in a subclass.

The error you're seeing is not related to the Python version or the MacOS version you're using. It's simply because this method has not been implemented yet in the DocArrayIndex class or in any of its subclasses.

There are a few similar issues in the LangChain repository that have been solved:

  1. Why is this implementation of vectorstore not working?
  2. When search_type="similarity_score_threshold", retriever returns negative scores
  3. similarity_score_threshold NotImplementedError

The solutions suggested in these issues involve updating LangChain to the latest version, using a different retriever that supports the similarity_score_threshold feature, or implementing the _aget_relevant_documents method in the Pinecone retriever class yourself.

If you're comfortable with Python and have a good understanding of the LangChain framework, you could try to implement the _similarity_search_with_relevance_scores method in a subclass of DocArrayIndex. This would involve defining how the subclass should handle the similarity_score_threshold search type.

I hope this information helps you in resolving the issue. If you have any further questions or need more clarification, feel free to ask.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.


hudsonmendes commented 1 year ago

Thanks for the clarification.

I see that _similarity_search_with_relevance_scores is seen as an abstract method. Readability could potentially be improved by annotating the method with @abstractmethod (from the abc package). However, I understand that by doing so, it would force subclasses to implement it, breaking backwards compatibility and even leading to a different design than the one intended. So that part is clear to me.
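
To make the trade-off concrete, here is a small sketch contrasting the two conventions. The class names are hypothetical and deliberately do not mirror the real LangChain hierarchy; the point is only when the failure surfaces.

```python
from abc import ABC, abstractmethod


class LooseBase:
    """Convention seen in the traceback: fails only when the method is called."""

    def _similarity_search_with_relevance_scores(self, query: str, k: int = 4):
        raise NotImplementedError()


class StrictBase(ABC):
    """Alternative: subclasses must override before they can be instantiated."""

    @abstractmethod
    def _similarity_search_with_relevance_scores(self, query: str, k: int = 4):
        ...


class LooseStore(LooseBase):
    pass  # no override


class StrictStore(StrictBase):
    pass  # no override


store = LooseStore()  # instantiation succeeds; the gap surfaces later
try:
    store._similarity_search_with_relevance_scores("q")
    loose_outcome = "no error"
except NotImplementedError:
    loose_outcome = "fails at call time"

try:
    StrictStore()  # abc refuses to instantiate a class with abstract methods
    strict_outcome = "no error"
except TypeError:
    strict_outcome = "fails at instantiation"
```

The strict variant reports the missing override earlier, but every existing subclass without an override would stop instantiating, which is the backwards-compatibility concern mentioned above.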

The class that I referred to as lacking an implementation of _similarity_search_with_relevance_scores is the base class of the docarray implementation. Following up on your recommendation, I believe the DocArrayInMemorySearch class should provide an overriding implementation of this method, which it currently lacks.

Again, I can go ahead and implement this if we decide that my proposed design is reasonable.

Proposed Design

My proposal is that DocArrayInMemorySearch receives an implementation of _similarity_search_with_relevance_scores, perhaps using numpy to calculate similarity according to the distance metric available in self.embedding, should there be more than one option.

File: libs/langchain/langchain/vectorstores/docarray/in_memory.py
Class: DocArrayInMemorySearch(DocArrayIndex)

def _similarity_search_with_relevance_scores(
        self,
        query: str,
        k: int = 4,
        **kwargs: Any,
    ) -> List[Tuple[Document, float]]:
        """Return docs and relevance scores, normalized on a scale from 0 to 1.

        0 is dissimilar, 1 is most similar.
        """
        # TODO: use `self.embeddings` to discover which distance metric should be used
        # TODO: implementation of the distance-based search

mikquinlan commented 11 months ago

@hudsonmendes The (what looks like) hack to fix this is to return self.similarity_search_with_score(query, k=k, **kwargs). Deeper inspection shows there are some design (or maybe just doc) issues. e.g. similarity_search_with_score returns the cosine similarity score (higher number more similar) NOT relevance (lower number more similar).
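
The workaround can be sketched in isolation as follows. `FakeDocArrayStore` is a hypothetical stand-in so the snippet runs without LangChain; with the real library the override would presumably live in a subclass of DocArrayInMemorySearch. The caveat about score semantics still applies, since the scores are passed through without normalization.

```python
from typing import Any, List, Tuple


class FakeDocArrayStore:
    """Hypothetical stand-in for DocArrayInMemorySearch, for illustration only."""

    def __init__(self, scored: List[Tuple[str, float]]):
        self._scored = scored

    def similarity_search_with_score(
        self, query: str, k: int = 4, **kwargs: Any
    ) -> List[Tuple[str, float]]:
        # Pretend these are (document, cosine similarity) pairs, best first.
        return sorted(self._scored, key=lambda pair: pair[1], reverse=True)[:k]

    def _similarity_search_with_relevance_scores(
        self, query: str, k: int = 4, **kwargs: Any
    ) -> List[Tuple[str, float]]:
        raise NotImplementedError()


class PatchedStore(FakeDocArrayStore):
    def _similarity_search_with_relevance_scores(
        self, query: str, k: int = 4, **kwargs: Any
    ) -> List[Tuple[str, float]]:
        # The workaround: delegate to the scoring method that already exists.
        return self.similarity_search_with_score(query, k=k, **kwargs)


store = PatchedStore([("a", 0.2), ("b", 0.9), ("c", 0.6)])
top = store._similarity_search_with_relevance_scores("any query", k=2)
```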

hudsonmendes commented 11 months ago

@mikquinlan, _similarity_search_with_relevance_scores could certainly be developed to be consistent with the other vector stores. To go from cosine similarity score to relevance score, we could simply return 1 - cosine_similarity (or a similar transformation) and ensure it is consistent with what the other vector stores return in their implementations.

I'd do so in order to avoid breaking changes to the API.
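
Which transformation fits depends on what the store actually returns, so the following is only a sketch under stated assumptions. The docstring quoted in the issue says 1 is most similar; if the raw value is a cosine similarity in [-1, 1], a linear rescale keeps that ordering, whereas `1 - x` keeps it only when `x` is a cosine distance. The function names are hypothetical, and pure Python is used instead of numpy to keep the snippet dependency-free.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def relevance_from_similarity(similarity: float) -> float:
    """Rescale cosine similarity from [-1, 1] to [0, 1]; 1 stays most similar."""
    return (1.0 + similarity) / 2.0


query = [1.0, 0.0]
same = relevance_from_similarity(cosine_similarity(query, [1.0, 0.0]))
orthogonal = relevance_from_similarity(cosine_similarity(query, [0.0, 1.0]))
opposite = relevance_from_similarity(cosine_similarity(query, [-1.0, 0.0]))
```

Under this rescale an identical vector scores highest and an opposed vector lowest, so a "score_threshold" of 0.5 would separate vectors with positive similarity from those with negative similarity.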

However, I do believe there are some bigger design problems, such as the lack of @abstractmethod annotations. Forcing subclasses to explicitly implement this method, even when it is not supported, might be better practice.

I'd start small: solve the issue of docarray vector stores not supporting "similarity_score_threshold", and perhaps raise another issue to deal with the (possible) design problems, which would probably be a larger discussion.