Closed hudsonmendes closed 8 months ago
🤖
Hello @hudsonmendes,
I'm Dosu, a bot developed to assist you with questions, bugs and even help you become a contributor to LangChain. While we wait for a human maintainer, I'm here to ensure you get the help you need promptly.
I'm currently looking into your issue and will get back to you with a comprehensive response shortly.
I might be able to help with this one. If I managed to take it over, I will assign the issue to myself and send a PR.
🤖
Based on the information you've provided and the context from the LangChain repository, it seems that the _similarity_search_with_relevance_scores method in the vectorstores/docarray module raises a NotImplementedError because it's an abstract method that is intended to be overridden by subclasses. This is a common practice in object-oriented programming when a base class wants to declare a method but provide no implementation, leaving it up to subclasses to provide a specific implementation.
In the provided code, the DocArrayIndex class does not provide an implementation for the _similarity_search_with_relevance_scores method. This method is expected to return documents and relevance scores, normalized on a scale from 0 to 1, for a given query. However, the method simply raises a NotImplementedError, indicating that it must be implemented in a subclass.
The error you're seeing is not related to the Python version or the MacOS version you're using. It occurs simply because this method has not been implemented yet in the DocArrayIndex class or in any of its subclasses.
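The pattern described above can be sketched in a few lines. The class names below (BaseStore, StoreWithoutScores, StoreWithScores) and the helper supports_relevance_scores are hypothetical stand-ins, not LangChain classes; the sketch only shows how a base-class method that raises NotImplementedError behaves until a subclass overrides it.

```python
from typing import List, Tuple


class BaseStore:
    """Hypothetical stand-in for a vector store base class."""

    def _similarity_search_with_relevance_scores(
        self, query: str, k: int = 4
    ) -> List[Tuple[str, float]]:
        # No default behaviour: subclasses are expected to override this.
        raise NotImplementedError


class StoreWithoutScores(BaseStore):
    """Inherits the base method unchanged, so calling it still raises."""


class StoreWithScores(BaseStore):
    """Overrides the method with a dummy implementation."""

    def _similarity_search_with_relevance_scores(
        self, query: str, k: int = 4
    ) -> List[Tuple[str, float]]:
        # Dummy result for illustration only.
        return [("doc about " + query, 1.0)][:k]


def supports_relevance_scores(store: BaseStore) -> bool:
    """Return True if the call succeeds, False if it raises NotImplementedError."""
    try:
        store._similarity_search_with_relevance_scores("anything")
    except NotImplementedError:
        return False
    return True
```

Note that the error only surfaces when the method is actually called, which is why it appears at query time rather than when the store is created.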
There are a few similar issues in the LangChain repository that have been solved:
search_type="similarity_score_threshold", retriever returns negative scores
The solutions suggested in these issues involve updating LangChain to the latest version, using a different retriever that supports the similarity_score_threshold feature, or implementing the _aget_relevant_documents method in the Pinecone retriever class yourself.
If you're comfortable with Python and have a good understanding of the LangChain framework, you could try to implement the _similarity_search_with_relevance_scores method in a subclass of DocArrayIndex. This would involve defining how the subclass should handle the similarity_score_threshold search type.
I hope this information helps you in resolving the issue. If you have any further questions or need more clarification, feel free to ask.
search_type="similarity_score_threshold", retriever returns negative scores
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Thanks for the clarification.
I see that _similarity_search_with_relevance_scores is treated as an abstract method. Readability could potentially be improved by annotating the method with @abstractmethod (from the abc module). However, I understand that doing so would force subclasses to implement it, breaking backwards compatibility and even leading to a different design than the one intended. So that part is clear to me.
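To make the trade-off concrete, here is a minimal sketch of what @abstractmethod enforces. The class names (AbstractStore, IncompleteStore, CompleteStore) and the helper can_instantiate are hypothetical and are not part of LangChain; the point is only that a subclass that skips the method can no longer be instantiated at all.

```python
from abc import ABC, abstractmethod


class AbstractStore(ABC):
    """Hypothetical base class that declares the method as abstract."""

    @abstractmethod
    def _similarity_search_with_relevance_scores(self, query: str, k: int = 4):
        """Subclasses must implement this."""


class IncompleteStore(AbstractStore):
    """Does not implement the abstract method."""


class CompleteStore(AbstractStore):
    def _similarity_search_with_relevance_scores(self, query: str, k: int = 4):
        # Dummy implementation for illustration only.
        return []


def can_instantiate(cls) -> bool:
    """Return False if creating an instance raises TypeError."""
    try:
        cls()
    except TypeError:
        return False
    return True
```

Here can_instantiate(IncompleteStore) is False, because Python refuses to instantiate a class with unimplemented abstract methods. That is the backwards-compatibility concern: existing subclasses that never needed relevance scores would stop working.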
The class that I referred to as lacking an implementation of _similarity_search_with_relevance_scores is the base class of the docarray implementation. Following up on your recommendation, I believe there should be an overriding implementation of this method in the DocArrayInMemorySearch class, which is currently lacking.
Again, I can go ahead and implement this, if we decide that my proposed design is reasonable.
My proposal is that DocArrayInMemorySearch receives an implementation of _similarity_search_with_relevance_scores, perhaps using numpy to calculate similarity, according to the distance metric available in self.embedding, should there be more than one option.
File: libs/langchain/langchain/vectorstores/docarray/in_memory.py
Class: DocArrayInMemorySearch(DocArrayIndex)
def _similarity_search_with_relevance_scores(
    self,
    query: str,
    k: int = 4,
    **kwargs: Any,
) -> List[Tuple[Document, float]]:
    """Return docs and relevance scores, normalized on a scale from 0 to 1.

    0 is dissimilar, 1 is most similar.
    """
    # TODO: use `self.embeddings` to discover which distance metric should be used
    # TODO: implementation of the distance-based search
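As a rough illustration of the distance-based search left as a TODO above, the following sketch ranks document vectors by cosine similarity to a query vector. It is not the DocArray or LangChain implementation: the function names are hypothetical, cosine is assumed as the metric, and plain Python is used instead of numpy so that the example stays self-contained.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_cosine(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 4,
) -> List[Tuple[int, float]]:
    """Return (document index, cosine similarity) for the k most similar vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    # Higher cosine similarity means more similar, so sort in descending order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

In a real implementation the query vector would come from embedding the query string, and the indices would be mapped back to Document objects; both steps are omitted here.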
@hudsonmendes What looks like a hack to fix this is to return self.similarity_search_with_score(query, k=k, **kwargs). Deeper inspection shows there are some design (or maybe just doc) issues, e.g. similarity_search_with_score returns the cosine similarity score (higher number is more similar), NOT relevance (lower number is more similar).
@mikquinlan, _similarity_search_with_relevance_scores could certainly be developed to be consistent with the other vector stores. To go from cosine similarity score to relevance score, we could simply return 1 - cosine_similarity (or a similar transformation) and ensure it is consistent with what the other vector stores return in their implementations.
I'd do so in order to avoid breaking changes to the API.
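For reference, here is a small sketch of two candidate transformations; both function names are hypothetical. The first is the 1 - cosine_similarity mapping mentioned above, which behaves like a distance (lower means more similar). The second is an assumed alternative that maps cosine similarity onto the 0 to 1 scale described in the docstring quoted earlier in this thread (0 is dissimilar, 1 is most similar). Which convention the other vector stores actually follow would need to be checked against their implementations.

```python
def cosine_distance(cos_sim: float) -> float:
    """1 - cosine similarity: 0.0 for identical direction, 2.0 for opposite."""
    return 1.0 - cos_sim


def cosine_relevance(cos_sim: float) -> float:
    """Map cosine similarity from [-1, 1] to [0, 1], with 1.0 most similar.

    Assumption: a linear rescaling is acceptable as a relevance score.
    """
    return (1.0 + cos_sim) / 2.0
```

The two functions order results in opposite directions, so any score threshold has to be compared in the direction that matches the chosen convention.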
However, I do believe there are some bigger design problems, such as the lack of @abstractmethod annotations. Forcing subclasses to implement this method explicitly, even when it is not supported, might be a better practice.
I'd start small: solve the issue of docarray vector stores not supporting "similarity_score_threshold", and perhaps raise another issue to deal with the (possible) design issues, which would probably be a larger discussion.
System Info
Langchain 0.0.329 Python 3.10.11 MacOS 12.7.1 (21G920)
Who can help?
No response
Information
Related Components
Reproduction
Error Message
Expected behavior
The {context} variable provided to the prompt should only be stuffed with chunks retrieved with a relevance score above the threshold.
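The expected behavior can be sketched as a simple filter over (chunk, relevance score) pairs. The function name filter_by_score_threshold is hypothetical and this is not LangChain's retriever code; it assumes relevance scores where higher means more similar, and it treats a score equal to the threshold as passing.

```python
from typing import List, Tuple


def filter_by_score_threshold(
    docs_and_scores: List[Tuple[str, float]],
    score_threshold: float,
) -> List[str]:
    """Keep only the chunks whose relevance score is at or above the threshold."""
    return [doc for doc, score in docs_and_scores if score >= score_threshold]
```

Only the chunks returned by such a filter would then be joined into the {context} variable of the prompt.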