langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

SelfQueryRetriever returns duplicate document data when used with chromaDB #9532

Closed gokulstevee closed 11 months ago

gokulstevee commented 1 year ago

System Info

Langchain version = 0.0.268

Reproduction

I have a list of objects in roadmap.json, and I tried to reproduce what is shown in the SelfQueryRetriever documentation:

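import json

from langchain.schema import Document
from langchain.vectorstores import Chroma

# Load the list of roadmap objects.
with open('roadmap.json') as json_file:
  allRoadmap = json.load(json_file)

docs = []

# Build one Document per (roadmap, section) pair.
for roadmap in allRoadmap:
  print(roadmap["name"])
  for section in roadmap["sections"]:
    single_doc = Document(
        page_content=f"This section is related to {roadmap['name']}",
        metadata={"roadmapName": roadmap["name"], "sectionTopic": section["name"]}
    )
    docs.append(single_doc)
    print(section["name"])
  print("\n")

# `embeddings` is assumed to be an embeddings instance created earlier (not shown in this post).
vectorstore = Chroma.from_documents(docs, embeddings)

print(vectorstore)

for doc in docs:
  print(doc)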

The above code outputs:

page_content='This section is related to HTML for beginners' metadata={'roadmapName': 'HTML for beginners', 'sectionTopic': 'HTML Basics'}
page_content='This section is related to HTML for beginners' metadata={'roadmapName': 'HTML for beginners', 'sectionTopic': 'HTML Tags and Elements'}
page_content='This section is related to HTML for beginners' metadata={'roadmapName': 'HTML for beginners', 'sectionTopic': 'Intermediate Concepts'}
page_content='This section is related to HTML for beginners' metadata={'roadmapName': 'HTML for beginners', 'sectionTopic': 'Advanced Concepts'}
page_content='This section is related to How Search Engine Works' metadata={'roadmapName': 'How Search Engine Works', 'sectionTopic': 'Internet'}
page_content='This section is related to How Search Engine Works' metadata={'roadmapName': 'How Search Engine Works', 'sectionTopic': 'Search Engine'}
page_content='This section is related to Github Roadmap' metadata={'roadmapName': 'Github Roadmap', 'sectionTopic': 'Section 1'}
page_content='This section is related to Bootstrap' metadata={'roadmapName': 'Bootstrap', 'sectionTopic': 'Introduction'}
page_content='This section is related to Bootstrap' metadata={'roadmapName': 'Bootstrap', 'sectionTopic': 'Concepts'}
page_content='This section is related to Bootstrap' metadata={'roadmapName': 'Bootstrap', 'sectionTopic': 'Hands On'}
page_content='This section is related to MongoDB' metadata={'roadmapName': 'MongoDB', 'sectionTopic': 'Introduction'}
page_content='This section is related to MongoDB' metadata={'roadmapName': 'MongoDB', 'sectionTopic': 'Queries'}
page_content='This section is related to MongoDB' metadata={'roadmapName': 'MongoDB', 'sectionTopic': 'Data Modeling'}
page_content='This section is related to MongoDB' metadata={'roadmapName': 'MongoDB', 'sectionTopic': 'Aggregation'}
page_content='This section is related to Python for Beginers' metadata={'roadmapName': 'Python for Beginers', 'sectionTopic': 'Getting Started'}
page_content='This section is related to Python for Beginers' metadata={'roadmapName': 'Python for Beginers', 'sectionTopic': 'Basics'}
page_content='This section is related to Python for Beginers' metadata={'roadmapName': 'Python for Beginers', 'sectionTopic': 'Conditions and Loops'}
page_content='This section is related to Python for Beginers' metadata={'roadmapName': 'Python for Beginers', 'sectionTopic': 'Arrays'}
page_content='This section is related to Python for Beginers' metadata={'roadmapName': 'Python for Beginers', 'sectionTopic': 'Functions'}
page_content='This section is related to Python for Beginers' metadata={'roadmapName': 'Python for Beginers', 'sectionTopic': 'Solving Problems'}
page_content='This section is related to CSS for beginners' metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Basics'}
page_content='This section is related to CSS for beginners' metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Color and Background'}
page_content='This section is related to CSS for beginners' metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Typography and Fonts'}
page_content='This section is related to CSS for beginners' metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Spacing in CSS'}
page_content='This section is related to CSS for beginners' metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Basic Styling'}
page_content='This section is related to CSS for beginners' metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Positioning Techniques'}
page_content='This section is related to CSS for beginners' metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Advanced Styling'}
page_content='This section is related to Google Colab' metadata={'roadmapName': 'Google Colab', 'sectionTopic': 'Section 1'}
page_content='This section is related to MySQL' metadata={'roadmapName': 'MySQL', 'sectionTopic': 'Section 1'}
page_content='This section is related to Docker' metadata={'roadmapName': 'Docker', 'sectionTopic': 'Section 1'}
page_content='This section is related to AWS Lambda' metadata={'roadmapName': 'AWS Lambda', 'sectionTopic': 'Section 1'}
page_content='This section is related to Java' metadata={'roadmapName': 'Java', 'sectionTopic': 'Section 1'}

I then tried to retrieve data using SelfQueryRetriever:

from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(
        name="sectionTopic",
        description="The section topic of the roadmap",
        type="string",
    ),
    AttributeInfo(
        name="roadmapName",
        description="Name of the roadmap",
        type="string",
    ),
]
document_content_description = "Roadmap section topics"
llm = OpenAI(temperature=0.1)
retriever = SelfQueryRetriever.from_llm(
    llm, vectorstore, document_content_description, metadata_field_info, verbose=True,
    search_kwargs={"k": 7}
)

retriever.get_relevant_documents("What are unique section topics that are related to css")

The above code returned repeated documents:

query='css' filter=None limit=None
[Document(page_content='This section is related to CSS for beginners', metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Advanced Styling'}),
 Document(page_content='This section is related to CSS for beginners', metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Typography and Fonts'}),
 Document(page_content='This section is related to CSS for beginners', metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Basic Styling'}),
 Document(page_content='This section is related to CSS for beginners', metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Color and Background'}),
 Document(page_content='This section is related to CSS for beginners', metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Spacing in CSS'}),
 Document(page_content='This section is related to CSS for beginners', metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Basic Styling'}),
 Document(page_content='This section is related to CSS for beginners', metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Advanced Styling'})]

Expected behavior

It should return 7 unique documents

page_content='This section is related to CSS for beginners' metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Basics'}
page_content='This section is related to CSS for beginners' metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Color and Background'}
page_content='This section is related to CSS for beginners' metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Typography and Fonts'}
page_content='This section is related to CSS for beginners' metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Spacing in CSS'}
page_content='This section is related to CSS for beginners' metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Basic Styling'}
page_content='This section is related to CSS for beginners' metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Positioning Techniques'}
page_content='This section is related to CSS for beginners' metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Advanced Styling'}

dosubot[bot] commented 1 year ago

🤖

Hello,

Thank you for providing a detailed description of your issue. From your description, it seems like you're experiencing duplicate documents when using the SelfQueryRetriever with Chroma vector store.

The SelfQueryRetriever retrieves documents relevant to a given query by using a vector store and a language model chain (LLMChain) to generate the vector store queries. The retrieval process is handled by the _get_relevant_documents method. This method uses the LLMChain to predict and parse the structured query, which is then translated into vector store search parameters by the structured_query_translator. The search parameters are then passed to the vector store's search method along with the query and search type to retrieve the relevant documents.

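The issue you're experiencing might be related to how the documents are stored and searched in Chroma. Chroma returns the k stored vectors closest to the query vector. All sections of a roadmap share the same page_content and therefore the same embedding, so they tie on distance, but a tie on its own does not make a single stored document appear twice. Repeated results more likely mean that the same documents were added to the collection more than once, for example by re-running Chroma.from_documents in the same session.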
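If the repeated results come from the same documents being stored more than once (a possibility this thread does not confirm), one way to guard against it is to give every document a deterministic id. The following is a minimal sketch under that assumption: `Doc` is a stand-in for LangChain's Document so that the example runs on its own, and `stable_id` is a hypothetical helper, not part of LangChain or Chroma.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for langchain's Document, used only to keep this sketch runnable."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def stable_id(doc) -> str:
    """Hypothetical helper: derive a deterministic id from content and metadata."""
    payload = json.dumps(
        {"content": doc.page_content, "metadata": doc.metadata},
        sort_keys=True,  # key order must not change the id
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


docs = [
    Doc("This section is related to CSS for beginners",
        {"roadmapName": "CSS for beginners", "sectionTopic": "Basics"}),
    Doc("This section is related to CSS for beginners",
        {"roadmapName": "CSS for beginners", "sectionTopic": "Advanced Styling"}),
]

# Same content and metadata always map to the same id.
ids = [stable_id(d) for d in docs]

# Assumption: the ids would then be passed to the vector store, e.g.
# Chroma.from_documents(docs, embeddings, ids=ids)
```

How a repeated id is handled (rejected, ignored, or overwritten) depends on the Chroma and LangChain versions in use, so this is worth checking against the installed versions before relying on it.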

One way to handle this issue is to add a post-processing step after retrieving the documents to remove duplicates. You can do this by converting the list of documents to a set, which automatically removes duplicates because sets only allow unique elements. Here's how you can do it:

docs = retriever.get_relevant_documents("What are unique section topics that are related to css")
unique_docs = list(set(docs))

Please note that this will only work if the Document class implements the __hash__ and __eq__ methods, which are used by the set to determine uniqueness. If the Document class doesn't implement these methods, you might need to implement them yourself or find another way to remove duplicates.
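If Document turns out not to be hashable in your version, set(docs) raises a TypeError. A possible alternative is to deduplicate on a key built from the content and metadata, which also keeps the original ranking order. This is a minimal sketch, not LangChain's own method: `Doc` is a stand-in for Document so the example is self-contained, `dedupe` is a hypothetical helper, and metadata values are assumed to be hashable (plain strings here).

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for langchain's Document, used only to keep this sketch runnable."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def dedupe(docs):
    """Hypothetical helper: drop repeated documents, keeping first-seen order."""
    seen = set()
    unique = []
    for doc in docs:
        # Content plus sorted metadata items identifies a document.
        key = (doc.page_content, tuple(sorted(doc.metadata.items())))
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


css = "This section is related to CSS for beginners"
advanced = Doc(css, {"roadmapName": "CSS for beginners", "sectionTopic": "Advanced Styling"})
basic = Doc(css, {"roadmapName": "CSS for beginners", "sectionTopic": "Basic Styling"})

# Simulated retriever output containing one repeated document.
results = [advanced, basic, advanced]
unique_docs = dedupe(results)
```

Note that deduplicating after retrieval shrinks the result list, so fewer than k documents may remain.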

I hope this helps! Let me know if you have any other questions.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

gokulstevee commented 1 year ago

@dosu-beta Yes, we can remove the duplicates that way, but I am also facing the issue of irrelevant documents being returned.

For the code:

from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(
        name="sectionTopic",
        description="The section topic of the roadmap",
        type="string",
    ),
    AttributeInfo(
        name="roadmapName",
        description="Name of the roadmap",
        type="string",
    ),
]
document_content_description = "Roadmap section topics"
llm = OpenAI(temperature=0.1)
retriever = SelfQueryRetriever.from_llm(
    llm, vectorstore, document_content_description, metadata_field_info, verbose=True,
    search_kwargs={"k": 20}
)

retriever.get_relevant_documents("What are unique section topics that are related to python")

The above code returns the documents below:

query='python' filter=None limit=None
[Document(page_content='This section is related to Python for Beginers', metadata={'roadmapName': 'Python for Beginers', 'sectionTopic': 'Conditions and Loops'}),
 Document(page_content='This section is related to Python for Beginers', metadata={'roadmapName': 'Python for Beginers', 'sectionTopic': 'Functions'}),
 Document(page_content='This section is related to Python for Beginers', metadata={'roadmapName': 'Python for Beginers', 'sectionTopic': 'Solving Problems'}),
 Document(page_content='This section is related to Python for Beginers', metadata={'roadmapName': 'Python for Beginers', 'sectionTopic': 'Arrays'}),
 Document(page_content='This section is related to Python for Beginers', metadata={'roadmapName': 'Python for Beginers', 'sectionTopic': 'Getting Started'}),
 Document(page_content='This section is related to Python for Beginers', metadata={'roadmapName': 'Python for Beginers', 'sectionTopic': 'Basics'}),
 Document(page_content='This section is related to Google Colab', metadata={'roadmapName': 'Google Colab', 'sectionTopic': 'Section 1'}),
 Document(page_content='This section is related to Java', metadata={'roadmapName': 'Java', 'sectionTopic': 'Section 1'}),
 Document(page_content='This section is related to MongoDB', metadata={'roadmapName': 'MongoDB', 'sectionTopic': 'Data Modeling'}),
 Document(page_content='This section is related to MongoDB', metadata={'roadmapName': 'MongoDB', 'sectionTopic': 'Aggregation'}),
 Document(page_content='This section is related to MongoDB', metadata={'roadmapName': 'MongoDB', 'sectionTopic': 'Queries'}),
 Document(page_content='This section is related to MongoDB', metadata={'roadmapName': 'MongoDB', 'sectionTopic': 'Introduction'}),
 Document(page_content='This section is related to AWS Lambda', metadata={'roadmapName': 'AWS Lambda', 'sectionTopic': 'Section 1'}),
 Document(page_content='This section is related to Docker', metadata={'roadmapName': 'Docker', 'sectionTopic': 'Section 1'}),
 Document(page_content='This section is related to MySQL', metadata={'roadmapName': 'MySQL', 'sectionTopic': 'Section 1'}),
 Document(page_content='This section is related to HTML for beginners', metadata={'roadmapName': 'HTML for beginners', 'sectionTopic': 'Advanced Concepts'}),
 Document(page_content='This section is related to HTML for beginners', metadata={'roadmapName': 'HTML for beginners', 'sectionTopic': 'Intermediate Concepts'}),
 Document(page_content='This section is related to HTML for beginners', metadata={'roadmapName': 'HTML for beginners', 'sectionTopic': 'HTML Basics'}),
 Document(page_content='This section is related to HTML for beginners', metadata={'roadmapName': 'HTML for beginners', 'sectionTopic': 'HTML Tags and Elements'}),
 Document(page_content='This section is related to Bootstrap', metadata={'roadmapName': 'Bootstrap', 'sectionTopic': 'Introduction'})]

I have 6 documents related to Python, but it also returns irrelevant documents about HTML, MongoDB, Java, etc.

Expected output: only the Python documents are returned, out of the 20 requested.

dosubot[bot] commented 1 year ago

🤖

Hello,

Thank you for reaching out and providing a detailed explanation of your issue. I understand that you're trying to retrieve documents relevant to Python, but the SelfQueryRetriever is returning documents related to other topics as well.

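The SelfQueryRetriever class in LangChain does not take a fixed metadata filter as a constructor parameter. Its parameters are vectorstore, llm_chain, search_type, search_kwargs, structured_query_translator, verbose, and use_original_query. Instead, it relies on the LLM to derive a filter from the query text. The log line query='python' filter=None limit=None shows that no filter was generated for your query, so the retriever ran a plain similarity search for k=20 results. Since only 6 Python documents are stored, the other 14 results are simply the next-closest documents.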

However, you can add a filtering step after retrieving the documents. Here's an example of how you can do it:

documents = retriever.get_relevant_documents("What are unique section topics that are related to python")

# Filter documents by roadmapName
filtered_documents = [doc for doc in documents if 'Python' in doc.metadata['roadmapName']]

# Now, filtered_documents only contains documents related to Python

In this example, we're filtering the documents by the roadmapName metadata field. If a document's roadmapName contains 'Python', it's considered relevant; otherwise, it's filtered out.

Please note that this is a simple example and might not cover all cases. You might need to adjust the filtering logic based on your specific requirements.
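One adjustment that may be needed is tolerating documents whose metadata lacks the field, since doc.metadata['roadmapName'] raises a KeyError in that case. Below is a minimal sketch of a more forgiving filter. `Doc` is a stand-in for LangChain's Document so the example runs on its own, and `filter_by_metadata` is a hypothetical helper rather than a LangChain function.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for langchain's Document, used only to keep this sketch runnable."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(docs, field_name, needle):
    """Hypothetical helper: keep docs whose metadata field contains `needle`.

    The match is case-insensitive, and documents missing the field are
    skipped instead of raising KeyError.
    """
    needle = needle.lower()
    return [
        doc for doc in docs
        if needle in str(doc.metadata.get(field_name, "")).lower()
    ]


documents = [
    Doc("This section is related to Python for Beginers",
        {"roadmapName": "Python for Beginers", "sectionTopic": "Basics"}),
    Doc("This section is related to Java",
        {"roadmapName": "Java", "sectionTopic": "Section 1"}),
    Doc("A document without the expected metadata field", {}),
]

python_docs = filter_by_metadata(documents, "roadmapName", "python")
```

Substring matching is a design choice made here for brevity; an exact comparison against the full roadmap name would be stricter.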

I hope this helps! If you have any other questions or need further clarification, feel free to ask.

gokulstevee commented 1 year ago

I tried this solution too, but some of the returned documents don't have the "roadmap" property, while others do. Here is the code where I map the "roadmap" property to all documents:

with open('roadmap.json') as json_file:
  allRoadmap = json.load(json_file)

docs = []

for roadmap in allRoadmap:
  print(roadmap["name"])
  for section in roadmap["sections"]:
    single_doc = Document(
        page_content=f"This section is related to {roadmap['name']}",
        metadata={"roadmapName": roadmap["name"], "sectionTopic": section["name"]}
    )
    docs.append(single_doc)
    print(section["name"])
  print("\n")

vectorstore = Chroma.from_documents(docs, embeddings)

for doc in docs:
  print(doc)

Below are the documents that are embedded in the Chroma vector store:

page_content='This section is related to HTML for beginners' metadata={'roadmapName': 'HTML for beginners', 'sectionTopic': 'HTML Basics'}
page_content='This section is related to HTML for beginners' metadata={'roadmapName': 'HTML for beginners', 'sectionTopic': 'HTML Tags and Elements'}
page_content='This section is related to HTML for beginners' metadata={'roadmapName': 'HTML for beginners', 'sectionTopic': 'Intermediate Concepts'}
page_content='This section is related to HTML for beginners' metadata={'roadmapName': 'HTML for beginners', 'sectionTopic': 'Advanced Concepts'}
page_content='This section is related to How Search Engine Works' metadata={'roadmapName': 'How Search Engine Works', 'sectionTopic': 'Internet'}
page_content='This section is related to How Search Engine Works' metadata={'roadmapName': 'How Search Engine Works', 'sectionTopic': 'Search Engine'}
page_content='This section is related to Github Roadmap' metadata={'roadmapName': 'Github Roadmap', 'sectionTopic': 'Section 1'}
page_content='This section is related to Bootstrap' metadata={'roadmapName': 'Bootstrap', 'sectionTopic': 'Introduction'}
page_content='This section is related to Bootstrap' metadata={'roadmapName': 'Bootstrap', 'sectionTopic': 'Concepts'}
page_content='This section is related to Bootstrap' metadata={'roadmapName': 'Bootstrap', 'sectionTopic': 'Hands On'}
page_content='This section is related to MongoDB' metadata={'roadmapName': 'MongoDB', 'sectionTopic': 'Introduction'}
page_content='This section is related to MongoDB' metadata={'roadmapName': 'MongoDB', 'sectionTopic': 'Queries'}
page_content='This section is related to MongoDB' metadata={'roadmapName': 'MongoDB', 'sectionTopic': 'Data Modeling'}
page_content='This section is related to MongoDB' metadata={'roadmapName': 'MongoDB', 'sectionTopic': 'Aggregation'}
page_content='This section is related to Python for Beginers' metadata={'roadmapName': 'Python for Beginers', 'sectionTopic': 'Getting Started'}
page_content='This section is related to Python for Beginers' metadata={'roadmapName': 'Python for Beginers', 'sectionTopic': 'Basics'}
page_content='This section is related to Python for Beginers' metadata={'roadmapName': 'Python for Beginers', 'sectionTopic': 'Conditions and Loops'}
page_content='This section is related to Python for Beginers' metadata={'roadmapName': 'Python for Beginers', 'sectionTopic': 'Arrays'}
page_content='This section is related to Python for Beginers' metadata={'roadmapName': 'Python for Beginers', 'sectionTopic': 'Functions'}
page_content='This section is related to Python for Beginers' metadata={'roadmapName': 'Python for Beginers', 'sectionTopic': 'Solving Problems'}
page_content='This section is related to CSS for beginners' metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Basics'}
page_content='This section is related to CSS for beginners' metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Color and Background'}
page_content='This section is related to CSS for beginners' metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Typography and Fonts'}
page_content='This section is related to CSS for beginners' metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Spacing in CSS'}
page_content='This section is related to CSS for beginners' metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Basic Styling'}
page_content='This section is related to CSS for beginners' metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Positioning Techniques'}
page_content='This section is related to CSS for beginners' metadata={'roadmapName': 'CSS for beginners', 'sectionTopic': 'Advanced Styling'}
page_content='This section is related to Google Colab' metadata={'roadmapName': 'Google Colab', 'sectionTopic': 'Section 1'}
page_content='This section is related to MySQL' metadata={'roadmapName': 'MySQL', 'sectionTopic': 'Section 1'}
page_content='This section is related to Docker' metadata={'roadmapName': 'Docker', 'sectionTopic': 'Section 1'}
page_content='This section is related to AWS Lambda' metadata={'roadmapName': 'AWS Lambda', 'sectionTopic': 'Section 1'}
page_content='This section is related to Java' metadata={'roadmapName': 'Java', 'sectionTopic': 'Section 1'}

dosubot[bot] commented 11 months ago

Hi, @gokulstevee! I'm Dosu, and I'm helping the LangChain team manage our backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue you reported was about the SelfQueryRetriever returning duplicate documents when used with chromaDB. I acknowledged the issue and provided a solution to remove duplicates by converting the list of documents to a set. However, you also mentioned that you are facing the issue of returning irrelevant documents. I suggested adding a filtering step after retrieving the documents to remove irrelevant ones, but you mentioned that some documents don't have the required metadata field for filtering.

If this issue is still relevant to the latest version of the LangChain repository, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository!

khushalpatel1981 commented 7 months ago

Can we have some kind of "DISTINCT" keyword functionality, like in SQL?