langchain-ai / langchain

šŸ¦œšŸ”— Build context-aware reasoning applications
https://python.langchain.com
MIT License
93.07k stars · 14.96k forks

Out of dataset answer and reference link provided for RAG example #8986

Closed · aiquick closed this issue 10 months ago

aiquick commented 1 year ago

System Info

Ubuntu 22 Langchain version

Who can help?

No response

Information

Related Components

Reproduction

Current behavior:

I'm using the RAG example and feeding it my own database containing a single football article. The Pinecone DB is brand new and only contains vectors from that football article.

When I run qa_with_sources(query="Who is Sachin Tendulkar"), it returns an answer along with a reference link. This is not the expected behavior.

I have not fed any article about Sachin Tendulkar to the database. How and why/where from is it getting the answer and the link?

Now, if I add more articles (still only about football), pushing the vector count in the database to around 90, and then ask the same question, query="Who is Sachin Tendulkar", it no longer gives an answer, which is the expected behavior.

I wonder whether a fuller vector DB makes retrieval more accurate. Has anyone else seen this?

Repro:

Create a new vector DB on Pinecone. Use this example to feed in a football article.

Run query="Who is Sachin Tendulkar". Note that the result contains an answer and a reference link. (Unexpected.)

Now create a fuller DB with more articles and run the same query. Note that the result is empty, as expected.

Expected behavior

Since the database does not contain any article about, or mention of, Sachin Tendulkar, the chain should not provide an answer; instead it should respond with something like "This is not mentioned in the database".
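One common way to get that refusal behavior is through the prompt rather than the retriever. The sketch below is illustrative only: the wording, constant name, and helper function are not part of LangChain, just a minimal stdlib demonstration of the idea.

```python
# Minimal sketch: a prompt that instructs the model to refuse when the
# retrieved context does not contain the answer. Wording is illustrative.
REFUSAL_PROMPT = (
    "Answer the question using ONLY the context below.\n"
    "If the answer is not in the context, reply exactly:\n"
    '"This is not mentioned in the database."\n\n'
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

def build_prompt(context: str, question: str) -> str:
    # Fill the template with the retrieved context and the user's question.
    return REFUSAL_PROMPT.format(context=context, question=question)

print(build_prompt("...football article text...", "Who is Sachin Tendulkar"))
```

In LangChain, a template along these lines can typically be wrapped in a PromptTemplate and passed to the chain (e.g. via chain_type_kwargs={"prompt": ...} on RetrievalQA); check the API for your installed version.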

dosubot[bot] commented 1 year ago

Answer generated by a šŸ¤–

Answer

Hello,

Thank you for bringing this to our attention. The behavior you're observing is likely due to the nature of the RAG (Retrieval-Augmented Generation) model used in the LangChain framework.

RAG models are designed to retrieve the documents most similar to the query from a database and pass them to a language model to generate a response. Crucially, retrieval is top-k nearest-neighbor search: it always returns the k closest vectors, however dissimilar they are in absolute terms. The language model also carries knowledge from its pretraining, so for a well-known figure like Sachin Tendulkar it can produce an answer even when the retrieved context (your football article) says nothing about him.

As for the reference link, qa_with_sources builds its sources from the metadata of the retrieved documents, so the link most likely points to the football article that was retrieved as the nearest (if irrelevant) neighbor, rather than to anything actually about the query.

When you add more articles to the database, the model has more information to work with and can make more accurate retrievals. This could explain why you're not seeing an answer for the "Sachin Tendulkar" query when your database is more populated.

As for the expected behavior, it's indeed reasonable to expect that the model would not provide an answer if the query is not mentioned in the database. However, RAG models are designed to generate responses even when the exact query is not present in the database. They do this by leveraging the semantic similarity between the query and the documents in the database.
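That nearest-neighbor behavior can be seen in a toy example: plain top-k similarity search always returns the closest vectors no matter how low the best score is. The three-dimensional "embeddings" below are made up purely for illustration; real embedding models use hundreds of dimensions.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Made-up 3-d "embeddings" standing in for the football corpus.
docs = {
    "football_article": [0.9, 0.1, 0.0],
    "match_report":     [0.8, 0.2, 0.1],
}
query = [0.1, 0.2, 0.9]  # deliberately unrelated to the corpus

# Plain top-k retrieval: rank by similarity and take the best match,
# regardless of how weak that best match is in absolute terms.
best = max(docs, key=lambda name: cosine(query, docs[name]))
print(best, round(cosine(query, docs[best]), 3))  # a doc is returned even at ~0.27 similarity
```

Whatever is retrieved this way then lands in the prompt as "context", and its metadata becomes the cited source.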

To better control the behavior of the model, you can tighten the retrieval step. For example, apply a similarity score threshold so that only documents above a minimum relevance are retrieved at all; when nothing clears the threshold, the chain has no context to answer from and nothing to cite. You can also adjust the generation side, for instance by prompting the model to decline explicitly when the context does not contain the answer.
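In LangChain this maps to something like vectorstore.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.5}); verify the exact parameters against your installed version. The threshold's effect can be sketched without any external services. The function name, the toy 3-d vectors, and the 0.5 cutoff below are all illustrative; a real threshold must be tuned against your embedding model's score distribution.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def retrieve(query_vec, doc_vecs, k=4, score_threshold=0.5):
    # Rank documents by similarity, then DROP anything below the threshold,
    # so an off-topic query can legitimately come back with zero results.
    scored = sorted(((cosine(query_vec, v), name) for name, v in doc_vecs.items()),
                    reverse=True)
    return [(name, round(s, 3)) for s, name in scored[:k] if s >= score_threshold]

# Made-up 3-d "embeddings" standing in for the football corpus.
docs = {"football_article": [0.9, 0.1, 0.0], "match_report": [0.8, 0.2, 0.1]}

on_topic  = retrieve([0.85, 0.15, 0.05], docs)  # close to the corpus
off_topic = retrieve([0.1, 0.2, 0.9], docs)     # the Tendulkar-style query
print(on_topic)   # both documents pass the threshold
print(off_topic)  # [] -> nothing to answer from, nothing to cite
```

With an empty retrieval result, a well-prompted chain has no source link to return, which is exactly the refusal behavior the reporter expected.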

I hope this helps clarify the behavior you're observing. If you have any further questions, please don't hesitate to ask.

Best regards, Dosu

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.


dosubot[bot] commented 11 months ago

Hi, @aiquick! I'm Dosu, and I'm helping the LangChain team manage our backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, you were experiencing unexpected behavior when using the RAG example with a custom database. The system was providing an answer and reference link even though the database did not contain any information about the query. I explained that this behavior is due to the nature of the RAG model used in the LangChain framework, which retrieves relevant documents from the database and generates responses based on them. I suggested that adjusting the retrieval and generation parameters can help control the behavior of the model.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!
