langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

The RetrievalQAWithSourcesChain doesn't return SOURCES. #10575

Closed: portswigger-katie closed this issue 4 months ago

portswigger-katie commented 9 months ago

System Info

I have a question-and-answer-over-docs chatbot application that uses RetrievalQAWithSourcesChain and ChatPromptTemplate. In LangChain version 0.0.238 it used to return sources, but this seems to be broken in the releases since then.

Python version: 3.11.4
LangChain version: 0.0.287

Example response with missing sources:

Entering new RetrievalQAWithSourcesChain chain...

Finished chain. {'question': 'what is sql injection', 'answer': 'SQL injection is a web security vulnerability that allows an attacker to interfere with the queries that an application makes to its database. By manipulating the input data, an attacker can execute their own malicious SQL queries, which can lead to unauthorized access, data theft, or modification of the database. This vulnerability can be exploited to view sensitive data, modify or delete data, or even take control of the database server. SQL injection is a serious issue that can result in high-profile data breaches and compromises of user accounts. It is important for developers to implement proper input validation and parameterized queries to prevent SQL injection attacks.\n\n', 'sources': ''}

Who can help?

No response


Reproduction

import pickle
import gradio as gr
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chat_models import PromptLayerChatOpenAI
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)

pkl_file_path = "faiss_store.pkl"

event = {"question": "what is sql injection"}

system_template = """
Use the provided articles delimited by triple quotes to answer questions. If the answer cannot be found in the articles, write "I could not find an answer."
If you don't know the answer, just say "Hmm..., I'm not sure.", don't try to make up an answer.
ALWAYS return a "SOURCES" part in your answer.
The "SOURCES" part should be a reference to the source of the document from which you got your answer.

Example of your response should be:

The answer is foo
SOURCES:
1. abc
2. xyz

Begin!
----------------
{summaries}
"""

def get_chain(store: FAISS, prompt_template: ChatPromptTemplate):
    return RetrievalQAWithSourcesChain.from_chain_type(
        PromptLayerChatOpenAI(
            pl_tags=["burpbot"],
            temperature=0,
        ),
        chain_type="stuff",
        retriever=store.as_retriever(),
        chain_type_kwargs={"prompt": prompt_template},
        reduce_k_below_max_tokens=True,
        verbose=True,
    )

def create_prompt_template() -> ChatPromptTemplate:
    return ChatPromptTemplate.from_messages(
        [
            SystemMessagePromptTemplate.from_template(system_template),
            HumanMessagePromptTemplate.from_template("{question}"),
        ]
    )

def load_remote_faiss_store() -> FAISS:
    with open(pkl_file_path, "rb") as f:
        return pickle.load(f)

def main() -> dict:
    prompt_template = create_prompt_template()
    store: FAISS = load_remote_faiss_store()
    chain = get_chain(store, prompt_template)
    result = chain(event)
    print(result)
    return result

if __name__ == "__main__":
    main()

Expected behavior

expected output:

{'question': 'what is sql injection', 'answer': 'SQL injection is a web security vulnerability that allows an attacker to interfere with the queries that an application makes to its database. By manipulating the input data, an attacker can execute their own malicious SQL queries, which can lead to unauthorized access, data theft, or modification of the database. This vulnerability can be exploited to view sensitive data, modify or delete data, or even take control of the database server. SQL injection is a serious issue that can result in high-profile data breaches and compromises of user accounts. It is important for developers to implement proper input validation and parameterized queries to prevent SQL injection attacks.\n\n', 'sources': 'https://example.net/web-security/sql-injection'}

dosubot[bot] commented 9 months ago

🤖

Hello,

Thank you for bringing this to our attention. Based on the context provided, it seems like the RetrievalQAWithSourcesChain is designed to separate the answer from the sources. This is done by the _split_sources(text) method, which takes a text as input and returns two outputs: the answer and the sources. The sources are not included in the answer.

Here is the relevant code from the unit test test_spliting_answer_into_answer_and_sources:

def test_spliting_answer_into_answer_and_sources(
    text: str, answer: str, sources: str
) -> None:
    qa_chain = QAWithSourcesChain.from_llm(FakeLLM())
    generated_answer, generated_sources = qa_chain._split_sources(text)
    assert generated_answer == answer
    assert generated_sources == sources

In this code, qa_chain._split_sources(text) is called with a text containing an answer and sources. The method returns generated_answer and generated_sources, which are then compared with the expected answer and sources respectively. The test passes if the generated answer and sources match the expected answer and sources, indicating that the method correctly separates the answer from the sources.

You can find this test in the libs/langchain/tests/unit_tests/chains/test_qa_with_sources.py file.
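For intuition, here is a rough pure-Python sketch of the kind of split that method performs (an illustration only, not the library's actual implementation): look for a "SOURCES:" marker in the model output and separate everything after it from the answer body.

```python
import re

def split_sources(text: str) -> tuple[str, str]:
    """Sketch only: separate an LLM answer from a trailing SOURCES section."""
    match = re.search(r"SOURCES?:\s*", text)
    if match:
        answer = text[: match.start()].rstrip()
        sources = text[match.end() :].strip()
        return answer, sources
    # If the model never emits the marker, sources come back empty.
    return text.strip(), ""
```

Under this sketch, a completion that omits the marker (for example because a custom prompt changed the instructions the model follows) yields `sources == ''`, which matches the empty 'sources' field in the output reported above.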

However, I wasn't able to find specific changes made to the RetrievalQAWithSourcesChain in the LangChain updates after version 0.0.238 that could have affected the return of sources.

As a workaround, you might want to try setting the return_source_documents parameter to True when creating the ConversationalRetrievalChain as suggested in a similar issue:

chain = ConversationalRetrievalChain(
    retriever=vectorstore.as_retriever(),
    question_generator=question_generator,
    combine_docs_chain=doc_chain,
    return_source_documents=True
)

Then, you can access the source documents from the result:

print(result["source_documents"][0])

I hope this helps! If you have any other questions or need further clarification, please let me know.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

duri0214 commented 9 months ago

LangChain 0.0.297. Thank you to all the LangChain developers. Wow, the same problem I'm having is already on the issue tracker. When I use return_source_documents=True, the metadata is returned, so for now it seems like a good idea to extract the sources from it with a comprehension.

        chain = RetrievalQAWithSourcesChain.from_chain_type(llm=llm,
                                                            chain_type="stuff",
                                                            reduce_k_below_max_tokens=True,
+                                                           return_source_documents=True,
                                                            retriever=docsearch.as_retriever(),
                                                            chain_type_kwargs={"prompt": self.prompt_template})
{'question': '晩婚化について教えて', 'answer': '晩婚化については...います。\n\n(参考資料: 令和4年版少子化社会対策白書全体版(PDF版).pdf 1ページ、4ページ)', 'sources': '', 'source_documents': [Document(page_content='晩婚化....晩', metadata={'source': '令和4年版少子化社会対策白書全体版(PDF版).pdf 4ページ'}), Document(page_content='1\u300....:22:22', metadata={'source': '令和4年版少子化社会対策白書全体版(PDF版).pdf 3ページ'}), Document(page_content='年齢(5...対策白書2', metadata={'source': '令和4年版少子化社会対策白書全体版(PDF版).pdf 5ページ'}), Document(page_content='未婚....22:22', metadata={'source': '令和4年版少子化社会対策白書全体版(PDF版).pdf 2ページ'})]}
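duri0214's comprehension idea can be sketched as follows, with a minimal stand-in for langchain's Document class and made-up source names (both are hypothetical, not taken from the run above):

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Minimal stand-in for langchain's Document, for illustration only."""
    page_content: str
    metadata: dict = field(default_factory=dict)

result = {
    "answer": "…",
    "sources": "",  # empty, as in the bug report
    "source_documents": [
        Document("…", {"source": "whitepaper.pdf p.4"}),
        Document("…", {"source": "whitepaper.pdf p.3"}),
        Document("…", {"source": "whitepaper.pdf p.4"}),  # duplicate hit
    ],
}

# Pull the sources out of the metadata, de-duplicating
# while preserving retrieval order.
sources = list(dict.fromkeys(
    doc.metadata["source"] for doc in result["source_documents"]
))
print(sources)  # → ['whitepaper.pdf p.4', 'whitepaper.pdf p.3']
```

This sidesteps the empty 'sources' string entirely by reading the citations from the retrieved documents themselves.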

portswigger-katie commented 8 months ago

I got the same issue and I'm using LangChain version 0.0.325.

If you need to, you can fix the problem in your local code base by patching this line in libs/langchain/langchain/chains/qa_with_sources/base.py. That's what I've been doing while waiting for #12556 to be merged into a release.

SuperHao-Wu commented 8 months ago

You need to make sure 'source' appears in the metadata of the documents in your vector store.
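To illustrate why that matters: as I understand it, the "stuff" variant of the qa_with_sources chain renders each retrieved document into the prompt's {summaries} slot using a document template along the lines of "Content: {page_content}\nSource: {source}" (paraphrased from memory; check your installed version), so a document whose metadata lacks a "source" key gives the model nothing to cite. A pure-Python sketch of that rendering:

```python
# Hypothetical documents: one ingested with a "source" key, one without.
docs = [
    {"page_content": "SQL injection is a web security vulnerability …",
     "metadata": {"source": "https://example.net/web-security/sql-injection"}},
    {"page_content": "A document ingested without metadata …",
     "metadata": {}},  # no "source" key: nothing for the model to cite
]

# Join the documents the way a stuff-style chain might, appending each
# document's source after its content.
summaries = "\n\n".join(
    "Content: {}\nSource: {}".format(
        d["page_content"], d["metadata"].get("source", "(missing)")
    )
    for d in docs
)
print(summaries)
```

If every retrieved document renders with "Source: (missing)", the model cannot produce a SOURCES section no matter how the system prompt is worded.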

dosubot[bot] commented 4 months ago

Hi, @portswigger-katie,

I'm helping the LangChain team manage their backlog and am marking this issue as stale. From what I understand, the issue "RetrievalQAWithSourcesChain not returning sources as expected" was observed in langchain version 0.0.287, where the sources were missing from the output. There were detailed explanations provided by me and suggestions for a workaround using the return_source_documents parameter. Additionally, duri0214 encountered the same problem and shared their approach of using return_source_documents=True. It seems that the issue has been resolved by using the return_source_documents parameter to retrieve the expected sources in the output, and a local fix was used while waiting for a specific pull request to be merged. SuperHao-Wu also advised ensuring that 'source' appears in the metadata of the vector store.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days. Thank you!