langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

RetrievalQAWithSourcesChain provides unreliable sources #5642

Closed startakovsky closed 1 year ago

startakovsky commented 1 year ago

System Info

Who can help?

@hwchase17

Summary

The sources component of the output of RetrievalQAWithSourcesChain does not reflect the documents the retriever actually returns; it is instead text that the LLM contrives.

Motivation

From my perspective, the primary advantage of visibility into sources is transparency: showing which retrieved documents assisted the language model in generating its answer. Only after being confused for quite a while and inspecting the code did I realize that the sources were simply being conjured up by the LLM.

Advice

I think it is important to ensure that people know about this. Perhaps this isn't a bug and is more of a documentation issue, but either way I think the docstring should be updated.

Notes

Document retrieval works very well.

It's worth noting that in this toy example, the combination of the FAISS vector store and the OpenAIEmbeddings embeddings model performs very reasonably and is deterministic.

Recommendation

Add caveats everywhere. Frankly, I would never trust this chain. Just the other day I had an example where it made up a source and a Wikipedia URL that had absolutely nothing to do with the documents retrieved. I could supply that example, since it is a far better illustration of how this chain hallucinates sources (they are generated by the LLM), but it is a bit more involved than the smaller example below.

Reproduction

Demonstrative Example

Here's the simplest example I could come up with:

1. Instantiate a vectorstore with 7 documents displayed below.

>>> from langchain.vectorstores import FAISS
>>> from langchain.embeddings import OpenAIEmbeddings
>>> from langchain.llms import OpenAI
>>> from langchain.chains import RetrievalQAWithSourcesChain

>>> chars = ['a', 'b', 'c', 'd', '1', '2', '3']
>>> texts = [4*c for c in chars]
>>> metadatas = [{'title': c, 'source': f'source_{c}'} for c in chars]

>>> vs = FAISS.from_texts(texts, embedding=OpenAIEmbeddings(), metadatas=metadatas)
>>> retriever = vs.as_retriever(search_kwargs=dict(k=5))
>>> vs.docstore._dict
{'0ec43ce4-6753-4dac-b72a-6cf9decb290e': Document(page_content='aaaa', metadata={'title': 'a', 'source': 'source_a'}),
 '54baed0b-690a-4ffc-bb1e-707eed7da5a1': Document(page_content='bbbb', metadata={'title': 'b', 'source': 'source_b'}),
 '85b834fa-14e1-4b20-9912-fa63fb7f0e50': Document(page_content='cccc', metadata={'title': 'c', 'source': 'source_c'}),
 '06c0cfd0-21a2-4e0c-9c2e-dd624b5164fe': Document(page_content='dddd', metadata={'title': 'd', 'source': 'source_d'}),
 '94d6444f-96cd-4d88-8973-c3c0b9bf0c78': Document(page_content='1111', metadata={'title': '1', 'source': 'source_1'}),
 'ec04b042-a4eb-4570-9ee9-a2a0bd66a82e': Document(page_content='2222', metadata={'title': '2', 'source': 'source_2'}),
 '0031d3fc-f291-481e-a12a-9cc6ed9761e0': Document(page_content='3333', metadata={'title': '3', 'source': 'source_3'})}

2. Instantiate a RetrievalQAWithSourcesChain

return_source_documents is set to True so that we can inspect the documents actually retrieved.

>>> qa_sources = RetrievalQAWithSourcesChain.from_chain_type(
...     OpenAI(),
...     retriever=retriever,
...     return_source_documents=True,
... )

3. Example Question

Things look sort of fine, meaning 5 documents are retrieved by the retriever, but the model lists only a single source.

>>> qa_sources('what is the first lower-case letter of the alphabet?')
{'question': 'what is the first lower-case letter of the alphabet?',
 'answer': ' The first lower-case letter of the alphabet is "a".\n',
 'sources': 'source_a',
 'source_documents': [Document(page_content='bbbb', metadata={'title': 'b', 'source': 'source_b'}),
  Document(page_content='aaaa', metadata={'title': 'a', 'source': 'source_a'}),
  Document(page_content='cccc', metadata={'title': 'c', 'source': 'source_c'}),
  Document(page_content='dddd', metadata={'title': 'd', 'source': 'source_d'}),
  Document(page_content='1111', metadata={'title': '1', 'source': 'source_1'})]}
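
Whatever the LLM writes into sources, the ground truth is recoverable from source_documents. Here is a pure-Python sketch over the output shown above, using a minimal stand-in for langchain's Document so it runs without a vector store or API key:

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Minimal stand-in for langchain's Document."""
    page_content: str
    metadata: dict


# The five documents the retriever returned in the example above.
source_documents = [
    Doc(4 * c, {"title": c, "source": f"source_{c}"}) for c in "bacd1"
]

# Sources actually retrieved, independent of anything the LLM wrote.
actual_sources = [d.metadata["source"] for d in source_documents]
print(actual_sources)
# ['source_b', 'source_a', 'source_c', 'source_d', 'source_1']
```

Reading the metadata directly like this is what I mean by transparency: the LLM-generated 'source_a' in the answer happens to be in this list, but nothing in the chain guarantees that.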

4. Second Example Question containing the First Question.

This is not what I would expect: the question contains the previous question, and the vector store did supply the document with {'source': 'source_a'}, yet for some reason (i.e. the internals of the OpenAI() output) this response from the chain lists zero sources.

>>> qa_sources('what is the one and only first lower-case letter and number of the alphabet and whole number system?')
{'question': 'what is the one and only first lower-case letter and number of the alphabet and whole number system?',
 'answer': ' The one and only first lower-case letter and number of the alphabet and whole number system is "a1".\n',
 'sources': 'N/A',
 'source_documents': [Document(page_content='1111', metadata={'title': '1', 'source': 'source_1'}),
  Document(page_content='bbbb', metadata={'title': 'b', 'source': 'source_b'}),
  Document(page_content='aaaa', metadata={'title': 'a', 'source': 'source_a'}),
  Document(page_content='2222', metadata={'title': '2', 'source': 'source_2'}),
  Document(page_content='cccc', metadata={'title': 'c', 'source': 'source_c'})]}
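
Since return_source_documents=True already exposes the ground truth, a cross-check between the LLM-reported sources string and the retrieved documents' sources can flag contrived output. A minimal pure-Python sketch; the comma-separated format of the sources string and the function name are my assumptions, not langchain API:

```python
def check_sources(reported: str, retrieved_sources: set) -> dict:
    """Split the LLM's free-text 'sources' field (assumed comma-separated)
    and compare it against the sources actually retrieved."""
    names = {s.strip() for s in reported.split(",") if s.strip()}
    return {
        "verified": names & retrieved_sources,       # reported and retrieved
        "hallucinated": names - retrieved_sources,   # reported, never retrieved
        "unreported": retrieved_sources - names,     # retrieved, not credited
    }


# First example: 'source_a' was reported and really was retrieved.
print(check_sources(
    "source_a",
    {"source_b", "source_a", "source_c", "source_d", "source_1"},
))

# Second example: 'N/A' matches nothing the retriever returned.
print(check_sources(
    "N/A",
    {"source_1", "source_b", "source_a", "source_2", "source_c"},
))
```

In the first case 'source_a' lands in 'verified'; in the second, 'N/A' lands in 'hallucinated', which is exactly the failure mode this issue describes.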

Expected behavior

I am not sure. Perhaps a warning every time this chain is used, or some strongly worded documentation for developers.
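
One hedged shape such a warning could take; the wrapper below is hypothetical and not part of langchain, it simply compares the LLM-reported sources string against the retrieved documents before returning the result:

```python
import warnings
from types import SimpleNamespace as NS


def qa_with_verified_sources(chain_call, question: str) -> dict:
    """Hypothetical guard: run a RetrievalQAWithSourcesChain-style callable
    and warn when its 'sources' field names anything the retriever
    never returned."""
    result = chain_call(question)
    retrieved = {d.metadata.get("source") for d in result["source_documents"]}
    reported = {s.strip() for s in result["sources"].split(",") if s.strip()}
    if not reported <= retrieved:
        warnings.warn(
            "RetrievalQAWithSourcesChain: 'sources' is generated by the LLM "
            f"and includes {sorted(reported - retrieved)}, which the "
            "retriever never returned.",
            UserWarning,
        )
    return result


# Demo with a faked chain result mirroring the second example above.
fake_result = {
    "sources": "N/A",
    "source_documents": [NS(metadata={"source": f"source_{c}"}) for c in "1ba2c"],
}
out = qa_with_verified_sources(lambda q: fake_result, "demo question")
# emits a UserWarning because 'N/A' was never retrieved
```

A real fix would presumably live inside the chain itself, but even a bolt-on check like this would have saved me the confusion.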

dosubot[bot] commented 1 year ago

Hi, @startakovsky! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue you reported is regarding the RetrievalQAWithSourcesChain component in the langchain library. The issue was about the component not accurately providing sources for retrieved documents, which caused confusion about which documents were used to generate answers. The author recommended updating the documentation and adding warnings to address this issue.

It seems that there hasn't been any further activity or updates on this issue. Therefore, I wanted to check with you if this issue is still relevant to the latest version of the LangChain repository. If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project. If you have any further questions or concerns, please let me know.