Closed: startakovsky closed this issue 1 year ago.
Hi, @startakovsky! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, the issue you reported is regarding the RetrievalQAWithSourcesChain
component in the langchain
library. The issue was about the component not accurately providing sources for retrieved documents, which caused confusion about which documents were used to generate answers. The author recommended updating the documentation and adding warnings to address this issue.
It seems that there hasn't been any further activity or updates on this issue. Therefore, I wanted to check with you if this issue is still relevant to the latest version of the LangChain repository. If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.
Thank you for your understanding and contribution to the LangChain project. If you have any further questions or concerns, please let me know.
System Info
`langchain.__version__` is `0.0.184`
Who can help?
@hwchase17
Summary
The sources component of the output of `RetrievalQAWithSourcesChain` does not provide transparency into the documents the retriever actually returns; instead it is output that the LLM contrives.

Motivation
From my perspective, the primary advantage of having visibility into sources is to allow the system to provide transparency into the documents that were retrieved in assisting the language model to generate its answer. Only after being confused for quite a while and inspecting the code did I realize that the sources were just being conjured up.
Advice
I think it is important to ensure that people know about this. Maybe this isn't a bug and is more of a documentation issue, but either way the docstring should be updated as well.
Notes
Document retrieval works very well. It's worth noting that in this toy example, the combination of the `FAISS` vector store and the `OpenAIEmbeddings` model behaves very reasonably and is deterministic.

Recommendation
Add caveats everywhere. Frankly, I would never trust this chain. I recently had an example where it made up a source and a Wikipedia URL that had absolutely nothing to do with the documents retrieved. I could supply that example, as it is a far better illustration of how this chain will hallucinate sources (because they are generated by the LLM), but it is a little more involved than this smaller example.
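As a defensive workaround, one can ignore the LLM-generated `sources` string and cross-check it against the metadata of the documents the retriever actually returned. The sketch below is hypothetical and uses plain dicts in place of langchain `Document` objects; `audit_sources` is an illustrative helper, not part of the library:

```python
# Hypothetical sketch: cross-check the LLM-generated "sources" string
# against the metadata of the documents the retriever actually returned.
# `result` mimics the dict returned by RetrievalQAWithSourcesChain when
# return_source_documents=True; all values here are illustrative only.

def audit_sources(result):
    """Return (claimed, retrieved, hallucinated) source sets."""
    claimed = {s.strip() for s in result["sources"].split(",") if s.strip()}
    retrieved = {doc["metadata"]["source"] for doc in result["source_documents"]}
    hallucinated = claimed - retrieved  # sources the LLM named but never saw
    return claimed, retrieved, hallucinated

# Example mirroring the issue: the LLM names a URL that was never retrieved.
result = {
    "answer": "Some answer.",
    "sources": "source_a, https://en.wikipedia.org/wiki/Made_up",
    "source_documents": [
        {"page_content": "...", "metadata": {"source": "source_a"}},
        {"page_content": "...", "metadata": {"source": "source_b"}},
    ],
}

claimed, retrieved, hallucinated = audit_sources(result)
print(hallucinated)  # the Wikipedia URL was conjured by the LLM
```

Anything in `hallucinated` is a source the LLM claimed without the retriever ever supplying it, which is exactly the failure mode described above.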
Information
Related Components
Reproduction
Demonstrative Example
Here's the simplest example I could come up with:
1. Instantiate a `vectorstore` with the 7 documents displayed below.
2. Instantiate a `RetrievalQAWithSourcesChain`. The `return_source_documents` option is set to `True` so that we can inspect the actual sources retrieved.
3. Example Question. Things look sort of fine, meaning 5 documents are retrieved by the `retriever`, but the model lists only a single source.
4. Second Example Question containing the First Question.
This is not what I would expect, considering that this question contains the previous question and that the vector store did supply the document with `{'source': 'source_a'}`, yet for some reason (i.e. the internals of the output of `OpenAI()`) the chain's response lists zero sources.

Expected behavior
I am not sure. Perhaps we need a warning every time this chain is used, or some strongly worded documentation for developers.
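Beyond warnings, the behavior I would expect can be sketched: build the `sources` field deterministically from the retriever output rather than asking the LLM to emit it. This is a hypothetical illustration in plain Python (dicts stand in for langchain `Document` objects, and `answer_with_trusted_sources` is an invented helper), not a proposed patch to the chain:

```python
# Hypothetical sketch: derive the "sources" field directly from the
# retrieved documents' metadata instead of trusting the LLM to report it.
# Plain dicts stand in for langchain Document objects.

def answer_with_trusted_sources(llm_answer, retrieved_docs):
    """Attach sources taken verbatim from retriever metadata."""
    sources = sorted({d["metadata"]["source"] for d in retrieved_docs})
    return {"answer": llm_answer, "sources": ", ".join(sources)}

docs = [
    {"page_content": "...", "metadata": {"source": "source_a"}},
    {"page_content": "...", "metadata": {"source": "source_c"}},
]
print(answer_with_trusted_sources("Some answer.", docs))
# {'answer': 'Some answer.', 'sources': 'source_a, source_c'}
```

With sources assembled this way, they cannot disagree with `source_documents`, because they are the same data.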