langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Getting unwanted/irrelevant/out of context results in document question answering #7702

Closed pradeepdev-1995 closed 11 months ago

pradeepdev-1995 commented 1 year ago

System Info

langchain==0.0.219, Python 3.9

Who can help?

No response

Information

Related Components

Reproduction

from langchain.document_loaders import DirectoryLoader
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import AzureOpenAI
import os

# Azure OpenAI LLM (credentials read from the environment)
llm = AzureOpenAI(
    openai_api_base=os.getenv("OPENAI_API_BASE"),
    openai_api_version="version",
    deployment_name="deployment name",
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    openai_api_type="azure",
)

directory = '/Data'

def load_docs(directory):
    # Load every file in the directory as a Document
    loader = DirectoryLoader(directory)
    return loader.load()

documents = load_docs(directory)

def split_docs(documents, chunk_size=1000, chunk_overlap=20):
    # Split documents into overlapping chunks for embedding
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    return text_splitter.split_documents(documents)

docs = split_docs(documents)

# Embed the chunks and index them in a FAISS vector store
embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
vector_store = FAISS.from_documents(docs, embeddings)

# Retrieval QA chain that "stuffs" the retrieved chunks into the prompt
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(),
    return_source_documents=True,
)

while True:
    query = input("Input your question\n")
    result = chain(query)
    print("Answer:\n")
    print(result['answer'])

I tried the above code, which is based on the Retrieval Augmented Generation pipeline, with different vector stores (Chroma, Pinecone, FAISS, Weaviate, etc.), different embedding methods (OpenAI embeddings, HuggingFace embeddings, SentenceTransformer embeddings, etc.), and different LLMs (OpenAI, Azure OpenAI, Cohere, HuggingFace models, etc.).

But in all of these cases I sometimes observe major/critical misbehaviors:

1 - When I ask questions related to the document I provided (the PDF that was embedded and stored in the vector store), I sometimes get the expected answers from the document - this is the behavior that should occur every time.

2 - But sometimes, when I ask questions related to the document, I get answers from outside the document.

3 - Sometimes, when I ask questions related to the document, I get the correct answer from the document mixed with answers from the outside world.

4 - If I ask questions that are not related to the document, I still get answers from the outside world (I expect a response such as "I don't know, the question is beyond my knowledge" from the chain).

5 - Sometimes the output includes internal state that I don't want to see (agent responses, human responses, training-data context, internal output, the LangChain prompt, an answer containing page numbers with full context, partial intermediate answers, ...).

6 - Finally, I get different results each time for the same question.

I tried verbose=False, but I still get some unwanted details (along with the exact answer), which makes the bot noisy. How can I prevent this?
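A common mitigation for (2)-(4), sketched below under assumptions rather than as a confirmed fix: drop low-similarity retrievals before they reach the LLM. With the FAISS store above, `similarity_search_with_score` returns `(Document, score)` pairs where the score is an L2 distance (lower means more similar); the threshold value here is purely illustrative:

```python
def filter_by_score(docs_with_scores, max_distance=0.5):
    # FAISS's similarity_search_with_score returns (document, score) pairs
    # where the score is an L2 distance: lower means more similar.
    # Keep only hits closer than max_distance (the value is illustrative
    # and should be tuned against your own data).
    return [doc for doc, score in docs_with_scores if score < max_distance]

# Hypothetical usage with the vector store built above:
# hits = vector_store.similarity_search_with_score(query, k=4)
# relevant = filter_by_score(hits, max_distance=0.5)
```

Separately, the run-to-run variation in (6) is typically just sampling noise; setting `temperature=0` on the LLM makes the output (near-)deterministic.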

Expected behavior

When I ask questions related to the document I provided, it must return only the most relevant answer, without any other info such as internal state, prompts, etc.

Also, if I ask questions that are not related to the document I provided, it should return "I don't know, the question is beyond my knowledge".
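One sketch of how such refusal behavior can be encouraged (the template text below is an assumption for illustration, not LangChain's default prompt): instruct the model, in the QA prompt itself, to answer only from the retrieved context.

```python
# A restrictive QA prompt: the model is told to answer only from the
# retrieved context and to refuse otherwise. The wording is illustrative.
REFUSAL_TEMPLATE = """Use ONLY the context below to answer the question.
If the answer is not contained in the context, reply exactly:
"I don't know, the question is beyond my knowledge."

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(context: str, question: str) -> str:
    # Fill the template with a retrieved context and a user question.
    return REFUSAL_TEMPLATE.format(context=context, question=question)
```

In LangChain such a template would typically be wrapped in a `PromptTemplate` and passed to the chain via `chain_type_kwargs`; note that the stuff prompt of `RetrievalQAWithSourcesChain` expects a `summaries` variable rather than `context`, so the exact variable names depend on the chain and version.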

pradeepdev-1995 commented 1 year ago

@hwchase17 any update on this?

pradeepdev-1995 commented 1 year ago

@agola11 @eyurtsev can you please help me on this?

afnanhabib787 commented 1 year ago

following

orhansonmeztr commented 1 year ago

@pradeepdev-1995 @hwchase17 @eyurtsev +1 I have the same problem. In particular, it's frustrating to get (mostly) correct answers drawn from outside the context when asking a question that is not related to the context.

dosubot[bot] commented 11 months ago

Hi, @pradeepdev-1995! I'm Dosu, and I'm here to help the LangChain team manage our backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, the issue you reported is about unwanted and irrelevant results in document question answering using the Retrieval Augmented Generation pipeline. It seems like you and another user have experienced issues such as getting answers from outside the document and getting internal states and prompts along with the output. There have been requests for updates and assistance from other users as well.

Before we proceed, we would like to confirm if this issue is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself. If we don't receive any response within 7 days, the issue will be automatically closed.

Thank you for your understanding and cooperation. If you have any further questions or need assistance, please don't hesitate to ask.

Best regards, Dosu

premsai079 commented 11 months ago

Hi Pradeep, did you find any solution for restricting out-of-context answers?

pradeepdev-1995 commented 11 months ago

@premsai079 No, I still sometimes get out-of-context answers.

premsai079 commented 11 months ago

I created a chatbot application and I am facing the same issue. If you find any solution, please let me know.