langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Unable to return output from every source (i.e. every document); it returns only one output even when there are multiple documents #16935

Closed nithinreddyyyyyy closed 6 months ago

nithinreddyyyyyy commented 6 months ago

Issue with current documentation:

The code below returns only one answer even when there are multiple documents (multiple CSV files). I'm not sure which line of code I should change to get an output for every document.

# !pip -q install langchain openai tiktoken chromadb pypdf sentence-transformers==2.2.2 InstructorEmbedding faiss-cpu

import os

os.environ["OPENAI_API_KEY"] = ""

from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader

# InstructorEmbedding
from InstructorEmbedding import INSTRUCTOR
from langchain.embeddings import HuggingFaceInstructEmbeddings

# OpenAI Embedding
from langchain.embeddings import OpenAIEmbeddings

"""### Load Multiple files from Directory"""

root_dir = "/content/data"

# List of file paths for your CSV files
csv_files = ['one.csv', '1.csv', 'one-no.csv', 'one-yes.csv']

# Iterate over the file paths and create a loader for each file
loaders = [CSVLoader(file_path=file_path, encoding="utf-8") for file_path in csv_files]

# Now, loaders is a list of CSVLoader instances, one for each file

# Optional: If you need to combine the data from all loaders
documents = []
for loader in loaders:
    data = loader.load()  # or however you retrieve data from the loader
    documents.extend(data)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200)

texts = text_splitter.split_documents(documents)

len(texts)

import pickle
import faiss
from langchain.vectorstores import FAISS

def store_embeddings(docs, embeddings, store_name, path):
    vectorStore = FAISS.from_documents(docs, embeddings)

    with open(f"{path}/faiss_{store_name}.pkl", "wb") as f:
        pickle.dump(vectorStore, f)

def load_embeddings(store_name, path):
    with open(f"{path}/faiss_{store_name}.pkl", "rb") as f:
        VectorStore = pickle.load(f)
    return VectorStore
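
As a side note, pickling a FAISS vector store can be fragile across library versions; LangChain's FAISS wrapper also ships save_local/load_local helpers. A minimal alternative sketch of the two helpers above using them (note that loading then needs the same embeddings object used to build the index):

def store_embeddings(docs, embeddings, store_name, path):
    vector_store = FAISS.from_documents(docs, embeddings)
    # Persist the index and document metadata to a folder instead of pickling
    vector_store.save_local(f"{path}/faiss_{store_name}")

def load_embeddings(embeddings, store_name, path):
    # Reloading requires the same embeddings object used to build the index
    return FAISS.load_local(f"{path}/faiss_{store_name}", embeddings)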

"""### HF Instructor Embeddings"""

from langchain.embeddings import HuggingFaceInstructEmbeddings
# from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from InstructorEmbedding import INSTRUCTOR

instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl",
                                                      model_kwargs={"device": "cuda"})

Embedding_store_path = f"{root_dir}/Embedding_store"

db_instructEmbedd = FAISS.from_documents(texts, instructor_embeddings)

retriever = db_instructEmbedd.as_retriever(search_kwargs={"k": 5})

retriever.search_type

retriever.search_kwargs

docs = retriever.get_relevant_documents("Can you tell me about natixis risk mapping?")

docs[0]

# create the chain to answer questions; chain_type="stuff" stuffs all
# retrieved chunks into a single prompt, so the LLM returns one combined answer
qa_chain_instrucEmbed = RetrievalQA.from_chain_type(llm=OpenAI(temperature=0.2),
                                                    chain_type="stuff",
                                                    retriever=retriever,
                                                    return_source_documents=True)

"""### OpenAI's Embeddings"""

from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

db_openAIEmbedd = FAISS.from_documents(texts, embeddings)
retriever_openai = db_openAIEmbedd.as_retriever(search_kwargs={"k": 2})

# create the chain to answer questions
qa_chain_openai = RetrievalQA.from_chain_type(llm=OpenAI(temperature=0.2, ),
                                              chain_type="stuff",
                                              retriever=retriever_openai,
                                              return_source_documents=True)

"""### Testing both MODELS"""

## Cite sources

import textwrap

def wrap_text_preserve_newlines(text, width=110):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text

# def process_llm_response(llm_response):
#     print(wrap_text_preserve_newlines(llm_response['result']))
#     print('\nSources:')
#     for source in llm_response["source_documents"]:
#         print(source.metadata['source'])

def process_llm_response(llm_response):
    print(wrap_text_preserve_newlines(llm_response['result']))
    print('\nSources:')

    if llm_response["source_documents"]:
        # Access the first source document
        first_source = llm_response["source_documents"][0]
        source_name = first_source.metadata['source']
        row_number = first_source.metadata.get('row', 'Not specified')

        # Print the first source's file name and row number
        print(f"{source_name}, Row: {row_number}")
    else:
        print("No sources available.")

query = 'Can you tell me about natixis risk mapping??'

print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)

query = 'Can you tell me about natixis risk mapping??'

print('-------------------OpenAI Embeddings------------------\n')
llm_response = qa_chain_openai(query)
process_llm_response(llm_response)

Below is the actual output

query = 'Can you tell me about natixis risk mapping??'

print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)

-------------------Instructor Embeddings------------------

Answer:
 Natixis conducts a systematic identification and materiality assessment of climate risk impacts through their
risk mapping process. This involves evaluating all risks, including climate related risks, at the regional
level with the help of regional experts. The risks are assessed on a gross risk basis, meaning without
mitigation controls, and are analyzed according to the criteria "EBIT effect" and "probability." This process
also distinguishes between indirect impacts resulting from Natixis' exposure to other entities and direct
impacts from their own activities.

Sources:
Source 1: one.csv, Row: 14
Source 2: 1.csv, Row: 13
Source 3: one-no.csv, Row: 16
Source 4: one-yes.csv, Row: 50

Expected output:

Answer:
 Natixis conducts a systematic identification and materiality assessment of climate risk impacts through their
risk mapping process. This involves evaluating all risks, including climate related risks, at the regional
level with the help of regional experts. The risks are assessed on a gross risk basis, meaning without
mitigation controls, and are analyzed according to the criteria "EBIT effect" and "probability." This process
also distinguishes between indirect impacts resulting from Natixis' exposure to other entities and direct
impacts from their own activities.

Sources:
Source: one.csv, Row: 14

Answer:
 I'm not sure.

Sources:
Source: 1.csv, Row: 13

and so on.

It has returned only one answer across multiple sources, but I need an answer for each and every source. Can anyone please help me construct the code?

Idea or request for content:

No response

dosubot[bot] commented 6 months ago

🤖

Hey @nithinreddyyyyyy! Great to see you back here with another interesting challenge. How have you been?

Based on the provided context, it seems like you want to get an answer for each source document when there are multiple sources. However, the current implementation of the RetrievalQA class in the LangChain codebase retrieves relevant documents based on a given question and then combines them into a single answer.

To modify this behavior, you would need to change the way the RetrievalQA class processes and combines the documents. Specifically, you would need to modify the methods in the RetrievalQA class that process and combine the documents to ensure they return an answer for each document, rather than combining them into a single answer.

Here is a simplified example of how you might modify the process_llm_response function to print an answer for each source document:

def process_llm_response(llm_response):
    print('\nSources:')
    if llm_response["source_documents"]:
        for source in llm_response["source_documents"]:
            print(wrap_text_preserve_newlines(source['result']))
            source_name = source.metadata['source']
            row_number = source.metadata.get('row', 'Not specified')
            print(f"Source: {source_name}, Row: {row_number}\n")
    else:
        print("No sources available.")

In this modified version of the function, we loop over each source document in llm_response["source_documents"] and print the result and source information for each one.

Please note that this is a simplified example and may not work as expected without further modifications. The actual implementation would depend on the specific details of the RetrievalQA class and the structure of the llm_response object.

I hope this helps! If you have any further questions, please let me know.

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

nithinreddyyyyyy commented 6 months ago

The suggested code returned the error below:

-------------------Instructor Embeddings------------------

/usr/local/lib/python3.10/dist-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The function `__call__` was deprecated in LangChain 0.1.0 and will be removed in 0.2.0. Use invoke instead.
  warn_deprecated(

Sources:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-40-47fd43038026> in <cell line: 5>()
      3 print('-------------------Instructor Embeddings------------------\n')
      4 llm_response = qa_chain_instrucEmbed(query)
----> 5 process_llm_response(llm_response)

<ipython-input-39-22e6fadcdec8> in process_llm_response(llm_response)
     40     if llm_response["source_documents"]:
     41         for source in llm_response["source_documents"]:
---> 42             print(wrap_text_preserve_newlines(source['result']))
     43             source_name = source.metadata['source']
     44             row_number = source.metadata.get('row', 'Not specified')

TypeError: 'Document' object is not subscriptable

How do I overcome this?

dosubot[bot] commented 6 months ago

🤖

I see you're encountering a TypeError: 'Document' object is not subscriptable error. This is because the Document object in LangChain is not a dictionary or list that you can index into with square brackets. Instead, it's an instance of the Document class, which has attributes that you can access using dot notation.

In your case, you're trying to access the 'result' from each source document. However, the Document object doesn't have a 'result' attribute. Instead, it has a page_content attribute that stores the text of the document, and a metadata attribute that stores arbitrary metadata about the page content.

Here's how you can modify your process_llm_response function to print the page content and metadata for each source document:

def process_llm_response(llm_response):
    print('\nSources:')
    if llm_response["source_documents"]:
        for source in llm_response["source_documents"]:
            print(wrap_text_preserve_newlines(source.page_content))
            source_name = source.metadata['source']
            row_number = source.metadata.get('row', 'Not specified')
            print(f"Source: {source_name}, Row: {row_number}\n")
    else:
        print("No sources available.")

In this modified version of the function, we're using dot notation to access the page_content and metadata attributes of each source document.

Please note that this will print the page content of each source document, not the answer generated by the language model. If you want to generate a separate answer for each source document, you would need to modify the RetrievalQA class to process each document separately, which would involve a more significant change to the codebase.
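
For illustration, one hedged way to approximate per-document answers without modifying the RetrievalQA class is to run a plain QA chain once per retrieved document. This is a minimal sketch, assuming the retriever and wrap_text_preserve_newlines defined earlier (answer_per_document is an illustrative name, not a LangChain API):

from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

def answer_per_document(retriever, query):
    # One LLM call per retrieved chunk, instead of one call over all chunks
    chain = load_qa_chain(OpenAI(temperature=0.2), chain_type="stuff")
    for doc in retriever.get_relevant_documents(query):
        # invoke replaces the deprecated __call__ on recent LangChain versions
        result = chain.invoke({"input_documents": [doc], "question": query})
        print(wrap_text_preserve_newlines(result["output_text"]))
        source_name = doc.metadata['source']
        row_number = doc.metadata.get('row', 'Not specified')
        print(f"Source: {source_name}, Row: {row_number}\n")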

I hope this helps! If you have any further questions, please let me know.

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

nithinreddyyyyyy commented 6 months ago

But it returned multiple outputs from the same source and even missed some sources. Below is the output:

-------------------Instructor Embeddings------------------

Snippet: Natixis conducted a systematic identification and materiality assessment of climate risk impacts.
This exercise leveraged existing Natixis risk mapping and relied on a qualitative analysis of the materiality
of impacts by Environmental and Social Responsibility and risk experts in the short medium term ( 5 years) and
long term (5.30 years). The analysis led to distinguish between indirect impactsresulting from Natixis
exposure to other entities (clientsassetsetc.) exposed to climate risksand direct impacts to which Natixis is
exposed through its own activities.
Source: conflicts.csv, Row: 14

Snippet: All risksincluding climate related risksare identified and evaluated at the regional level with the
help of regional experts. They cover the entire range of climate related issues (transitional and physical
issues). Risks are assessed on a gross risk basis. Gross risk is defined as risk without mitigation controls.
The risks are analyzed according to the criteria “EBIT effect” and “probability.”
Source: conflicts.csv, Row: 13

Snippet: Wärtsilä identifies and assesses on an annual basis its sustainability risksincluding climate
change risksin both its strategic and operative risk assessments.
Source: conflicts.csv, Row: 16

Snippet: Climate risk has been identified as one of the most significant risks.
Source: conflicts.csv, Row: 50

Snippet: Impact & implication Aurubis is since 2013 part of the EU-ETS.
Source: conflicts1.csv, Row: 17

I was expecting one answer from every source, for instance 5 answers from 5 sources. If there are multiple chunks for one source, they should be combined into one snippet, and the output should be returned from that combined text.

dosubot[bot] commented 6 months ago

🤖

Based on the error message you're receiving, it seems like you're trying to access the 'Document' object as if it were a subscriptable object like a list or dictionary. However, as per the structure of the 'Document' object in the LangChain framework, it's a class with three attributes: page_content, metadata, and type. You can access these attributes directly. For example, if doc is a 'Document' object, you can access the page content with doc.page_content.

If you're trying to get the 'result' from each source document, you might be trying to access an attribute that doesn't exist in the 'Document' object. Could you clarify what you mean by 'result'? If 'result' is supposed to be a part of the metadata dictionary, you can access it with doc.metadata['result'], given that doc is a 'Document' object and 'result' is a key in the metadata dictionary.

Regarding the issue with the process_llm_response function, without the actual code or a more detailed description of its functionality, it's hard to provide a specific solution. However, based on your description, it seems like you're trying to process each document separately and then combine the results. One possible issue could be that you're not correctly iterating over each document. Make sure you're correctly iterating over your set of documents and processing each one individually.

As for handling multiple documents, the LangChain framework uses document loaders to load and process multiple documents. Each loader is designed to handle a specific type of document or data source. If you're loading multiple CSV files, you might want to use or create a loader specifically designed for this task.

If you're still having trouble, could you provide more details about your code, particularly the part where you're loading and processing the documents and the process_llm_response function? This would help in providing a more specific solution to your problem.

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

nithinreddyyyyyy commented 6 months ago

Below is the complete code:

# !pip -q install langchain openai tiktoken chromadb pypdf sentence-transformers==2.2.2 InstructorEmbedding faiss-cpu

import os

os.environ["OPENAI_API_KEY"] = ""

from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader

# InstructorEmbedding
from InstructorEmbedding import INSTRUCTOR
from langchain.embeddings import HuggingFaceInstructEmbeddings

# OpenAI Embedding
from langchain.embeddings import OpenAIEmbeddings

"""### Load Multiple files from Directory"""

root_dir = "/content/data"

# List of file paths for your CSV files
csv_files = ['one.csv', '1.csv', 'one-no.csv', 'one-yes.csv']

# Iterate over the file paths and create a loader for each file
loaders = [CSVLoader(file_path=file_path, encoding="utf-8") for file_path in csv_files]

# Now, loaders is a list of CSVLoader instances, one for each file

# Optional: If you need to combine the data from all loaders
documents = []
for loader in loaders:
    data = loader.load()  # or however you retrieve data from the loader
    documents.extend(data)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200)

texts = text_splitter.split_documents(documents)

len(texts)

import pickle
import faiss
from langchain.vectorstores import FAISS

def store_embeddings(docs, embeddings, store_name, path):
    vectorStore = FAISS.from_documents(docs, embeddings)

    with open(f"{path}/faiss_{store_name}.pkl", "wb") as f:
        pickle.dump(vectorStore, f)

def load_embeddings(store_name, path):
    with open(f"{path}/faiss_{store_name}.pkl", "rb") as f:
        VectorStore = pickle.load(f)
    return VectorStore

"""### HF Instructor Embeddings"""

from langchain.embeddings import HuggingFaceInstructEmbeddings
# from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from InstructorEmbedding import INSTRUCTOR

instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl",
                                                      model_kwargs={"device": "cuda"})

Embedding_store_path = f"{root_dir}/Embedding_store"

db_instructEmbedd = FAISS.from_documents(texts, instructor_embeddings)

retriever = db_instructEmbedd.as_retriever(search_kwargs={"k": 5})

retriever.search_type

retriever.search_kwargs

docs = retriever.get_relevant_documents("Can you tell me about natixis risk mapping?")

docs[0]

# create the chain to answer questions
qa_chain_instrucEmbed = RetrievalQA.from_chain_type(llm=OpenAI(temperature=0.2),
                                                    chain_type="stuff",
                                                    retriever=retriever,
                                                    return_source_documents=True)

"""### OpenAI's Embeddings"""

from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

db_openAIEmbedd = FAISS.from_documents(texts, embeddings)
retriever_openai = db_openAIEmbedd.as_retriever(search_kwargs={"k": 2})

# create the chain to answer questions
qa_chain_openai = RetrievalQA.from_chain_type(llm=OpenAI(temperature=0.2, ),
                                              chain_type="stuff",
                                              retriever=retriever_openai,
                                              return_source_documents=True)

"""### Testing both MODELS"""

## Cite sources

import textwrap

def wrap_text_preserve_newlines(text, width=110):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text

# def process_llm_response(llm_response):
#     print(wrap_text_preserve_newlines(llm_response['result']))
#     print('\nSources:')
#     for source in llm_response["source_documents"]:
#         print(source.metadata['source'])

def process_llm_response(llm_response):
    # print('\nSources:')
    # print("\n")
    if llm_response["source_documents"]:
        for source in llm_response["source_documents"]:
            print(wrap_text_preserve_newlines(source.page_content))
            source_name = source.metadata['source']
            row_number = source.metadata.get('row', 'Not specified')
            print(f"Source: {source_name}, Row: {row_number}\n")
    else:
        print("No sources available.")

query = 'Can you tell me about natixis risk mapping??'

print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)

The above gives the output below:

-------------------Instructor Embeddings------------------

Snippet: Natixis conducted a systematic identification and materiality assessment of climate risk impacts.
This exercise leveraged existing Natixis risk mapping and relied on a qualitative analysis of the materiality
of impacts by Environmental and Social Responsibility and risk experts in the short medium term ( 5 years) and
long term (5.30 years). The analysis led to distinguish between indirect impactsresulting from Natixis
exposure to other entities (clientsassetsetc.) exposed to climate risksand direct impacts to which Natixis is
exposed through its own activities.
Source: conflicts.csv, Row: 14

Snippet: All risksincluding climate related risksare identified and evaluated at the regional level with the
help of regional experts. They cover the entire range of climate related issues (transitional and physical
issues). Risks are assessed on a gross risk basis. Gross risk is defined as risk without mitigation controls.
The risks are analyzed according to the criteria “EBIT effect” and “probability.”
Source: conflicts.csv, Row: 13

Snippet: Wärtsilä identifies and assesses on an annual basis its sustainability risksincluding climate
change risksin both its strategic and operative risk assessments.
Source: conflicts.csv, Row: 16

Snippet: Climate risk has been identified as one of the most significant risks.
Source: conflicts.csv, Row: 50

Snippet: Impact & implication Aurubis is since 2013 part of the EU-ETS.
Source: conflicts1.csv, Row: 17

It is returning multiple outputs from the same source, but I was expecting one output from each and every source. It seems like multiple snippets from the same source should be combined, and from that combined text one output should be returned for that source. Can you please look into this? Something like the sketch below is what I mean.
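
Roughly, a sketch of that idea (untested; answer_per_source is just an illustrative name): group the retrieved chunks by source file, then ask the model once per group, so each source yields exactly one answer.

from collections import defaultdict
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

def answer_per_source(retriever, query):
    # Bucket the retrieved chunks by the file they came from
    grouped = defaultdict(list)
    for doc in retriever.get_relevant_documents(query):
        grouped[doc.metadata['source']].append(doc)

    chain = load_qa_chain(OpenAI(temperature=0.2), chain_type="stuff")
    for source_name, docs in grouped.items():
        # All chunks from one source are stuffed into a single prompt,
        # producing one combined answer for that source
        result = chain.invoke({"input_documents": docs, "question": query})
        print(f"Source: {source_name}")
        print(result["output_text"], "\n")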

AlexanderCohen commented 5 months ago

This is like a peek into someone's ChatGPT history.