langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Unable to return output from every source (i.e. every document), rather it is returning only one output even if there are multiple documents #16938

Closed nithinreddyyyyyy closed 6 months ago

nithinreddyyyyyy commented 6 months ago

Issue with current documentation:

Below is the complete code:

# !pip -q install langchain openai tiktoken chromadb pypdf sentence-transformers==2.2.2 InstructorEmbedding faiss-cpu

import os

os.environ["OPENAI_API_KEY"] = ""

from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader

# InstructorEmbedding
from InstructorEmbedding import INSTRUCTOR
from langchain.embeddings import HuggingFaceInstructEmbeddings

# OpenAI Embedding
from langchain.embeddings import OpenAIEmbeddings

"""### Load Multiple files from Directory"""

root_dir = "/content/data"

# List of file paths for your CSV files
csv_files = ['one.csv', '1.csv', 'one-no.csv', 'one-yes.csv']

# Iterate over the file paths and create a loader for each file
loaders = [CSVLoader(file_path=file_path, encoding="utf-8") for file_path in csv_files]

# Now, loaders is a list of CSVLoader instances, one for each file

# Optional: If you need to combine the data from all loaders
documents = []
for loader in loaders:
    data = loader.load()  # or however you retrieve data from the loader
    documents.extend(data)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200)

texts = text_splitter.split_documents(documents)

len(texts)

import pickle
import faiss
from langchain.vectorstores import FAISS

def store_embeddings(docs, embeddings, store_name, path):
    vectorStore = FAISS.from_documents(docs, embeddings)

    with open(f"{path}/faiss_{store_name}.pkl", "wb") as f:
        pickle.dump(vectorStore, f)

def load_embeddings(store_name, path):
    with open(f"{path}/faiss_{store_name}.pkl", "rb") as f:
        VectorStore = pickle.load(f)
    return VectorStore
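
As an aside, LangChain's FAISS wrapper also ships save_local / load_local, which can be used instead of pickling the store. A minimal sketch mirroring the helpers above (the _native names are illustrative, and `embeddings` stands for one of the embedding objects defined in the next sections):

# Alternative persistence sketch using FAISS's own save_local / load_local
def store_embeddings_native(docs, embeddings, store_name, path):
    vectorStore = FAISS.from_documents(docs, embeddings)
    vectorStore.save_local(f"{path}/faiss_{store_name}")
    return vectorStore

def load_embeddings_native(store_name, path, embeddings):
    return FAISS.load_local(f"{path}/faiss_{store_name}", embeddings)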

"""### HF Instructor Embeddings"""

from langchain.embeddings import HuggingFaceInstructEmbeddings
# from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from InstructorEmbedding import INSTRUCTOR

instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl",
                                                      model_kwargs={"device": "cuda"})

Embedding_store_path = f"{root_dir}/Embedding_store"

db_instructEmbedd = FAISS.from_documents(texts, instructor_embeddings)

retriever = db_instructEmbedd.as_retriever(search_kwargs={"k": 5})

retriever.search_type

retriever.search_kwargs

docs = retriever.get_relevant_documents("Can you tell me about natixis risk mapping?")

docs[0]

# create the chain to answer questions
qa_chain_instrucEmbed = RetrievalQA.from_chain_type(llm=OpenAI(temperature=0.2),
                                                    chain_type="stuff",
                                                    retriever=retriever,
                                                    return_source_documents=True)

"""### OpenAI's Embeddings"""

from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

db_openAIEmbedd = FAISS.from_documents(texts, embeddings)
retriever_openai = db_openAIEmbedd.as_retriever(search_kwargs={"k": 2})

# create the chain to answer questions
qa_chain_openai = RetrievalQA.from_chain_type(llm=OpenAI(temperature=0.2),
                                              chain_type="stuff",
                                              retriever=retriever_openai,
                                              return_source_documents=True)

"""### Testing both MODELS"""

## Cite sources

import textwrap

def wrap_text_preserve_newlines(text, width=110):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text

# def process_llm_response(llm_response):
#     print(wrap_text_preserve_newlines(llm_response['result']))
#     print('\nSources:')
#     for source in llm_response["source_documents"]:
#         print(source.metadata['source'])

def process_llm_response(llm_response):
    # print('\nSources:')
    # print("\n")
    if llm_response["source_documents"]:
        for source in llm_response["source_documents"]:
            print(wrap_text_preserve_newlines(source.page_content))
            source_name = source.metadata['source']
            row_number = source.metadata.get('row', 'Not specified')
            print(f"Source: {source_name}, Row: {row_number}\n")
    else:
        print("No sources available.")

query = 'Can you tell me about natixis risk mapping??'

print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)

The above gives the output below:

-------------------Instructor Embeddings------------------

Snippet: Natixis conducted a systematic identification and materiality assessment of climate risk impacts.
This exercise leveraged existing Natixis risk mapping and relied on a qualitative analysis of the materiality
of impacts by Environmental and Social Responsibility and risk experts in the short medium term ( 5 years) and
long term (5.30 years). The analysis led to distinguish between indirect impacts resulting from Natixis
exposure to other entities (clients, assets, etc.) exposed to climate risks, and direct impacts to which
Natixis is exposed through its own activities.
Source: conflicts.csv, Row: 14

Snippet: All risks, including climate related risks, are identified and evaluated at the regional level with
the help of regional experts. They cover the entire range of climate related issues (transitional and physical
issues). Risks are assessed on a gross risk basis. Gross risk is defined as risk without mitigation controls.
The risks are analyzed according to the criteria “EBIT effect” and “probability.”
Source: conflicts.csv, Row: 13

Snippet: Wärtsilä identifies and assesses on an annual basis its sustainability risks, including climate
change risks, in both its strategic and operative risk assessments.
Source: conflicts.csv, Row: 16

Snippet: Climate risk has been identified as one of the most significant risks.
Source: conflicts.csv, Row: 50

Snippet: Impact & implication Aurubis is since 2013 part of the EU-ETS.
Source: conflicts1.csv, Row: 17

It is returning multiple snippets from the same source, but I was expecting one output per source document. It seems that multiple snippets from the same source should be combined, and a single output should be returned for each source based on that combined text. Can you please look into this?

Idea or request for content:

No response

dosubot[bot] commented 6 months ago

🤖

Hey @nithinreddyyyyyy! 🚀 Great to see you back with another intriguing puzzle for us to solve together. 😄

Based on your request, it seems you want to modify the LangChain code to return one output from each source document instead of multiple outputs from the same source. This can be achieved by modifying the combine_docs and acombine_docs methods in the StuffDocumentsChain class.

Currently, these methods combine all documents into one prompt and pass it to the language model. Instead, you should iterate over the documents, generate a prompt for each one, and pass each prompt to the language model separately.

Here's how you can modify the combine_docs method:

def combine_docs(
    self, docs: List[Document], callbacks: Callbacks = None, **kwargs: Any
) -> List[Tuple[str, dict]]:
    """Generate a separate prompt for each document and pass to LLM.

    Args:
        docs: List of documents, each of which is processed separately
        callbacks: Optional callbacks to pass along
        **kwargs: additional parameters to use to get inputs to LLMChain.

    Returns:
        A list of tuples. Each tuple contains the single string output for a document 
        and a dictionary of other keys to return.
    """
    results = []
    for doc in docs:
        inputs = self._get_inputs([doc], **kwargs)
        # Call predict on the LLM.
        result = self.llm_chain.predict(callbacks=callbacks, **inputs), {}
        results.append(result)
    return results

And here's how you can modify the acombine_docs method:

async def acombine_docs(
    self, docs: List[Document], callbacks: Callbacks = None, **kwargs: Any
) -> List[Tuple[str, dict]]:
    """Async generate a separate prompt for each document and pass to LLM.

    Args:
        docs: List of documents, each of which is processed separately
        callbacks: Optional callbacks to pass along
        **kwargs: additional parameters to use to get inputs to LLMChain.

    Returns:
        A list of tuples. Each tuple contains the single string output for a document 
        and a dictionary of other keys to return.
    """
    results = []
    for doc in docs:
        inputs = self._get_inputs([doc], **kwargs)
        # Call predict on the LLM.
        result = await self.llm_chain.apredict(callbacks=callbacks, **inputs), {}
        results.append(result)
    return results

These modifications will ensure that each document is processed separately and a separate output is generated for each document. You can find the original implementation of these methods in the libs/langchain/langchain/chains/combine_documents/stuff.py file in the LangChain repository.

Please note that these changes will affect the behavior of the StuffDocumentsChain class and any other classes that use it. Make sure to thoroughly test your code after making these modifications to ensure it behaves as expected.
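
Alternatively, if you prefer not to modify library code, you can approximate the behaviour you want at the call site: group the retrieved documents by their source metadata and run a "stuff" question-answering chain once per group. A minimal sketch, assuming the retriever and OpenAI LLM already defined in this thread (load_qa_chain is the stock helper from langchain.chains.question_answering):

from collections import defaultdict

from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

query = "Can you tell me about natixis risk mapping?"

# Retrieve the top-k chunks as before, then bucket them by source file
docs = retriever.get_relevant_documents(query)
docs_by_source = defaultdict(list)
for doc in docs:
    docs_by_source[doc.metadata["source"]].append(doc)

# "Stuff" only one source's chunks into the prompt at a time,
# so the LLM produces exactly one answer per source
qa_chain = load_qa_chain(OpenAI(temperature=0.2), chain_type="stuff")

for source, source_docs in docs_by_source.items():
    answer = qa_chain.run(input_documents=source_docs, question=query)
    print(f"Source: {source}\n{answer}\n")

Note that only sources present in the retrieved top-k chunks will be answered; raise k in search_kwargs if you need broader source coverage. This keeps StuffDocumentsChain untouched, at the cost of one LLM call per source.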

Let me know if you need further assistance.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.