Closed nithinreddyyyyyy closed 6 months ago
🤖
Hey @nithinreddyyyyyy! Great to see you back here with another interesting challenge. How have you been?
Based on the provided context, it seems like you want to get an answer for each source document when there are multiple sources. However, the current implementation of the RetrievalQA class in the LangChain codebase retrieves relevant documents based on a given question and then combines them into a single answer.
To modify this behavior, you would need to change the way the RetrievalQA class processes and combines the documents. Specifically, you would need to modify the methods in the RetrievalQA class that process and combine the documents to ensure they return an answer for each document, rather than combining them into a single answer.
Here is a simplified example of how you might modify the process_llm_response function to print an answer for each source document:
def process_llm_response(llm_response):
    print('\nSources:')
    if llm_response["source_documents"]:
        for source in llm_response["source_documents"]:
            print(wrap_text_preserve_newlines(source['result']))
            source_name = source.metadata['source']
            row_number = source.metadata.get('row', 'Not specified')
            print(f"Source: {source_name}, Row: {row_number}\n")
    else:
        print("No sources available.")
In this modified version of the function, we loop over each source document in llm_response["source_documents"] and print the result and source information for each one.
Please note that this is a simplified example and may not work as expected without further modifications. The actual implementation would depend on the specific details of the RetrievalQA class and the structure of the llm_response object.
I hope this helps! If you have any further questions, please let me know.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
Sources
- libs/langchain/langchain/chains/qa_with_sources/loading.py
- libs/langchain/langchain/chains/retrieval_qa/base.py
It has returned below error
-------------------Instructor Embeddings------------------
/usr/local/lib/python3.10/dist-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The function `__call__` was deprecated in LangChain 0.1.0 and will be removed in 0.2.0. Use invoke instead.
warn_deprecated(
Sources:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
[<ipython-input-40-47fd43038026>](https://localhost:8080/#) in <cell line: 5>()
3 print('-------------------Instructor Embeddings------------------\n')
4 llm_response = qa_chain_instrucEmbed(query)
----> 5 process_llm_response(llm_response)
[<ipython-input-39-22e6fadcdec8>](https://localhost:8080/#) in process_llm_response(llm_response)
40 if llm_response["source_documents"]:
41 for source in llm_response["source_documents"]:
---> 42 print(wrap_text_preserve_newlines(source['result']))
43 source_name = source.metadata['source']
44 row_number = source.metadata.get('row', 'Not specified')
TypeError: 'Document' object is not subscriptable
how to overcome this?
🤖
I see you're encountering a TypeError: 'Document' object is not subscriptable error. This is because the Document object in LangChain is not a dictionary or list that you can index into with square brackets. Instead, it's an instance of the Document class, which has attributes that you can access using dot notation.
In your case, you're trying to access the 'result' from each source document. However, the Document object doesn't have a 'result' attribute. Instead, it has a page_content attribute that stores the text of the document, and a metadata attribute that stores arbitrary metadata about the page content.
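To make the distinction concrete, here is a minimal sketch. The Doc class below is a stand-in mirroring the attribute layout of LangChain's Document, not the real class: dot notation works, while subscripting the object raises the TypeError seen above.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in mirroring LangChain's Document attribute layout.
@dataclass
class Doc:
    page_content: str
    metadata: dict = field(default_factory=dict)

doc = Doc(page_content="Natixis conducted a systematic identification ...",
          metadata={"source": "conflicts.csv", "row": 14})

# Dot notation works:
print(doc.page_content)
print(doc.metadata["source"])

# Subscripting the object itself does not:
try:
    doc["result"]
except TypeError as err:
    print(f"TypeError: {err}")
```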
Here's how you can modify your process_llm_response function to print the page content and metadata for each source document:
def process_llm_response(llm_response):
    print('\nSources:')
    if llm_response["source_documents"]:
        for source in llm_response["source_documents"]:
            print(wrap_text_preserve_newlines(source.page_content))
            source_name = source.metadata['source']
            row_number = source.metadata.get('row', 'Not specified')
            print(f"Source: {source_name}, Row: {row_number}\n")
    else:
        print("No sources available.")
In this modified version of the function, we're using dot notation to access the page_content and metadata attributes of each source document.
Please note that this will print the page content of each source document, not the answer generated by the language model. If you want to generate a separate answer for each source document, you would need to modify the RetrievalQA class to process each document separately, which would involve a more significant change to the codebase.
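One way to approximate that per-document behavior without touching RetrievalQA internals is to run the question against each retrieved document yourself. The helper below is a hypothetical sketch: answer_fn stands in for whatever call produces an answer from a question plus one document's text (for example, a single-document QA chain invocation); it is not part of LangChain.

```python
def answer_per_document(question, docs, answer_fn):
    """Return one (source, answer) pair per retrieved document.

    docs      -- objects exposing .page_content and .metadata, like LangChain Documents
    answer_fn -- hypothetical callable (question, context_text) -> answer string
    """
    results = []
    for doc in docs:
        # Ask the question against this document's text only.
        answer = answer_fn(question, doc.page_content)
        results.append((doc.metadata.get("source", "unknown"), answer))
    return results
```

Each call to answer_fn costs one LLM round trip, so with k retrieved documents this is k times the work of the single "stuff" call.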
I hope this helps! If you have any further questions, please let me know.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
Sources
- libs/langchain/langchain/chains/qa_with_sources/map_reduce_prompt.py
- libs/core/langchain_core/documents/base.py
But it has returned multiple outputs from the same source, and it has even missed some sources. Below's the output:
-------------------Instructor Embeddings------------------
Snippet: Natixis conducted a systematic identification and materiality assessment of climate risk impacts.
This exercise leveraged existing Natixis risk mapping and relied on a qualitative analysis of the materiality
of impacts by Environmental and Social Responsibility and risk experts in the short medium term ( 5 years) and
long term (5-30 years). The analysis led to distinguishing between indirect impacts resulting from Natixis
exposure to other entities (clients, assets, etc.) exposed to climate risks, and direct impacts to which Natixis
is exposed through its own activities.
Source: conflicts.csv, Row: 14
Snippet: All risks, including climate-related risks, are identified and evaluated at the regional level with the
help of regional experts. They cover the entire range of climate-related issues (transitional and physical
issues). Risks are assessed on a gross risk basis. Gross risk is defined as risk without mitigation controls.
The risks are analyzed according to the criteria “EBIT effect” and “probability.”
Source: conflicts.csv, Row: 13
Snippet: Wärtsilä identifies and assesses on an annual basis its sustainability risks, including climate
change risks, in both its strategic and operative risk assessments.
Source: conflicts.csv, Row: 16
Snippet: Climate risk has been identified as one of the most significant risks.
Source: conflicts.csv, Row: 50
Snippet: Impact & implication Aurubis is since 2013 part of the EU-ETS.
Source: conflicts1.csv, Row: 17
I was expecting one answer from every source, for instance 5 answers from 5 sources. If there are multiple vector counts for one source, combine them into one snippet and return the output from them.
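That combining step can be sketched in plain Python, assuming the retrieved objects expose .page_content and .metadata like LangChain Documents (the helper name is hypothetical): group the chunks by their source before generating one answer per group.

```python
def combine_chunks_by_source(docs):
    """Merge retrieved chunks that share metadata['source'] into one text blob per source."""
    grouped = {}
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        grouped.setdefault(source, []).append(doc.page_content)
    # dicts preserve insertion order, so sources keep their retrieval order
    return {source: "\n".join(chunks) for source, chunks in grouped.items()}
```

Each combined blob could then be sent to the LLM once, yielding exactly one answer per source file instead of one per chunk.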
🤖
Based on the error message you're receiving, it seems like you're trying to access the 'Document' object as if it were a subscriptable object like a list or dictionary. However, as per the structure of the 'Document' object in the LangChain framework, it's a class with three attributes: page_content, metadata, and type. You can access these attributes directly. For example, if doc is a 'Document' object, you can access the page content with doc.page_content.
If you're trying to get the 'result' from each source document, you might be trying to access an attribute that doesn't exist in the 'Document' object. Could you clarify what you mean by 'result'? If 'result' is supposed to be a part of the metadata dictionary, you can access it with doc.metadata['result'], given that doc is a 'Document' object and 'result' is a key in the metadata dictionary.
Regarding the issue with the process_llm_response function, without the actual code or a more detailed description of its functionality, it's hard to provide a specific solution. However, based on your description, it seems like you're trying to process each document separately and then combine the results. One possible issue could be that you're not correctly iterating over each document. Make sure you're correctly iterating over your set of documents and processing each one individually.
As for handling multiple documents, the LangChain framework uses document loaders to load and process multiple documents. Each loader is designed to handle a specific type of document or data source. If you're loading multiple CSV files, you might want to use or create a loader specifically designed for this task.
If you're still having trouble, could you provide more details about your code, particularly the part where you're loading and processing the documents and the process_llm_response function? This would help in providing a more specific solution to your problem.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
below's the complete code
# !pip -q install langchain openai tiktoken chromadb pypdf sentence-transformers==2.2.2 InstructorEmbedding faiss-cpu
import os
os.environ["OPENAI_API_KEY"] = ""
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader
# InstructorEmbedding
from InstructorEmbedding import INSTRUCTOR
from langchain.embeddings import HuggingFaceInstructEmbeddings
# OpenAI Embedding
from langchain.embeddings import OpenAIEmbeddings
"""### Load Multiple files from Directory"""
root_dir = "/content/data"
# List of file paths for your CSV files
csv_files = ['one.csv', '1.csv', 'one-no.csv', 'one-yes.csv']
# Iterate over the file paths and create a loader for each file
loaders = [CSVLoader(file_path=file_path, encoding="utf-8") for file_path in csv_files]
# Now, loaders is a list of CSVLoader instances, one for each file
# Optional: If you need to combine the data from all loaders
documents = []
for loader in loaders:
    data = loader.load()  # or however you retrieve data from the loader
    documents.extend(data)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200)
texts = text_splitter.split_documents(documents)
len(texts)
import pickle
import faiss
from langchain.vectorstores import FAISS
def store_embeddings(docs, embeddings, store_name, path):
    vectorStore = FAISS.from_documents(docs, embeddings)
    with open(f"{path}/faiss_{store_name}.pkl", "wb") as f:
        pickle.dump(vectorStore, f)

def load_embeddings(store_name, path):
    with open(f"{path}/faiss_{store_name}.pkl", "rb") as f:
        VectorStore = pickle.load(f)
    return VectorStore
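Those helpers can be smoke-tested without a real FAISS index; the sketch below (hypothetical function names) pickles a plain dict in a temporary directory as a stand-in for the vector store. Note that some FAISS/LangChain versions do not pickle cleanly, in which case the wrapper's own persistence methods are the safer route.

```python
import pickle
import tempfile

def store_pickled(obj, store_name, path):
    # Same pattern as store_embeddings above, with the store passed in
    # instead of built from documents.
    with open(f"{path}/faiss_{store_name}.pkl", "wb") as f:
        pickle.dump(obj, f)

def load_pickled(store_name, path):
    with open(f"{path}/faiss_{store_name}.pkl", "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as tmp:
    store_pickled({"fake": "vector store"}, "demo", tmp)
    print(load_pickled("demo", tmp))  # {'fake': 'vector store'}
```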
"""### HF Instructor Embeddings"""
from langchain.embeddings import HuggingFaceInstructEmbeddings
# from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from InstructorEmbedding import INSTRUCTOR
instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl",
                                                      model_kwargs={"device": "cuda"})
Embedding_store_path = f"{root_dir}/Embedding_store"
db_instructEmbedd = FAISS.from_documents(texts, instructor_embeddings)
retriever = db_instructEmbedd.as_retriever(search_kwargs={"k": 5})
retriever.search_type
retriever.search_kwargs
docs = retriever.get_relevant_documents("Can you tell me about natixis risk mapping?")
docs[0]
# create the chain to answer questions
qa_chain_instrucEmbed = RetrievalQA.from_chain_type(llm=OpenAI(temperature=0.2),
                                                    chain_type="stuff",
                                                    retriever=retriever,
                                                    return_source_documents=True)
"""### OpenAI's Embeddings"""
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
db_openAIEmbedd = FAISS.from_documents(texts, embeddings)
retriever_openai = db_openAIEmbedd.as_retriever(search_kwargs={"k": 2})
# create the chain to answer questions
qa_chain_openai = RetrievalQA.from_chain_type(llm=OpenAI(temperature=0.2),
                                              chain_type="stuff",
                                              retriever=retriever_openai,
                                              return_source_documents=True)
"""### Testing both MODELS"""
## Cite sources
import textwrap
def wrap_text_preserve_newlines(text, width=110):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')
    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]
    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)
    return wrapped_text
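The helper can be sanity-checked on its own (the definition is repeated here only so the snippet runs standalone): each line wraps independently and blank lines survive.

```python
import textwrap

def wrap_text_preserve_newlines(text, width=110):
    lines = text.split('\n')
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]
    return '\n'.join(wrapped_lines)

sample = "first line that is fairly long\n\nsecond line"
# With a narrow width, the long line wraps but the blank line is preserved.
print(wrap_text_preserve_newlines(sample, width=10))
```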
# def process_llm_response(llm_response):
#     print(wrap_text_preserve_newlines(llm_response['result']))
#     print('\nSources:')
#     for source in llm_response["source_documents"]:
#         print(source.metadata['source'])
def process_llm_response(llm_response):
    # print('\nSources:')
    # print("\n")
    if llm_response["source_documents"]:
        for source in llm_response["source_documents"]:
            print(wrap_text_preserve_newlines(source.page_content))
            source_name = source.metadata['source']
            row_number = source.metadata.get('row', 'Not specified')
            print(f"Source: {source_name}, Row: {row_number}\n")
    else:
        print("No sources available.")
query = 'Can you tell me about natixis risk mapping??'
print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)
The above is giving the below output:
-------------------Instructor Embeddings------------------
Snippet: Natixis conducted a systematic identification and materiality assessment of climate risk impacts.
This exercise leveraged existing Natixis risk mapping and relied on a qualitative analysis of the materiality
of impacts by Environmental and Social Responsibility and risk experts in the short medium term ( 5 years) and
long term (5-30 years). The analysis led to distinguishing between indirect impacts resulting from Natixis
exposure to other entities (clients, assets, etc.) exposed to climate risks, and direct impacts to which Natixis
is exposed through its own activities.
Source: conflicts.csv, Row: 14
Snippet: All risks, including climate-related risks, are identified and evaluated at the regional level with the
help of regional experts. They cover the entire range of climate-related issues (transitional and physical
issues). Risks are assessed on a gross risk basis. Gross risk is defined as risk without mitigation controls.
The risks are analyzed according to the criteria “EBIT effect” and “probability.”
Source: conflicts.csv, Row: 13
Snippet: Wärtsilä identifies and assesses on an annual basis its sustainability risks, including climate
change risks, in both its strategic and operative risk assessments.
Source: conflicts.csv, Row: 16
Snippet: Climate risk has been identified as one of the most significant risks.
Source: conflicts.csv, Row: 50
Snippet: Impact & implication Aurubis is since 2013 part of the EU-ETS.
Source: conflicts1.csv, Row: 17
It is returning multiple outputs from the same source, but I was expecting one output from each source document. It seems like multiple snippets from the same source should be combined, and from that combined text it should return one output for that source. Can you please look into this?
This is like a peek into someone's ChatGPT history.
Issue with current documentation:
In the below code, it returns only one answer even when there are multiple documents (multiple CSV files). I'm a bit unsure which line of code I should change to get an output for every answer.
Below is the actual output
query = 'Can you tell me about natixis risk mapping??'
print('-------------------Instructor Embeddings------------------\n')
llm_response = qa_chain_instrucEmbed(query)
process_llm_response(llm_response)
Expected output:
so on
It has returned only 1 answer for multiple sources; I need an answer for each and every source. Can anyone please help me construct the code?
Idea or request for content:
No response