langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License
92.38k stars · 14.77k forks

MultiQueryRetriever documentation code itself is not executing #17342

Closed nithinreddyyyyyy closed 7 months ago

nithinreddyyyyyy commented 7 months ago

Checklist

Issue with current documentation:

No response

Idea or request for content:

Below is the code, which I took directly from the MultiQueryRetriever LangChain documentation:

# Build a sample vectorDB
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Load blog post
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
data = loader.load()

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(data)

# VectorDB
embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=splits, embedding=embedding)

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import OpenAI

question = "What are the approaches to Task Decomposition?"
llm = OpenAI(temperature=0)
retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(), llm=llm
)

# Set logging for the queries
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

unique_docs = retriever_from_llm.get_relevant_documents(query=question)
len(unique_docs)

Below is the error it is returning:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pydantic/v1/main.py in parse_obj(cls, obj)
    521             try:
--> 522                 obj = dict(obj)
    523             except (TypeError, ValueError) as e:

TypeError: 'int' object is not iterable

The above exception was the direct cause of the following exception:

ValidationError                           Traceback (most recent call last)
14 frames
/usr/local/lib/python3.10/dist-packages/langchain/output_parsers/pydantic.py in parse_result(self, result, partial)
     24         try:
---> 25             return self.pydantic_object.parse_obj(json_object)
     26         except ValidationError as e:

/usr/local/lib/python3.10/dist-packages/pydantic/v1/main.py in parse_obj(cls, obj)
    524                 exc = TypeError(f'{cls.__name__} expected dict not {obj.__class__.__name__}')
--> 525                 raise ValidationError([ErrorWrapper(exc, loc=ROOT_KEY)], cls) from e
    526         return cls(**obj)

ValidationError: 1 validation error for LineList
__root__
  LineList expected dict not int (type=type_error)

During handling of the above exception, another exception occurred:

OutputParserException                     Traceback (most recent call last)
<ipython-input-73-07101c8e33b2> in <cell line: 34>()
     32 logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)
     33 
---> 34 unique_docs = retriever_from_llm.get_relevant_documents(query=question)
     35 len(unique_docs)

/usr/local/lib/python3.10/dist-packages/langchain_core/retrievers.py in get_relevant_documents(self, query, callbacks, tags, metadata, run_name, **kwargs)
    222         except Exception as e:
    223             run_manager.on_retriever_error(e)
--> 224             raise e
    225         else:
    226             run_manager.on_retriever_end(

/usr/local/lib/python3.10/dist-packages/langchain_core/retrievers.py in get_relevant_documents(self, query, callbacks, tags, metadata, run_name, **kwargs)
    215             _kwargs = kwargs if self._expects_other_args else {}
    216             if self._new_arg_supported:
--> 217                 result = self._get_relevant_documents(
    218                     query, run_manager=run_manager, **_kwargs
    219                 )

/usr/local/lib/python3.10/dist-packages/langchain/retrievers/multi_query.py in _get_relevant_documents(self, query, run_manager)
    170             Unique union of relevant documents from all generated queries
    171         """
--> 172         queries = self.generate_queries(query, run_manager)
    173         if self.include_original:
    174             queries.append(query)

/usr/local/lib/python3.10/dist-packages/langchain/retrievers/multi_query.py in generate_queries(self, question, run_manager)
    187             List of LLM generated queries that are similar to the user input
    188         """
--> 189         response = self.llm_chain(
    190             {"question": question}, callbacks=run_manager.get_child()
    191         )

/usr/local/lib/python3.10/dist-packages/langchain_core/_api/deprecation.py in warning_emitting_wrapper(*args, **kwargs)
    143                 warned = True
    144                 emit_warning()
--> 145             return wrapped(*args, **kwargs)
    146 
    147         async def awarning_emitting_wrapper(*args: Any, **kwargs: Any) -> Any:

/usr/local/lib/python3.10/dist-packages/langchain/chains/base.py in __call__(self, inputs, return_only_outputs, callbacks, tags, metadata, run_name, include_run_info)
    361         }
    362 
--> 363         return self.invoke(
    364             inputs,
    365             cast(RunnableConfig, {k: v for k, v in config.items() if v is not None}),

/usr/local/lib/python3.10/dist-packages/langchain/chains/base.py in invoke(self, input, config, **kwargs)
    160         except BaseException as e:
    161             run_manager.on_chain_error(e)
--> 162             raise e
    163         run_manager.on_chain_end(outputs)
    164         final_outputs: Dict[str, Any] = self.prep_outputs(

/usr/local/lib/python3.10/dist-packages/langchain/chains/base.py in invoke(self, input, config, **kwargs)
    154         try:
    155             outputs = (
--> 156                 self._call(inputs, run_manager=run_manager)
    157                 if new_arg_supported
    158                 else self._call(inputs)

/usr/local/lib/python3.10/dist-packages/langchain/chains/llm.py in _call(self, inputs, run_manager)
    102     ) -> Dict[str, str]:
    103         response = self.generate([inputs], run_manager=run_manager)
--> 104         return self.create_outputs(response)[0]
    105 
    106     def generate(

/usr/local/lib/python3.10/dist-packages/langchain/chains/llm.py in create_outputs(self, llm_result)
    256     def create_outputs(self, llm_result: LLMResult) -> List[Dict[str, Any]]:
    257         """Create outputs from response."""
--> 258         result = [
    259             # Get the text of the top generated string.
    260             {

/usr/local/lib/python3.10/dist-packages/langchain/chains/llm.py in <listcomp>(.0)
    259             # Get the text of the top generated string.
    260             {
--> 261                 self.output_key: self.output_parser.parse_result(generation),
    262                 "full_generation": generation,
    263             }

/usr/local/lib/python3.10/dist-packages/langchain/output_parsers/pydantic.py in parse_result(self, result, partial)
     27             name = self.pydantic_object.__name__
     28             msg = f"Failed to parse {name} from completion {json_object}. Got: {e}"
---> 29             raise OutputParserException(msg, llm_output=json_object)
     30 
     31     def get_format_instructions(self) -> str:

OutputParserException: Failed to parse LineList from completion 1. Got: 1 validation error for LineList
__root__
  LineList expected dict not int (type=type_error)

The same code was running yesterday, but it's returning an error today. There must be an issue on the LangChain side itself. Can you have a look at it?

dosubot[bot] commented 7 months ago

🤖

Hey @nithinreddyyyyyy, great to see you back here! Running into a bit of a snag, huh? Let's see what we can unravel together this time. 🌀

Thank you for bringing this to our attention. The error you're encountering seems to be related to the parse_obj method from the Pydantic library used in LangChain. This method expects a JSON object as input, but it appears to be receiving an integer instead, hence the TypeError: 'int' object is not iterable error.

The parse_obj method is called within the parse_result method of the PydanticOutputParser class in LangChain. Here's the relevant code:

def parse_result(self, result: List[Generation], *, partial: bool = False) -> Any:
    json_object = super().parse_result(result)
    try:
        return self.pydantic_object.parse_obj(json_object)
    except ValidationError as e:
        name = self.pydantic_object.__name__
        msg = f"Failed to parse {name} from completion {json_object}. Got: {e}"
        raise OutputParserException(msg, llm_output=json_object)

This error might be due to an unexpected response from the generate_queries method of the MultiQueryRetriever class, which is supposed to generate multiple variations of a given query using a language model. If the language model returns an integer instead of a JSON object, it could lead to this error.

To help us investigate this issue further, could you please provide more information about the inputs you're using? Specifically, it would be helpful to know the exact question you're passing to the generate_queries method and any other parameters you're using. This will help us reproduce the error and find a solution.

In the meantime, you might want to check the data you're passing to the generate_queries method to ensure it's in the expected format. If you're using a custom language model, you might also want to verify that it's returning the expected output.
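For example, a quick way to see what the model actually returns for the query-generation step is to call it directly on a prompt similar to the retriever's default one (a rough sketch, not the exact internal prompt; it assumes an OpenAI API key is configured):

# Quick check of the raw completion for a multi-query style prompt.
# The prompt text below is only an approximation of the retriever's built-in prompt.
from langchain_openai import OpenAI

llm = OpenAI(temperature=0)
prompt = (
    "You are an AI language model assistant. Generate three different "
    "versions of the following user question, one per line.\n"
    "Original question: What are the approaches to Task Decomposition?"
)
print(repr(llm.invoke(prompt)))  # expect newline-separated questions, not a bare number

If the completion already looks like a newline-separated list of questions, the problem is more likely in how the retriever's output parser interprets it than in the model itself.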

I hope this helps, and I look forward to your response.

Sources

#### About Dosu

This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
nithinreddyyyyyy commented 7 months ago

Sure, below is the data I'm giving as input:

documents

The documents variable is a list.

Below is the output of documents:

[Document(page_content='Text: AMD L', metadata={'source': '/content/files/1.csv', 'row': 0}),
 Document(page_content="Text: Human Rights Guiding Principles\n We commit to respect internationally recognized human rights as expressed in International Anti-Corruption", metadata={'source': '/content/files/1.csv', 'row': 1}),
 Document(page_content="Text: Grievance and Remedy\n .", metadata={'source': '/content/files/1.csv', 'row': 2}),
 Document(page_content="Text: Commitment to Support Human Rights\n AMD is built upon the value of putting people first - our customers, our employees, our suppliers, and our communities ", metadata={'source': '/content/files/1.csv', 'row': 3}),
 Document(page_content='Text: Updated: 07/01/2021', metadata={'source': '/content/files/1.csv', 'row': 4}),
 Document(page_content='Text: AMD\n Advanced Micro Devices, Inc. ("AMD")\n Global Human Rights Policy Statement', metadata={'source': '/content/files/1.csv', 'row': 5}),
 Document(page_content='Text: Our Commitment\n Human rights action.', metadata={'source': '/content/files/1.csv', 'row': 6})]

Then I'm using RecursiveCharacterTextSplitter and sending the above data to Chroma/FAISS. When I try to retrieve the documents, it returns the error I mentioned earlier.
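For reference, here is a minimal sketch of that pipeline (it reuses the documents list shown above; the embedding model and the sample query are my own assumptions):

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAI, OpenAIEmbeddings

# Split the Document list shown above and index it in Chroma.
splits = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0).split_documents(documents)
vectordb = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

# Same retriever setup as in the snippet at the top of this issue;
# the call below is where the OutputParserException is raised.
retriever_from_llm = MultiQueryRetriever.from_llm(retriever=vectordb.as_retriever(), llm=OpenAI(temperature=0))
unique_docs = retriever_from_llm.get_relevant_documents("What does AMD commit to on human rights?")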

dosubot[bot] commented 7 months ago

🤖

Based on the error message you provided, it seems like the issue is related to the parse_obj method from the Pydantic library, which is expecting a JSON object as input but is receiving an integer instead. This error typically occurs when you try to iterate over an integer, which is not iterable.

In the context of the MultiQueryRetriever class, this error might occur if the agenerate_queries or generate_queries methods are returning an integer instead of a list of strings (queries). These methods are supposed to return a list of queries by calling the acall or __call__ method on self.llm_chain respectively, and then attempting to access the self.parser_key attribute (which defaults to "lines") of the response["text"] object.

If response["text"] is an integer, or if response["text"] is a dictionary that doesn't contain the self.parser_key key, then getattr(response["text"], self.parser_key, []) will return an integer or an empty list respectively, leading to the error you're seeing.

To fix this issue, you need to ensure that response["text"] is a dictionary that contains the self.parser_key key, and that the value of this key is a list of strings. You can do this by modifying the acall or __call__ method of the LLMChain class (or whatever class self.llm_chain is an instance of) to ensure that it returns the correct type of object.

If you don't have control over the LLMChain class, you can add a check in the agenerate_queries and generate_queries methods to ensure that response["text"] is a dictionary and contains the self.parser_key key:

async def agenerate_queries(
        self, question: str, run_manager: AsyncCallbackManagerForRetrieverRun
    ) -> List[str]:
        response = await self.llm_chain.acall(
            inputs={"question": question}, callbacks=run_manager.get_child()
        )
        if isinstance(response["text"], dict) and self.parser_key in response["text"]:
            lines = response["text"][self.parser_key]
        else:
            lines = []
        if self.verbose:
            logger.info(f"Generated queries: {lines}")
        return lines

def generate_queries(
        self, question: str, run_manager: CallbackManagerForRetrieverRun
    ) -> List[str]:
        response = self.llm_chain(
            {"question": question}, callbacks=run_manager.get_child()
        )
        if isinstance(response["text"], dict) and self.parser_key in response["text"]:
            lines = response["text"][self.parser_key]
        else:
            lines = []
        if self.verbose:
            logger.info(f"Generated queries: {lines}")
        return lines

This will ensure that lines is always a list, preventing the TypeError from being raised. You can find more information about this in the LangChain repository.

Regarding the parse_obj method from the Pydantic library, it is used in the parse_result method of the PydanticOutputParser class in LangChain. The expected input for the parse_obj method is a JSON object (json_object), which is the result of the parse_result method from the JsonOutputParser class. This JSON object is expected to be a dictionary-like object that matches the structure of the Pydantic model (self.pydantic_object). The expected output of the parse_obj method is an instance of the Pydantic model (self.pydantic_object). This instance is created by parsing the input JSON object according to the structure and validation rules defined in the Pydantic model. You can find more information about this in the LangChain repository.
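As a concrete illustration: super().parse_result first reduces the completion text to a JSON value, and the error message shows that value ended up as the bare integer 1 (most likely from a completion formatted as a numbered list such as "1. ..."), which parse_obj then rejects. A minimal sketch of that failure, assuming a LineList model shaped like the one the retriever's parser uses and the pydantic v1 compatibility layer that appears in the traceback:

# Reproduces the validation error outside LangChain.
from typing import List
from pydantic.v1 import BaseModel, ValidationError

class LineList(BaseModel):
    lines: List[str]

print(LineList.parse_obj({"lines": ["query one", "query two"]}))  # parses fine

try:
    LineList.parse_obj(1)  # an integer, as in the traceback above
except ValidationError as e:
    print(e)  # "LineList expected dict not int"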

I hope this helps! If you have any other questions, feel free to ask.

Sources

#### About Dosu

This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
nithinreddyyyyyy commented 7 months ago

But I can't change the code that was built by the LangChain team, right? Is there any solution for this? Otherwise, can you make any changes to the code I gave?

dosubot[bot] commented 7 months ago

🤖

Based on the error message you provided, it seems like the issue is related to the parsing of the output from the language model (LLM). The OutputParserException is raised when the parse_result method of the OutputParser class fails to parse the output from the LLM.

The error message indicates that the parse_obj method of the pydantic library is expecting a dictionary, but it's receiving an integer instead. This could be due to the LLM returning an unexpected output format that the OutputParser is not able to handle.

Here are a few things you could try to resolve this issue:

  1. Check the output format of the LLM: Ensure that the LLM is returning the expected output format. You can do this by calling the LLM directly with a sample input and inspecting the output.

  2. Update the OutputParser: If the LLM is returning a different output format than expected, you might need to update the OutputParser to handle this new format. This could involve updating the parse_result method or the pydantic model used by the OutputParser.

  3. Update the LLM: If the LLM is not returning the expected output format, you might need to update the LLM to return the correct format. This could involve updating the LLM's configuration or its generate method.

Without more information about the LLM and its output, it's hard to provide a more specific solution. If you could provide more details about the LLM and its output, it would be easier to diagnose and resolve this issue.

Here is the relevant code from the PydanticOutputParser class:

def parse_result(self, result: List[Generation], *, partial: bool = False) -> Any:
    json_object = super().parse_result(result)
    try:
        return self.pydantic_object.parse_obj(json_object)
    except ValidationError as e:
        name = self.pydantic_object.__name__
        msg = f"Failed to parse {name} from completion {json_object}. Got: {e}"
        raise OutputParserException(msg, llm_output=json_object)

In this code, the parse_result method is trying to parse the output from the LLM using the pydantic_object. If the pydantic_object is unable to parse the output, it raises a ValidationError, which is then caught and re-raised as an OutputParserException.
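If you would rather not modify the library at all, one possible workaround is to subclass MultiQueryRetriever and override generate_queries so the raw completion is split into lines directly. This is only a sketch, not an official fix: the class name and prompt below are my own, it simply bypasses the retriever's internal LLMChain and output parser, and it reuses the vectordb and question from your original snippet:

from typing import List
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_openai import OpenAI

_llm = OpenAI(temperature=0)

class LineSplittingMultiQueryRetriever(MultiQueryRetriever):
    """Generates alternative queries itself instead of relying on the chain's output parser."""

    def generate_queries(
        self, question: str, run_manager: CallbackManagerForRetrieverRun
    ) -> List[str]:
        # run_manager is accepted to match the base signature but is not used in this sketch.
        prompt = (
            "Generate three different versions of the following user question, "
            f"one per line, to retrieve relevant documents:\n{question}"
        )
        text = _llm.invoke(prompt)
        return [line.strip() for line in text.split("\n") if line.strip()]

retriever = LineSplittingMultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(), llm=_llm
)
unique_docs = retriever.get_relevant_documents(query=question)

Whether you need this at all depends on the installed langchain version; if this turns out to be a regression in the library, upgrading or pinning to a working release once a fix lands is the cleaner path.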

Sources

#### About Dosu

This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
GMartin-dev commented 7 months ago

This is a regression. More details here: https://github.com/langchain-ai/langchain/issues/17352#issuecomment-1936822930

baskaryan commented 7 months ago

duplicate of https://github.com/langchain-ai/langchain/issues/17352