langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License
94.89k stars 15.37k forks

Not showing query field when trying to retrieve documents using SelfQueryRetriever #17040

Closed nithinreddyyyyyy closed 9 months ago

nithinreddyyyyyy commented 9 months ago

Issue with current documentation:

Below is the code:

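# Imports assumed for this snippet; exact module paths vary by LangChain version.
from langchain.chains import RetrievalQA
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAI, OpenAIEmbeddings

# `embeddings` is not defined in the original snippet; OpenAIEmbeddings is an assumption.
embeddings = OpenAIEmbeddings()

pdf_file = '/content/documents/Pre-proposal students.pdf'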

# Define your prompt template
prompt_template = """Use the following pieces of information to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context: {context}
Question: {question}

Only return the helpful answer below and nothing else. If no context, then no answer.
Helpful Answer:"""

# Load the PDF file
loader = PyPDFLoader(pdf_file)
document = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
                                            chunk_size=1000,
                                            chunk_overlap=200)

# Split the document into chunks
texts = text_splitter.split_documents(document)

vectorstore = Chroma.from_documents(texts, embeddings)

llm = OpenAI(temperature=0)

# Create a retriever for the vector database
document_content_description = "Description of research papers and research proposal"

metadata_field_info = [
    AttributeInfo(
        name="title",
        description="The title of the research paper.",
        type="string",
    ),
    AttributeInfo(
        name="institution",
        description="The name of the institution or university associated with the research.",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the research was published.",
        type="integer",
    ),
    AttributeInfo(
        name="abstract",
        description="A brief summary of the research paper.",
        type="string",
    ),
    AttributeInfo(
        name="methodology",
        description="The main research methods used in the study.",
        type="string",
    ),
    AttributeInfo(
        name="findings",
        description="A brief description of the main findings of the research.",
        type="string",
    ),
    AttributeInfo(
        name="implications",
        description="The implications of the research findings.",
        type="string",
    ),
    AttributeInfo(
        name="reference_count",
        description="The number of references cited in the research paper.",
        type="integer",
    ),
    AttributeInfo(
        name="doi",
        description="The Digital Object Identifier for the research paper.",
        type="string",
    ),
]

retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    enable_limit=True,
    verbose=True,
)

# retriever.get_relevant_documents("What is the title of the proposal")

# logging.basicConfig(level=logging.INFO)

# Create a chain to answer questions
qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                        chain_type="stuff",
                                        retriever=retriever,
                                        return_source_documents=True)

retriever.get_relevant_documents("main research method")

Below is the output:

[Document(page_content='Training and evaluation corpora inlow-resource\nlanguages may notbeaseffective due tothepaucity of\ndata.\n3.Create acentral dialect tomediate between the\nvarious Gondi dialects, which can beused asa\nstandard language forallGondi speakers.\n4.Low BLEU scores formachine translation model :\nThere isaneed forbetter methods oftraining and\nevaluating machine translation models.\nPOS Tagging\nData Collection', metadata={'page': 0, 'source': '/content/documents/Pre-proposal PhD students.pdf'}))]

In the LangChain SelfQueryRetriever documentation, however, the following output is shown:

StructuredQuery(query='taxi driver', filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='genre', value='science fiction'), Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.GTE: 'gte'>, attribute='year', value=1990), Comparison(comparator=<Comparator.LT: 'lt'>, attribute='year', value=2000)]), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='director', value='Luc Besson')]), limit=None)

There I can see the query field, which is set to 'taxi driver'. My output does not show it.

Idea or request for content:

No response

dosubot[bot] commented 9 months ago

🤖

Hey @nithinreddyyyyyy! Great to see you around here again. How's everything going with your projects? 😊

You're not seeing the query field in your output because of the use_original_query attribute of the SelfQueryRetriever class. If use_original_query is set to False, the original query is not used for the search. Instead, the structured_query_translator generates a new query from the StructuredQuery object.

Here's the relevant code from the SelfQueryRetriever class:

def _prepare_query(
        self, query: str, structured_query: StructuredQuery
    ) -> Tuple[str, Dict[str, Any]]:
        new_query, new_kwargs = self.structured_query_translator.visit_structured_query(
            structured_query
        )
        if structured_query.limit is not None:
            new_kwargs["k"] = structured_query.limit
        if self.use_original_query:
            new_query = query
        search_kwargs = {**self.search_kwargs, **new_kwargs}
        return new_query, search_kwargs

In this method, new_query and new_kwargs are generated by the structured_query_translator visiting the structured_query. If self.use_original_query is True, new_query is replaced by the original query. If self.use_original_query is False, the original query is not used, and new_query remains as the query generated by the structured_query_translator.
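
The branch described above can be exercised in isolation with stand-in classes. This is a minimal sketch, not LangChain code: FakeStructuredQuery, FakeTranslator, and prepare_query are hypothetical names that only mirror the control flow of the quoted method.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass
class FakeStructuredQuery:
    """Hypothetical stand-in for StructuredQuery (query text plus optional limit)."""

    query: str
    limit: Optional[int] = None


class FakeTranslator:
    """Hypothetical stand-in for structured_query_translator."""

    def visit_structured_query(
        self, sq: FakeStructuredQuery
    ) -> Tuple[str, Dict[str, Any]]:
        # A real translator would also turn the filter into vector-store kwargs.
        return sq.query, {}


def prepare_query(
    query: str,
    structured_query: FakeStructuredQuery,
    use_original_query: bool,
    search_kwargs: Optional[Dict[str, Any]] = None,
) -> Tuple[str, Dict[str, Any]]:
    """Mirror the control flow of the quoted _prepare_query method."""
    new_query, new_kwargs = FakeTranslator().visit_structured_query(structured_query)
    if structured_query.limit is not None:
        new_kwargs["k"] = structured_query.limit
    if use_original_query:
        # Discard the rewritten text and search with the user's own wording.
        new_query = query
    return new_query, {**(search_kwargs or {}), **new_kwargs}


original = "What are two papers about the main research method?"
rewritten = FakeStructuredQuery(query="main research method", limit=2)

print(prepare_query(original, rewritten, use_original_query=False))
print(prepare_query(original, rewritten, use_original_query=True))
```

With the flag off, the rewritten text is what reaches the search; with it on, the user's wording is kept. The limit is applied as k in both cases.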

To use the original query, you should set use_original_query to True when creating an instance of SelfQueryRetriever. You can find this in the LangChain repository.
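
A minimal sketch of that change, reusing the names from the question (llm, vectorstore, document_content_description, metadata_field_info). The helpers retriever_options and build_retriever are hypothetical, and the sketch assumes the installed LangChain version accepts use_original_query as a keyword argument of SelfQueryRetriever.from_llm.

```python
from typing import Any, Dict


def retriever_options(use_original_query: bool = True) -> Dict[str, Any]:
    """Keyword arguments for SelfQueryRetriever.from_llm (hypothetical helper)."""
    return {
        "enable_limit": True,
        "verbose": True,
        # Search the vector store with the user's text, not the LLM's rewrite.
        "use_original_query": use_original_query,
    }


def build_retriever(llm, vectorstore, document_content_description, metadata_field_info):
    """Build the retriever from the question with use_original_query enabled."""
    # Imported lazily so retriever_options stays usable without LangChain installed.
    from langchain.retrievers.self_query.base import SelfQueryRetriever

    return SelfQueryRetriever.from_llm(
        llm,
        vectorstore,
        document_content_description,
        metadata_field_info,
        **retriever_options(),
    )
```

Note that this changes only the text used for the similarity search. The retriever still returns a list of Document objects, so the StructuredQuery does not appear in the return value.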

Also, the default value of the use_original_query attribute in the SelfQueryRetriever class in LangChain is False. This is defined in the SelfQueryRetriever class definition:

class SelfQueryRetriever(BaseRetriever):
    ...
    use_original_query: bool = False
    """Use original query instead of the revised new query from LLM"""
    ...

You can also find this in the LangChain repository.
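
If the aim is to see the generated structured query itself, logging configuration is a likely factor. The sketch below assumes the retriever reports the generated query through Python's standard logging module at INFO level when verbose=True; that is an assumption about the installed version, and the logger name demo.self_query and the message text are hypothetical. It demonstrates only the standard-library behavior: an INFO record is dropped until the root logger is set to INFO, which is what the commented-out logging.basicConfig(level=logging.INFO) line in the question would do.

```python
import logging

# Hypothetical logger standing in for the one a library module would create.
logger = logging.getLogger("demo.self_query")


class ListHandler(logging.Handler):
    """Collect log messages in a list so the effect is easy to inspect."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


handler = ListHandler()
root = logging.getLogger()
root.addHandler(handler)

# WARNING is the root logger's default level, so this INFO record is filtered out.
root.setLevel(logging.WARNING)
logger.info("Generated Query: %s", "query='main research method'")
before = list(handler.messages)

# logging.basicConfig(level=logging.INFO) sets this same level (and attaches a
# stream handler) when the root logger has no handlers yet.
root.setLevel(logging.INFO)
logger.info("Generated Query: %s", "query='main research method'")
after = list(handler.messages)

print(before)
print(after)
```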

I hope this helps! If you have any other questions, feel free to ask.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
