langchain-ai / rag-from-scratch


Part 9: Questions about implementation of HyDE #8

Open labdmitriy opened 4 months ago

labdmitriy commented 4 months ago

Hi @rlancemartin,

I have read the original HyDE paper and noticed (in sections 3.2 and 4.1) that the authors use multiple document generations (at temperature 0.7) together with the question itself to compute the final query embedding used for retrieving real documents, by taking the mean of these embeddings.
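In other words, my reading of that step is roughly the following sketch (just the formula as I understand it, not code from the paper; `embeddings` stands for any embedding model with embed_query/embed_documents methods):

import numpy as np

def hyde_query_vector(question, generated_passages, embeddings):
    # Stack the question embedding with the generated-passage embeddings
    # and average them into a single query vector for retrieval.
    vectors = [embeddings.embed_query(question)] + embeddings.embed_documents(generated_passages)
    return np.mean(np.array(vectors), axis=0)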

I also found that the implementation at the documentation link provided is probably outdated: it uses the legacy OpenAI model and a deprecated chain rather than LCEL, and it doesn't include the query embedding in the final query embedding calculation.

Since the steps in Part 9 are also not combined into a single LCEL chain, I tried to implement it myself, taking all of the comments above into account, and wrote the following code (assuming that we already have a vectorstore populated with documents and a matching embeddings object):

from functools import partial
from operator import itemgetter

import numpy as np
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableParallel
from langchain_openai.chat_models import ChatOpenAI

# `vectorstore` (already populated with documents) and `embeddings` (the embedding
# model used to build it) are assumed to be defined earlier.

# Generate `n` hypothetical documents for the question using the HyDE prompt.
def generate_docs(arguments):
    question = arguments['question']
    generation_template = arguments['template']
    n = arguments['n']
    prompt_hyde = ChatPromptTemplate.from_template(generation_template)
    generate_docs_for_retrieval = (
        prompt_hyde
        | ChatOpenAI(model='gpt-3.5-turbo-0125', temperature=0.7)
        | StrOutputParser()
    )
    # batch() runs `n` independent generations (see the note on invoke() and `n` below).
    generated_docs = generate_docs_for_retrieval.batch([{'question': question}] * n)
    return generated_docs

# Average the question embedding with the generated-document embeddings (HyDE).
def calculate_query_embeddings(query_components):
    question = query_components['question']
    generated_docs = query_components['docs']

    question_embeddings = np.array(embeddings.embed_query(question))
    generated_docs_embeddings = np.array(embeddings.embed_documents(generated_docs))

    query_embeddings = np.vstack([question_embeddings, generated_docs_embeddings])
    # Return a flat list of floats, which is what similarity_search_by_vector() expects.
    return np.mean(query_embeddings, axis=0).tolist()

# Retrieve real documents by similarity to the averaged query embedding.
def get_relevant_documents(query_embeddings, vectorstore, search_kwargs):
    return vectorstore.similarity_search_by_vector(query_embeddings, **search_kwargs)

search_kwargs = {'k': 4}
get_relevant_documents = partial(get_relevant_documents, vectorstore=vectorstore, search_kwargs=search_kwargs)

rag_template = """Answer the following question based on this context:
{context}

Question: {question}
"""
rag_prompt = ChatPromptTemplate.from_template(rag_template)

model = ChatOpenAI(model='gpt-3.5-turbo-0125', temperature=0)

chain = (
    RunnableParallel(
        {
            'question': itemgetter('question'),
            'context':
                RunnableParallel({
                    'question': itemgetter('question'),
                    'docs': generate_docs
                })
                | calculate_query_embeddings
                | get_relevant_documents,
        }
    )
    | rag_prompt
    | model
    | StrOutputParser()
)

generation_template = """Please write a scientific paper passage to answer the question
Question: {question}
Passage:"""
question = "What is task decomposition for LLM agents?"
n = 4

response = chain.invoke({
    'question': question,
    'template': generation_template,
    'n': n,
})
print(response)

I decided to use the batch() method of the Runnable to generate multiple documents, because I found that invoke() always returns only the first generation regardless of the n argument of ChatOpenAI (although all n generations are created and therefore increase the cost of the invocation).
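For reference, one alternative I considered is calling the chat model's lower-level generate() method with n set on the model, which should return all n generations from a single request, so the prompt tokens are sent only once. This is only a sketch based on my reading of the LLMResult structure (reusing generation_template and question from the code above), and I haven't verified the actual cost:

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai.chat_models import ChatOpenAI

prompt_hyde = ChatPromptTemplate.from_template(generation_template)
llm = ChatOpenAI(model='gpt-3.5-turbo-0125', temperature=0.7, n=4)

# generate() returns an LLMResult; generations[0] should hold all n ChatGeneration
# objects produced for this single prompt (my assumption about the structure).
result = llm.generate([prompt_hyde.format_messages(question=question)])
generated_docs = [generation.text for generation in result.generations[0]]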

It would be great to get your feedback on:
- the implementation details from the paper (using multiple generated documents plus the query itself for the embedding calculation);
- the implementation above (maybe you can recommend a more efficient solution, since with batch() we have to send the prompt tokens with each request);
- the invoke() behaviour (why it returns only the first generation, and whether there is a more cost-effective option than batch() if invoke() can't be used for multiple generations).

Thank you.

rlancemartin commented 4 months ago

Thanks for the detailed feedback! I'm going to review later this week (and continue making some new videos). I appreciate it!

labdmitriy commented 1 month ago

Hi @rlancemartin, could you please tell me whether you still plan to review this? Thank you.