webdev-rohit commented 1 year ago

I have developed a chatbot using Langchain's OpenAI LLM (text-davinci) model and added my own contextual data using the llama index (GPT index) on top of GPT's existing data.

I'm facing an issue with a specific scenario in my chatbot. I have included the following FAQ in my training data, which consists of a large list of questions:

Q: Who is the Prime Minister of India? A: The Prime Minister of India is John Doe.

However, when I ask the bot this question, I want it to consistently provide this specific answer. While it does give the desired answer sometimes, most of the time it retrieves the answer from the internet or its own corpus, stating that the Prime Minister of India is Narendra Modi.

Essentially, I want complete control over the response generated by GPT when I ask questions from my training dataset. However, I also want GPT to utilize its own corpus to answer questions that are not part of my training dataset. For instance, if I ask a question like "Tell me something about European culture," which is not in my training dataset, GPT should provide a response based on its own knowledge. But when I enquire about the "PM of India," it should always respond with "John Doe."

It is important to note that this is not a typical fine-tuning scenario, as we are not looking to identify patterns in the questions. Fine-tuning fails when we ask questions like "Who is the wife of the PM?" since it provides the same answer as "Who is the PM?"

I would greatly appreciate any suggestions or assistance regarding this matter.

9akashnp8 commented 1 year ago

hey rohit, can you show us your prompt/query?

webdev-rohit commented 1 year ago

Hey @9akashnp8 my training data consists of questions in the following format - Q: Who is the Prime Minister of India? A: The Prime Minister of India is John Doe. Q: Who is the IT minister of India? A: The IT minister of India is Jane Doe.

My code for querying and creating vector embeddings from the above training data is as follows -

def construct_index(directory_path):

set maximum input size

    max_input_size = 4096
    # set number of output tokens
    num_outputs = 2000
    # set maximum chunk overlap
    max_chunk_overlap = 20
    # set chunk size limit
    chunk_size_limit = 600 

    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0.0, model_name="text-davinci-003", max_tokens=num_outputs))
    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)

    documents = SimpleDirectoryReader(directory_path).load_data()

    index = GPTSimpleVectorIndex(
        documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper
    )

    index.save_to_disk('index.json')

construct_index("context_data/data")

code to query the index.json file (which contains the embeddings) is -

message = "Who is the PM?" index = GPTSimpleVectorIndex.load_from_disk('./index.json') response = index.query(message, response_mode="compact") result = response.response

9akashnp8 commented 1 year ago

Okay thanks for sharing, I believe improving your prompt should help you here. The prompt is after all the instruction to GPT.

You can use prompt templates for the same. Eg:

from langchain.prompts import PromptTemplate

qna_template = """
You are an enthusiastic assistant who likes helping others.
From the info present in the "Context Section" below, try to
answer the user's questions. If you are unsure of the answer, reply
with "Sorry, I can't help you with this question". If enough data
is not present in the "Context Section", reply with "Sorry, there isn't
enough data to answer your questions"

Context Section:
{context}

Question:
{question}
"""

qna_prompt_template = PromptTemplate(
    input_variables=['context', 'question'],
    template=qna_template
)

dosubot[bot] commented 1 year ago

Hi, @webdev-rohit! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you are facing an issue with your chatbot where it doesn't consistently provide the desired answer from the training data for a specific question. You mentioned that you want complete control over the response generated by the chatbot for questions from the training dataset, while still allowing the chatbot to utilize its own knowledge for questions not in the training dataset.

In the comments, 9akashnp8 suggests improving the prompt using prompt templates to help with the issue. It seems like there has been some progress towards resolving the issue.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository!

langchain-ai / langchain

Get response from my training data using a GPT-based chatbot #4512

set maximum input size

code to query the index.json file (which contains the embeddings) is -