langchain-ai / langchainjs

šŸ¦œšŸ”— Build context-aware reasoning applications šŸ¦œšŸ”—
https://js.langchain.com/docs/
MIT License

ConversationalRetrievalQAChain GPT not returning output based on the data provided #4393

Closed — cart12345 closed this issue 1 month ago

cart12345 commented 5 months ago

Description

I am building a chatbot that reads PDF documents so users can chat with them. I am using Pinecone as the database for storing the vectors. When I ask questions, the answers are not based on the data from the PDF, even though the returned source documents contain matches from the document. How do I fix this?

I uploaded a list of emails from my inbox as a PDF; when I ask questions about the emails, it does not answer based on the PDF. I have tried other PDF documents too.

dataKey is a random value I create to identify documents in the DB.

Code


```js
let trainingFile;
let docs = undefined;
if (mimetype === 'application/pdf') {
    const loader = new PDFLoader(filePath, {
        splitPages: true
    });
    docs = await loader.loadAndSplit();
} else {
    trainingFile = fs.readFileSync(filePath, "utf-8");
    const textSplitter = new RecursiveCharacterTextSplitter({
        chunkSize: 1000,
        chunkOverlap: 20
    });
    docs = await textSplitter.createDocuments([trainingFile]);
}

// Tag every chunk with the dataKey so it can be filtered on at query time.
for (const doc of docs) {
    doc.metadata = { ...doc.metadata, dataKey };
}

const pinecone = new Pinecone({
    apiKey: process.env.PINECONE_API_KEY,
});

// Wipe the index before re-ingesting, then embed and upsert the chunks.
const pineconeIndex = pinecone.Index(process.env.PINECONE_INDEX);
await pineconeIndex.deleteAll();
await PineconeStore.fromDocuments(docs, new OpenAIEmbeddings({
    openAIApiKey: process.env.OPEN_API_KEY,
    maxConcurrency: 5,
}), {
    pineconeIndex,
});
```
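One quick way to confirm the upsert landed (a sketch, not part of the original code; the exact stats fields depend on the @pinecone-database/pinecone version) is to check the index stats after ingestion:

```js
// Sanity check: confirm vectors were actually written before querying them.
const stats = await pineconeIndex.describeIndexStats();
console.log(stats); // look for a non-zero record/vector count
```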

**Code for asking questions:**

```js
try {
    const model = new OpenAI({
        openAIApiKey: process.env.OPEN_API_KEY,
        modelName: 'gpt-3.5-turbo',
        temperature: 0.2,
        streaming: false,
    });

    const pinecone = new Pinecone({
        apiKey: process.env.PINECONE_API_KEY,
    });
    const pineconeIndex = pinecone.Index(process.env.PINECONE_INDEX);
    const vectorStore = await PineconeStore.fromExistingIndex(
        new OpenAIEmbeddings({
            openAIApiKey: process.env.OPEN_API_KEY,
            maxConcurrency: 5,
        }),
        // The stray `undefined` shorthand creates a meaningless `undefined` key;
        // it appears to be a leftover placeholder (a namespace is passed here in
        // the revised code further down).
        { pineconeIndex, undefined, filter: { dataKey: { $eq: `${dataKey}` } } }
    );

    // Rebuild the chat history, attributing each message to the AI or the
    // human depending on who sent it.
    const buildChatHistory = (list, emailAddress) => {
        const chatHistory = [];

        list.forEach((message) => {
            // Extract the email from the 'from' field
            const email = extractEmail(message.from);

            // Check if the extracted email matches the provided 'emailAddress'
            if (email === emailAddress) {
                chatHistory.push(new AIMessage(message.body));
            } else {
                chatHistory.push(new HumanMessage(message.body));
            }
        });

        return chatHistory;
    };

    // Extract the address from a 'Name <address>' style 'from' field
    const extractEmail = (from) => {
        const emailRegex = /<([^>]+)>/;
        const match = from.match(emailRegex);
        return match ? match[1] : null;
    };

    const chatHistory = buildChatHistory(messageHistory, emailAddress);
    const sanitizedQuestion = emailData.trim().replaceAll("\n", " ");

    let template = PromptTemplate.fromTemplate(QA_TEMPLATE);
    const qaPrompt = await template.format({
        context: emailData,
        sender: from,
        recipient: emailAddress
    });

    let chain = ConversationalRetrievalQAChain.fromLLM(
        model,
        vectorStore.asRetriever({
            metadata: { title: true, source: true, tags: true },
        }),
        {
            qaTemplate: qaPrompt,
            returnSourceDocuments: true,
        }
    );

    let response = await chain.call({
        question: sanitizedQuestion,
        chat_history: chatHistory,
    });
```

Example output I am getting:

```
{
  text: 'Hi Dream Makers,\n' +
    '\n' +
    "Sorry i can't give you information about emails\n" +
    '\n' +
    'Best regards,\n' +
    '[Your Name]',
  sourceDocuments: [
    Document {
      pageContent: '<\n' +
        'yohei\n' +
        '@\n' +
        'deal\x00owdigest.com\n' +
        '>\n' +
        ' wrote: Woo hoo! Bot\n' +
        '...\n' +
        'Jan 24\n' +
        'Slack\n' +
        '2\n' +
        '[Slack] New messages from Mahesh M and Ashley - TAS - Bulba Coding in Remotasks - Community\n' +
        '- \n' +
        'You have a new mention in Remota\n' +
        '...\n' +
        'Jan 24\n' +
        'CONTACT\n' +
        ', \n' +
        'me\n' +
        '2\n' +
        ' day\n' +
```
System Info

```
"langchain": "^0.0.176",
"@langchain/pinecone": "^0.0.1",
"@pinecone-database/pinecone": "^2.0.1"
```

dosubot[bot] commented 5 months ago

šŸ¤–

Hey @cart12345! Long time no see in the digital realm. Hope you've been coding up some cool stuff lately! šŸ˜Š

Based on the code you've shared, it seems like you're correctly loading and splitting the PDFs, and storing the resulting documents in Pinecone. However, the issue might be with how you're retrieving and using the documents when asking questions.

In your question-asking code, you're using ConversationalRetrievalQAChain.fromLLM to create a QA chain. This method takes a language model and a retriever, and optionally a configuration object. The retriever you're passing in is created from the vectorStore using the asRetriever method, which also takes an optional configuration object.

The configuration object you're passing to asRetriever specifies that the retriever should return the title, source, and tags metadata fields. However, when you're creating the documents from the PDFs, you're only setting the dataKey metadata field. This means that the retriever won't be able to return any useful metadata, which might be why you're not getting the expected results.

To fix this, you could try adding the title, source, and tags metadata fields when creating the documents. Here's an example of how you might do this:

```js
for (const doc of docs) {
    doc.metadata = { ...doc.metadata, dataKey, title: 'Some title', source: 'Some source', tags: ['Some tag'] };
}
```

You would replace 'Some title', 'Some source', and ['Some tag'] with the actual title, source, and tags of the document.
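For what it's worth, in the langchain version pinned above the `metadata` key in `asRetriever`'s options is run metadata attached to callbacks rather than a document filter; the option that actually constrains which documents are retrieved is `filter`. A minimal sketch of routing the `dataKey` constraint through that option (the value of `k` is arbitrary):

```js
// Sketch: constrain retrieval to chunks tagged with this dataKey.
// `filter` is forwarded to Pinecone's metadata filtering; `k` caps the
// number of chunks returned.
const retriever = vectorStore.asRetriever({
    k: 4,
    filter: { dataKey: { $eq: dataKey } },
});
```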

If this doesn't solve your issue, it would be helpful if you could provide more information about the PDFs you're using and the questions you're asking. For example, are the PDFs text-based or image-based? What kind of questions are you asking, and what answers are you expecting? This would help me better understand your issue and provide a more accurate solution.

For more information about how to use the ConversationalRetrievalQAChain and asRetriever methods, you can refer to the LangChain documentation and the source code of these methods.
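Note also that the `qaTemplate` option expects a template string that still contains `{context}` and `{question}` placeholders, so the chain can inject the retrieved chunks at answer time; passing a prompt that has already been formatted leaves the chain no slot for the retrieved context. A minimal sketch (the template wording is illustrative):

```js
// Sketch: a QA template with the placeholders the chain fills in itself.
const qaTemplateWithPlaceholders = `Use the following context to answer the question.
If the answer is not in the context, say you don't know.

{context}

Question: {question}
Helpful answer:`;

const chain = ConversationalRetrievalQAChain.fromLLM(model, retriever, {
    qaTemplate: qaTemplateWithPlaceholders,
    returnSourceDocuments: true,
});
```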

#### About Dosu

This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research.

Help me be more useful! Please leave a šŸ‘ if this is helpful and šŸ‘Ž if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
cart12345 commented 5 months ago

That doesn't work.

dosubot[bot] commented 5 months ago

šŸ¤–

I see that the previous suggestion didn't work for you. Let's try to troubleshoot this further.

From the code you've shared, it seems like you're correctly loading and splitting the PDFs, and storing the resulting documents in Pinecone. However, the issue might be with how you're retrieving and using the documents when asking questions.

In your question-asking code, you're using ConversationalRetrievalQAChain.fromLLM to create a QA chain. This method takes a language model and a retriever, and optionally a configuration object. The retriever you're passing in is created from the vectorStore using the asRetriever method, which also takes an optional configuration object.

The configuration object you're passing to asRetriever specifies that the retriever should return the title, source, and tags metadata fields. However, when you're creating the documents from the PDFs, you're only setting the dataKey metadata field. This means that the retriever won't be able to return any useful metadata, which might be why you're not getting the expected results.

To fix this, you could try adding the title, source, and tags metadata fields when creating the documents. Here's an example of how you might do this:

```js
for (const doc of docs) {
    doc.metadata = { ...doc.metadata, dataKey, title: 'Some title', source: 'Some source', tags: ['Some tag'] };
}
```

You would replace 'Some title', 'Some source', and ['Some tag'] with the actual title, source, and tags of the document.

If this doesn't solve your issue, it would be helpful if you could provide more information about the PDFs you're using and the questions you're asking. For example, are the PDFs text-based or image-based? What kind of questions are you asking, and what answers are you expecting? This would help me better understand your issue and provide a more accurate solution.

For more information about how to use the ConversationalRetrievalQAChain and asRetriever methods, you can refer to the LangChain documentation and the source code of these methods.

I hope this helps! If you have any other questions or need further clarification, feel free to ask.

cart12345 commented 5 months ago

I have modified the code to the version below, and it still doesn't work:


```js
try {
    const model = new ChatOpenAI({
        openAIApiKey: process.env.OPEN_API_KEY,
        modelName: 'gpt-3.5-turbo',
        temperature: 1,
        streaming: false,
    });

    const pinecone = new Pinecone({
        apiKey: process.env.PINECONE_API_KEY,
    });
    const pineconeIndex = pinecone.Index(process.env.PINECONE_INDEX);
    const vectorStore = await PineconeStore.fromExistingIndex(
        new OpenAIEmbeddings({
            openAIApiKey: process.env.OPEN_API_KEY,
            maxConcurrency: 5,
        }),
        { pineconeIndex, namespace: 'Default', filter: { dataKey: { $eq: `${dataKey}` } } }
    );

    // Rebuild the chat history, attributing each message to the AI or the
    // human depending on who sent it.
    const buildChatHistory = (list, emailAddress) => {
        const chatHistory = [];

        list.forEach((message) => {
            // Extract the email from the 'from' field
            const email = extractEmail(message.from);

            // Check if the extracted email matches the provided 'emailAddress'
            if (email === emailAddress) {
                chatHistory.push(new AIMessage(message.body));
            } else {
                chatHistory.push(new HumanMessage(message.body));
            }
        });

        return chatHistory;
    };

    // Extract the address from a 'Name <address>' style 'from' field
    const extractEmail = (from) => {
        const emailRegex = /<([^>]+)>/;
        const match = from.match(emailRegex);
        return match ? match[1] : null;
    };

    const chatHistory = buildChatHistory(messageHistory, emailAddress);
    const sanitizedQuestion = emailData.trim().replaceAll("\n", " ");

    const templateQA = `
    You are an email AI, Your name is TextCraft AI autonomous agent. \n
    Your job is to reply to emails and write like a human \n
    General answering is fine.
    `;

    const pp = PromptTemplate.fromTemplate(templateQA); // note: created but never passed to the chain

    const chain = ConversationalRetrievalQAChain.fromLLM(model,
        vectorStore.asRetriever({
            metadata: {
                dataKey: dataKey
            }
        }));

    const query = `
      use the template ${templateQA}
      the email we got is ${sanitizedQuestion}
    `;

    // Note: ConversationalRetrievalQAChain reads its input from the
    // `question` key; an unrecognized `query` key is ignored.
    let response = await chain.invoke({
        query,
        chat_history: chatHistory,
    });

    console.log(response);
```
dosubot[bot] commented 5 months ago

šŸ¤–

Based on the code and output you've provided, it seems like the issue might be related to how the data is being queried or retrieved from Pinecone. The code for loading and splitting the PDFs, and storing the resulting documents in Pinecone seems correct. However, the issue might be in the part of the code where you're trying to retrieve the data from Pinecone and use it to generate a response.

In the code for asking questions, you're creating a PineconeStore from an existing index and passing a filter option. This filter is supposed to limit the documents that are retrieved from Pinecone based on their dataKey metadata. However, in the provided code, the filter option is being passed as `` { dataKey: { $eq: `${dataKey}` } } ``, which might not be the correct format.

The $eq operator comes from MongoDB-style query syntax. Pinecone's metadata filtering does accept a MongoDB-like subset of operators, including $eq, but matching is exact and type-sensitive, so the dataKey value in the filter must have the same type (string vs. number) as the value that was stored. You can check the Pinecone documentation or the PineconeStore class in the LangChain codebase to confirm the expected filter format.
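One way to test the filter outside the chain (a sketch; similaritySearch takes the query text, the number of results, and an optional metadata filter):

```js
// Sketch: query the store directly to see whether the filter matches anything.
const hits = await vectorStore.similaritySearch("test query", 4, {
    dataKey: { $eq: dataKey },
});
console.log(hits.length, hits[0]?.metadata);
```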

Also, it's worth noting that the PineconeStore.fromExistingIndex method is asynchronous, so you should make sure to await its result before using the vectorStore in the rest of your code.

Lastly, the output you're getting seems to be a string that includes email addresses and other unrelated information. This might indicate that the PDF data is not being correctly parsed or stored in Pinecone. You should check the contents of the docs array after loading and splitting the PDFs to make sure that the documents contain the correct data.
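A quick way to do that (a sketch; the slice sizes are arbitrary):

```js
// Sketch: eyeball the first few chunks after splitting, before upserting.
for (const doc of docs.slice(0, 3)) {
    console.log(doc.metadata, doc.pageContent.slice(0, 200));
}
```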

If you're still having issues, it would be helpful if you could provide more information about the PDF data you're using and the exact questions you're asking the chatbot. This would help in identifying the cause of the issue and finding a solution.

For more information, you can refer to the ConversationalRetrievalQAChain and PineconeStore sources in the LangChain codebase.

I hope this helps! If you have any more questions or need further clarification, feel free to ask.
