Closed: Ma-Fukudama closed this issue 1 year ago
🤖
Hello,
From your code, it seems like you've correctly created an index using the ListIndex.from_documents function. This function creates a new instance of the ListIndex class from a list of documents. The ListIndex object it returns contains an index structure that holds the nodes representing the documents, and a document store that can be used to retrieve those nodes by their IDs.
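For illustration, here is a minimal sketch (the sample texts are placeholders, and this assumes the 0.8.x docstore API) of building a ListIndex and peeking at its document store:
from llama_index import Document, ListIndex

documents = [
    Document(text="First sample document."),
    Document(text="Second sample document."),
]
index = ListIndex.from_documents(documents)

# The docstore maps node IDs to the nodes built from the documents.
for node_id, node in index.docstore.docs.items():
    print(node_id, node.get_content()[:50])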
However, creating an index is just the first step. After loading the data into the index, you need to create a query engine from it and call the engine's query method. Querying retrieves the relevant nodes from the index and synthesizes a response from them.
Here is an example of how you can query the index:
query_engine = index.as_query_engine()
response = query_engine.query("Your query here")
print(response)
Please replace "Your query here" with the actual query you want to make.
Also, make sure that the data you're trying to index is in the correct format. The ListIndex.from_documents function expects a list of Document objects, where each Document object represents a document to be indexed. In your code, you're creating Document objects from a list of strings (text_list), which might not be the correct format if your documents are more complex.
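To make the difference concrete, here is a hedged sketch (the file name is a placeholder): wrapping a string in a Document indexes the string itself, while SimpleDirectoryReader reads the file's contents:
from llama_index import Document, SimpleDirectoryReader

# Indexes the literal string "A.txt", NOT the contents of the file:
docs_from_strings = [Document(text=t) for t in ['A.txt']]

# Reads the file from disk and indexes its contents:
docs_from_files = SimpleDirectoryReader(input_files=['A.txt']).load_data()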
If you're still having issues, could you please provide more information about the data you're trying to index and the queries you're making? This will help me better understand your issue and provide a more accurate solution.
I hope this helps! Let me know if you have any other questions.
Best, Dosu
@dosu-bot The source that calls the index is as follows
storage_context = StorageContext.from_defaults(persist_dir="/index")
index = load_index_from_storage(storage_context)
engine = index.as_query_engine(text_qa_template=QuestionAnswerPrompt(self.QA_PROMPT_TMPL), retriever_mode=ListRetrieverMode.EMBEDDING, similarity_top_k=1)
# The appended Japanese string asks: "Please include source links in your answer."
result = engine.query(question + "回答には出典のリンクを含めるようにしてください。")
🤖
Based on the information you've provided and the context from the LlamaIndex repository, there could be a few reasons why the responses from your LlamaIndex are not reflecting the content of the indexed documents.
1. Data Loading: Ensure that the data is being loaded correctly into the index. A problem here could be due to an error in the data loading function or in the data itself. You can check this by ensuring that the _insert method in the DocumentSummaryIndex class is being called correctly and that it is successfully inserting the documents into the index.
2. Query Matching: The query might not be matching the indexed documents. This could be due to the format of the query or the format of the indexed documents. You can check this by ensuring that the as_retriever method in the DocumentSummaryIndex class is being called with the correct query and that the query is in a format that matches the indexed documents.
Regarding the format of the documents to be indexed, they should be a sequence of BaseNode objects. These nodes can be processed by the _add_nodes_to_index method of the DocumentSummaryIndex class, and they should also be in a format that can be queried by the as_retriever method of that class.
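One way to sanity-check the retrieval side in isolation is to call the retriever directly and inspect what comes back; a sketch, assuming the 0.8.x retriever API (the query string is a placeholder):
retriever = index.as_retriever()
nodes = retriever.retrieve("a question about your documents")
for n in nodes:
    # n is a NodeWithScore; the score may be None for some retriever modes
    print(n.score, n.node.get_content()[:80])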
As for the ListIndex.from_documents method, I couldn't find specific information about it in the repository. Typically, though, such a method creates an index from a list of documents, iterating over them and adding each one to the index.
Lastly, as per a solved issue in the repository, you might want to consider using more custom prompts and defining both a text QA template and a refine template. This could help in getting more accurate responses from the LLM. You can find more information on custom prompts at https://gpt-index.readthedocs.io/en/latest/how_to/customization/custom_prompts.html.
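A minimal sketch of that suggestion, assuming the 0.8.x prompt classes (the template wording is only an example):
from llama_index import QuestionAnswerPrompt, RefinePrompt

qa_template = QuestionAnswerPrompt(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given this context, answer the question: {query_str}\n"
)
refine_template = RefinePrompt(
    "The original question is: {query_str}\n"
    "We have an existing answer: {existing_answer}\n"
    "Refine it if needed using this additional context:\n{context_msg}\n"
)
engine = index.as_query_engine(text_qa_template=qa_template, refine_template=refine_template)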
I hope this helps! If you have any more questions or need further clarification, feel free to ask.
@dosu-bot The source is below
Indexing part:
import os, sys, json, logging
from dotenv import load_dotenv
from llama_index import SimpleDirectoryReader, ServiceContext, ListIndex, LLMPredictor, Document, load_index_from_storage, SimpleWebPageReader
from llama_index.llms import OpenAI
from llama_index.langchain_helpers.agents import create_llama_chat_agent, IndexToolConfig

load_dotenv()

# Set the log level
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG, force=True)

text_list = ['paul_graham_essay.txt']
index = ListIndex([])

# LlamaIndex settings
llm = OpenAI(model="gpt-4", temperature=0, max_tokens=2048)
service_context = ServiceContext.from_defaults(llm=llm)

# Read the blog at the specified URL.
url_documents = SimpleWebPageReader(html_to_text=True).load_data(
    [
        "http://paulgraham.com/worked.html"
    ]
)

# Create the index
index = ListIndex.from_documents(url_documents, service_context=service_context)
index.storage_context.persist(persist_dir="index")
Screen display using Streamlit:
import os, sys, json, site, time, logging
from dotenv import load_dotenv
import streamlit as st
from streamlit_chat import message
import tiktoken
from llama_index import (
    download_loader,
    LLMPredictor,
    VectorStoreIndex,
    ServiceContext,
    QuestionAnswerPrompt,
    StorageContext,
    load_index_from_storage,
    SimpleDirectoryReader,
    ListIndex
)
from langchain import OpenAI
from langchain.chat_models import ChatOpenAI
from llama_index.indices.list.base import ListRetrieverMode

load_dotenv()

# Set the log level
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG, force=True)

class QAResponseGenerator:
    def __init__(self, selected_model):
        self.llm_predictor = LLMPredictor(llm=OpenAI(temperature=1, model_name="gpt-3.5-turbo-0613"))
        # Japanese QA prompt, roughly: "The following information is given. ...
        # Referring to this information, answer the following question: {query_str}"
        self.QA_PROMPT_TMPL = (
            "下記の情報が与えられています。 \n"
            "---------------------\n"
            "{context_str}"
            "\n---------------------\n"
            "この情報を参照して次の質問に答えてください: {query_str}\n"
        )
        self.service_context = ServiceContext.from_defaults(llm_predictor=self.llm_predictor)

    def generate(self, question):
        start = time.time()
        storage_context = StorageContext.from_defaults(persist_dir="/home/heart/llama_index/examples/paul_graham_essay/index")
        index = load_index_from_storage(storage_context)
        elapsed_time = time.time() - start
        print("load_elapsed_time:{0}".format(elapsed_time) + "[sec]")
        start = time.time()
        engine = index.as_query_engine(text_qa_template=QuestionAnswerPrompt(self.QA_PROMPT_TMPL), retriever_mode=ListRetrieverMode.EMBEDDING, similarity_top_k=3)
        # The appended Japanese string asks: "Please include source links in your answer."
        result = engine.query(question + "回答には出典のリンクを含めるようにしてください。")
        elapsed_time = time.time() - start
        print("query_time:{0}".format(elapsed_time) + "[sec]")
        return result.response.replace("\n", ""), result.get_formatted_sources(1000)

def display_chat(chat_history):
    for i, chat in enumerate(reversed(chat_history)):
        if "user" in chat:
            message(chat["user"], is_user=True, key=str(i))
        else:
            message(chat["bot"], key="bot_" + str(i))

def main():
    st.title('chatGPT')
    question = st.text_input("Your question")
    # Place the question-submit button on the main screen
    submit_question = st.button("質問")  # "Submit question"
    clear_chat = st.button("履歴消去")  # "Clear history"
    # Keep the chat history in session state
    if "chat_history" not in st.session_state:
        st.session_state["chat_history"] = []
    if clear_chat:
        st.session_state["chat_history"] = []
    # pdf_reader = PDFReader()
    response_generator = QAResponseGenerator("gpt-3.5-turbo-0613")
    # Handle the submit-button click
    if submit_question:
        if question:  # a question has been entered
            # generate() returns (response_text, formatted_sources)
            response, source = response_generator.generate(question)
            # if choice == "表示する":  # "show sources"
            #     response += f"\n\n参照した情報は次の通りです:\n{source}"  # "Referenced information:"
            # Add the question and the response to the chat history
            st.session_state["chat_history"].append({"user": question})
            st.session_state["chat_history"].append({"bot": response})
    display_chat(st.session_state["chat_history"])

if __name__ == "__main__":
    main()
🤖
Hello,
Thank you for reaching out and providing detailed information about your issue. Based on the context you've provided, there could be several reasons why the responses from your LlamaIndex are not reflecting the content of the indexed documents. Here are some possibilities:
1. Data Loading: The function you're using to load the data might not be correctly extracting the content from the webpage. Please ensure that the SimpleWebPageReader is correctly parsing the webpage and extracting the necessary information (see the sketch after this list).
2. Query Matching: The query method of your index_cls might not be implemented correctly, leading to unexpected results. Please check the implementation of the query method in your index_cls.
3. Document Indexing: The insert method of your index_cls might not be correctly indexing the documents. Please check the implementation of the insert method in your index_cls.
4. Index Creation: The from_documents method of your index_cls might not be correctly creating the index. Please check the implementation of the from_documents method in your index_cls.
5. PineconeVectorStore: If the add method in the PineconeVectorStore class is not correctly adding the documents to the Pinecone index, queries against it might not return the expected results. Please check the implementation of the add method in the PineconeVectorStore class.
6. Streaming Response: If you have enabled streaming, it's possible that the responses you're seeing are not complete. Please check your handling of the streaming response.
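For the first point, a quick hedged check (using the URL from your indexing script) that the reader actually extracted page text:
from llama_index import SimpleWebPageReader

documents = SimpleWebPageReader(html_to_text=True).load_data(
    ["http://paulgraham.com/worked.html"]
)
# A correctly parsed page should yield a large character count and readable
# text, not an empty string.
print(len(documents[0].text))
print(documents[0].text[:200])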
Without more specific information about the behavior you're experiencing and the specific implementations of your index_cls and data loading function, it's difficult to provide a more specific answer.
As for the ListIndex.from_documents method, I'm sorry but I couldn't find specific information about its exact functionality in the LlamaIndex codebase. It would be helpful if you could provide more details about the issues you're experiencing with this method.
Lastly, I found a similar issue in the LlamaIndex repository that might be helpful: What is returned when the retrieved documents don't have information for the given question?. The suggested solution is to use more custom prompts and to define both a text QA template and a refine template. You can find more information on custom prompts at https://gpt-index.readthedocs.io/en/latest/how_to/customization/custom_prompts.html.
I hope this helps! If you have any more questions or need further clarification, please don't hesitate to ask.
Best, Dosu
@Ma-Fukudama Hmm, I think the way you built the index is not correct.
Right now, it seems that you are just inserting the names of the files, not their actual content.
Here's a slightly modified version:
from llama_index import ListIndex, SimpleDirectoryReader

text_list = ['A.txt']
# SimpleDirectoryReader reads the contents of each file, not just its name
documents = SimpleDirectoryReader(input_files=text_list).load_data()
index = ListIndex.from_documents(documents, service_context=service_context)
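If it helps, a hedged continuation of that sketch (the persist directory and the question are placeholders), persisting the rebuilt index and querying it:
index.storage_context.persist(persist_dir="index")

query_engine = index.as_query_engine()
response = query_engine.query("What did the author work on?")
print(response)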
@logan-markewich The part that creates the documents was missing from what I posted. It was:
text_list = ['A.txt']
file_documents = [Document(text=t) for t in text_list]
With the code above I can display the file name, but it can't be used as an index of the file contents. And with the code you showed me, I can't get the file name of the quoted source.
@Ma-Fukudama if you want both the filename and the document text, you can set file_metadata in SimpleDirectoryReader: give it a filename_fn that puts the file name into each document's metadata, and SimpleDirectoryReader will take care of the text:
from llama_index import SimpleDirectoryReader
filename_fn = lambda filename: {'file_name': filename}
# automatically sets the metadata of each document according to filename_fn
documents = SimpleDirectoryReader(input_files=text_list, file_metadata=filename_fn).load_data()
print(documents[0].text)
print(documents[0].metadata)
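Assuming the 0.8.x response objects, the file name stored in metadata then shows up on the source nodes of a query response, so it can be cited:
index = ListIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()
response = query_engine.query("What is this document about?")
for source in response.source_nodes:
    # metadata carries the file_name set via filename_fn
    print(source.node.metadata.get("file_name"), source.score)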
Or, maybe I don't understand your goal 😅
@logan-markewich I was able to accomplish what I wanted by using what you taught me. Thank you. The other issues were also resolved by changing from ListIndex to VectorStoreIndex.
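For reference, the change looks roughly like this (a sketch, assuming the same service_context and filename_fn as above):
from llama_index import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader(input_files=['A.txt'], file_metadata=filename_fn).load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
index.storage_context.persist(persist_dir="index")
The persisted index loads back with the same load_index_from_storage call.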
Nice!
Question
I created an index from a file with the following source code, but even when I ask questions, the content is not reflected in the responses. Is there any additional step required when creating an index?
llama_index version: 0.8.13