run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Summarize a long document? #13071

Open chigkim opened 2 months ago

chigkim commented 2 months ago

Question

I'm trying to summarize a long document. I've loaded a PDF that SimpleDirectoryReader split into 74 documents. As expected, DocumentSummaryIndex.from_documents goes through each document and creates a summary via the selected LLM. However, query_engine.query("Summarize the documents") selects only one node and sends it to the LLM to summarize. How do I make it summarize the entire document using all of the nodes? Is there a different QueryEngine or Index I should use for this purpose?

```python
from llama_index.core import SimpleDirectoryReader, DocumentSummaryIndex, get_response_synthesizer

# Load the PDF (split into multiple Document objects) and build a summary index.
documents = SimpleDirectoryReader(input_files=path).load_data()
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize", use_async=True)
doc_summary_index = DocumentSummaryIndex.from_documents(documents, response_synthesizer=response_synthesizer)

query_engine = doc_summary_index.as_query_engine(response_mode="tree_summarize", use_async=True)
response = query_engine.query("Summarize the documents")
print(response.response)
print(len(response.source_nodes))
```

I'm sure there's got to be a tutorial on this, but I couldn't find one. The closest I could find was Document Summary Index - LlamaIndex, but it's not exactly what I'm looking for; it's more of a question-and-answer system driven by embeddings/LLM. Also, I'm not sure what the advantage of DocumentSummaryIndex over VectorStoreIndex is for Q&A.

Thanks so much for your help!

dosubot[bot] commented 2 months ago

To achieve a comprehensive summary of your entire document split into 74 parts, you should consider using the QASummaryQueryEngineBuilder from LlamaIndex. This builder allows for the creation of a query engine that is capable of handling both question answering and summarization tasks across multiple document segments. Here's a streamlined approach:

  1. Prepare Your Documents: Make sure each document part is properly represented as a Document instance.

  2. Instantiate QASummaryQueryEngineBuilder: This builder is designed to accommodate customization, including the use of a specific Large Language Model (LLM), embedding models, and more.

  3. Build the Query Engine: Utilize the build_from_documents method to process your documents and create a RouterQueryEngine. This engine is configured for both QA and summarization tasks, enabling it to generate summaries across all document parts.

  4. Perform Summarization Query: With the RouterQueryEngine, you can now query for a summary that encompasses all parts of your document (see the sketch after this list).
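
For illustration, here is a minimal sketch of steps 2-4, assuming the `documents` list from your snippet above and that default models (or an LLM of your choice) are configured; exact constructor arguments can vary by version, so treat this as a starting point rather than a definitive recipe:

```python
from llama_index.core.composability import QASummaryQueryEngineBuilder

# Build a router engine that exposes both a QA tool and a summarization tool
# over all of the loaded document parts.
query_engine_builder = QASummaryQueryEngineBuilder()
query_engine = query_engine_builder.build_from_documents(documents)

# A summary-style request should be routed to the summarization branch,
# which synthesizes over all nodes rather than a single retrieved one.
response = query_engine.query("Summarize the documents")
print(response.response)
```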

Regarding the difference between DocumentSummaryIndex and VectorStoreIndex: the former generates and stores an LLM-written summary for each document, which can be used both for retrieval and for producing quick document-level insights from large texts. VectorStoreIndex, by contrast, embeds text chunks and retrieves the top-k most similar ones at query time, which is better suited to targeted question answering than to whole-document summarization.

For implementation details and further guidance, you might want to explore the QASummaryQueryEngineBuilder documentation and examples provided by LlamaIndex. This will give you a clearer understanding of how to set up and use the query engine for document summarization across multiple segments.

#### Details

**Sources**

- [llama-index-legacy/llama_index/legacy/indices/document_summary/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/indices/document_summary/base.py)
- [llama-index-core/llama_index/core/composability/joint_qa_summary.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/composability/joint_qa_summary.py)
- [llama-index-legacy/llama_index/legacy/composability/joint_qa_summary.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/composability/joint_qa_summary.py)
- [llama-index-core/llama_index/core/data_structs/struct_type.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/data_structs/struct_type.py)
- [llama-index-legacy/llama_index/legacy/readers/file/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/readers/file/base.py)
- [docs/docs/examples/data_connectors/simple_directory_reader.ipynb](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/data_connectors/simple_directory_reader.ipynb)
- [docs/docs/examples/data_connectors/simple_directory_reader_parallel.ipynb](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/data_connectors/simple_directory_reader_parallel.ipynb)
- [llama-index-core/llama_index/core/response_synthesizers/__init__.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/response_synthesizers/__init__.py)
- [llama-index-core/llama_index/core/response_synthesizers/factory.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/response_synthesizers/factory.py)
- [llama-index-legacy/llama_index/legacy/response_synthesizers/factory.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/response_synthesizers/factory.py)
- [llama-index-legacy/llama_index/legacy/response_synthesizers/tree_summarize.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/response_synthesizers/tree_summarize.py)
- [docs/docs/module_guides/querying/response_synthesizers/index.md](https://github.com/run-llama/llama_index/blob/main/docs/docs/module_guides/querying/response_synthesizers/index.md)
**About Dosu** This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

chigkim commented 2 months ago

Thanks @dosu. How do you use Ollama as the LLM for QASummaryQueryEngineBuilder? It wants to use OpenAI even though I specified the llm parameter.

```python
llm = Ollama(model="llama3", request_timeout=600, base_url=host, additional_kwargs=options)
query_engine_builder = QASummaryQueryEngineBuilder(llm=llm)
```

```
ValueError: No API key found for OpenAI.
Please set either the OPENAI_API_KEY environment variable or openai.api_key prior to initialization.
API keys can be found or created at https://platform.openai.com/account/api-keys
```

dosubot[bot] commented 2 months ago

The error requesting an OpenAI API key, even though you specified Ollama as the LLM for QASummaryQueryEngineBuilder, usually means that some part of the pipeline is still falling back to the default OpenAI models, often because of how the ServiceContext is initialized or used in your code. Here's how to explicitly use your local LLM and avoid this error:

  1. Explicitly Set the LLM in the ServiceContext: Make sure your local LLM (Ollama in this case) is set in the ServiceContext, so it overrides any default behavior that falls back to OpenAI's LLM.

```python
from llama_index.llms.ollama import Ollama  # llama-index >= 0.10 import path
from llama_index.core import ServiceContext

# Initialize the local LLM explicitly so nothing falls back to OpenAI.
llm = Ollama(model="llama3", request_timeout=600, base_url=host, additional_kwargs=options)
# If embeddings also default to OpenAI in your setup, pass a local embed_model here as well.
service_context = ServiceContext.from_defaults(llm=llm)
```
  2. Set a Global Service Context: To ensure your entire application uses this ServiceContext, set it globally at start-up. That way, every part of your pipeline uses the specified LLM.

```python
from llama_index.core import set_global_service_context

set_global_service_context(service_context)
```

By following these steps, you should be able to bypass the error related to the OpenAI API key and use Ollama as your LLM for the QASummaryQueryEngineBuilder. This method ensures that your specified LLM is used throughout your application, avoiding any unintended fallbacks to default settings that might cause issues.
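
Putting it together, a minimal end-to-end sketch, assuming llama-index >= 0.10, the `documents`, `host`, and `options` variables from your earlier snippets, and that the OpenAI error may also come from the default embedding model (so a local embedding model is configured too); this is a starting point, not a guaranteed fix for every version:

```python
from llama_index.core import ServiceContext, set_global_service_context
from llama_index.core.composability import QASummaryQueryEngineBuilder
from llama_index.llms.ollama import Ollama

llm = Ollama(model="llama3", request_timeout=600, base_url=host, additional_kwargs=options)

# Register the local LLM (and a local embedding model, since embeddings also
# default to OpenAI) so nothing in the pipeline falls back to OpenAI.
# embed_model="local" assumes the llama-index-embeddings-huggingface package is installed.
set_global_service_context(ServiceContext.from_defaults(llm=llm, embed_model="local"))

query_engine_builder = QASummaryQueryEngineBuilder(llm=llm)
query_engine = query_engine_builder.build_from_documents(documents)
print(query_engine.query("Summarize the documents").response)
```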

#### Details

**Sources**

- [Can DocumentSummaryIndex be used with local langchain llm?](https://github.com/run-llama/llama_index/discussions/9780)
- [How to use PandasQueryEngine with other LLM](https://github.com/run-llama/llama_index/discussions/9619)
- [[Question]: Joint QA Summary Query Engine no openAI](https://github.com/run-llama/llama_index/issues/12749)
- [[Question]: ValueError: No API key found for OpenAI.](https://github.com/run-llama/llama_index/issues/10094)
- [openai.error.AuthenticationError: No API key provided. You can set your API key in code using 'openai.api_key = ', or you can set the environment variable OPENAI_API_KEY=). If your API key is stored in a file, you can point the openai module at it with 'openai.api_key_path = '. You can generate API keys in the OpenAI web interface.](https://github.com/run-llama/llama_index/issues/6756)
- [[Bug]: The `_get_llm_metadata` function reports an error: "Did not find openai_api_key, please add an environment variable `OPENAI_API_KEY` which contains it, or pass `openai_api_key` as a named parameter." (type=value_error)](https://github.com/run-llama/llama_index/issues/6513)
- [llama-index-legacy/llama_index/legacy/llms/ollama.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/llms/ollama.py)
- [llama-index-integrations/llms/llama-index-llms-ollama/llama_index/llms/ollama/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/llms/llama-index-llms-ollama/llama_index/llms/ollama/base.py)
- [llama-index-core/llama_index/core/composability/joint_qa_summary.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/composability/joint_qa_summary.py)
- [llama-index-integrations/llms/llama-index-llms-ollama/tests/test_llms_ollama.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/llms/llama-index-llms-ollama/tests/test_llms_ollama.py)
- [llama-index-integrations/multi_modal_llms/llama-index-multi-modal-llms-ollama/llama_index/multi_modal_llms/ollama/__init__.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/multi_modal_llms/llama-index-multi-modal-llms-ollama/llama_index/multi_modal_llms/ollama/__init__.py)
- [llama-index-core/llama_index/core/query_engine/sub_question_query_engine.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/query_engine/sub_question_query_engine.py)
- [llama-index-core/llama_index/core/llms/utils.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/llms/utils.py)
- [llama-index-legacy/llama_index/legacy/llms/utils.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/llms/utils.py)
- [llama-index-core/llama_index/core/settings.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/settings.py)
**About Dosu** This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.