run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: What's the recommended way to save summaries from multiple documents? #9828

Closed kilimchoi closed 8 months ago

kilimchoi commented 8 months ago

Question

I've been reading https://docs.llamaindex.ai/en/stable/examples/usecases/City_Analysis-Decompose-KeywordTable.html and trying to do something similar using an index saved in the vector db. Is it generally recommended to save each document's text together with its summary in the metadata, so that both can be fetched later to initialize a graph? Or is it better to use the same summary for every document regardless of its content? Is there a resource that shows how to load the documents and summaries from the vector db?

dosubot[bot] commented 8 months ago

🤖

Hello @kilimchoi! :wave:

I'm Dosu, a friendly bot here to help you with your questions and issues related to LlamaIndex while we wait for a human maintainer. I'm here to guide you, help solve bugs, and assist you in becoming an even better contributor! Feel free to ask me anything you need. Let's make your LlamaIndex experience as smooth as possible together. :smiley:

In the LlamaIndex framework, it is generally recommended to save both the document text and its unique summary for later use in initializing a graph. This is based on the functionality of the DocumentSummaryIndex class, which generates a unique summary for each document and stores it in the summary_node_dict.

The unique summary for each document is generated by the synthesize method of the _response_synthesizer object. This method takes a query (the index's summary query) and the list of nodes belonging to a single document, and generates a response from the content of those nodes. That response is then stored as the document's summary.

Here is the relevant code snippet from the DocumentSummaryIndex class:

for doc_id, nodes in iterable_with_progress:
    nodes_with_scores = [NodeWithScore(node=n) for n in nodes]
    # get the summary for each doc_id
    summary_response = self._response_synthesizer.synthesize(
        query=self._summary_query,
        nodes=nodes_with_scores,
    )
    summary_response = cast(Response, summary_response)
    summary_node_dict[doc_id] = TextNode(
        text=summary_response.response,
        relationships={
            NodeRelationship.SOURCE: RelatedNodeInfo(node_id=doc_id)
        },
    )
    self.docstore.add_documents([summary_node_dict[doc_id]])
    logger.info(
        f"> Generated summary for doc {doc_id}: " f"{summary_response.response}"
    )

(Source: llama_index/indices/document_summary/base.py)

This indicates that each document has its own unique summary, and the same summary is not used for each document regardless of the document's content. Therefore, it would be beneficial to save both the document text and its summary for later use in initializing a graph.
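As an illustration of keeping each document's text and its own summary together so both can be reloaded later, here is a minimal sketch in plain Python. It does not use the LlamaIndex API: the `StoredDoc` record, the `save_docs` and `load_summaries` helpers, and the JSON file standing in for the vector db are all hypothetical names chosen for this example.

```python
import json
import tempfile
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class StoredDoc:
    """One record as it might sit in a store (hypothetical shape)."""
    doc_id: str
    text: str
    summary: str  # unique per document, saved next to the text


def save_docs(docs, path):
    """Write every record, text and summary together, to a JSON file."""
    Path(path).write_text(json.dumps([asdict(d) for d in docs]))


def load_summaries(path):
    """Reload the records and return a doc_id -> summary mapping."""
    records = json.loads(Path(path).read_text())
    return {r["doc_id"]: r["summary"] for r in records}


docs = [
    StoredDoc("toronto", "Toronto is the capital of Ontario ...", "Overview of Toronto."),
    StoredDoc("seattle", "Seattle is a port city ...", "Overview of Seattle."),
]
# The JSON file is only a stand-in for a real vector db (assumption).
store_path = Path(tempfile.gettempdir()) / "doc_summaries.json"
save_docs(docs, store_path)
summaries = load_summaries(store_path)
```

With the summaries reloaded as a `doc_id -> summary` mapping, whatever is built on top (a graph, a router, an agent) can be initialized without regenerating them.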

I hope this answers your question. If you have any further queries, feel free to ask.


kilimchoi commented 8 months ago

Also, what's the recommended approach for running complex queries over multiple documents? I noticed that it's pretty slow if I use the composable graph.

logan-markewich commented 8 months ago

The composable graph has not been updated in a hot minute (and will likely be removed at a later date)

Current approaches for multi-doc revolve around a few things. Mostly they boil down to providing methods to route queries to the proper locations:

https://docs.llamaindex.ai/en/stable/module_guides/querying/router/root.html
https://docs.llamaindex.ai/en/stable/examples/query_engine/sub_question_query_engine.html
https://docs.llamaindex.ai/en/stable/examples/agent/multi_document_agents.html
https://docs.llamaindex.ai/en/stable/examples/retrievers/recursive_retriever_nodes.html
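To make the "route queries to the proper locations" idea concrete, here is a rough sketch in plain Python. It is not the LlamaIndex router: the `tokenize` and `route` functions are hypothetical, and simple word overlap against each document's summary stands in for whatever selector (for example an LLM) a real router would use.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def route(query, summaries):
    """Return the doc_id whose summary shares the most words with the query."""
    query_words = tokenize(query)
    return max(
        summaries,
        key=lambda doc_id: len(query_words & tokenize(summaries[doc_id])),
    )


# Hypothetical per-document summaries, keyed by doc_id.
summaries = {
    "toronto": "Population, transit, and climate of Toronto",
    "seattle": "Population, economy, and climate of Seattle",
}
chosen = route("What is the economy of Seattle like?", summaries)
```

Only the selected document's index then needs to be queried, which is why routing tends to be cheaper than traversing every document for each question.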

kilimchoi commented 8 months ago

Thanks