run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Question]: Splitting of metadata in `CompactAndRefine` #15054

Open · theta-lin opened 1 month ago

theta-lin commented 1 month ago

Question

When using CompactAndRefine as a response synthesizer, suppose that I have retrieved a document/node that consists of two parts: metadata and text. When the node is presented to CompactAndRefine, one possible situation is that the node is too large to fit in the context window and is therefore split in the middle of the text. The first LLM call might then look like this:

prompt
metadata
text part 1

And the second LLM call would look like this:

prompt
previous summary
text part 2

However, this would mean that the second LLM call does not actually provide the LLM with the metadata of the node (unless the LLM is smart enough to keep it in the previous summary). Ideally, I think the second LLM call should look like this:

prompt
previous summary
metadata
text part 2

The relevant code is as follows; you can see that the nodes are simply converted to text, concatenated together, and then split into chunks. https://github.com/run-llama/llama_index/blob/d3abf789800f4366fec7f607be15804a4a72ee52/llama-index-core/llama_index/core/response_synthesizers/base.py#L251-L257 https://github.com/run-llama/llama_index/blob/d3abf789800f4366fec7f607be15804a4a72ee52/llama-index-core/llama_index/core/response_synthesizers/compact_and_refine.py#L50-L57 https://github.com/run-llama/llama_index/blob/d3abf789800f4366fec7f607be15804a4a72ee52/llama-index-core/llama_index/core/indices/prompt_helper.py#L276-L280
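
To illustrate the behavior concretely, here is a minimal sketch that forces repack() to split a flattened node mid-text (the context size, file name, and node contents are made up for demonstration):

from llama_index.core import PromptTemplate
from llama_index.core.indices.prompt_helper import PromptHelper

# Deliberately tiny context window to force a mid-text split
prompt_helper = PromptHelper(
    context_window=256, num_output=64, chunk_overlap_ratio=0.0
)

qa_prompt = PromptTemplate("Context:\n{context_str}\nQuestion: {query_str}\n")

# As in get_response() above, the node is flattened to a single string
# (metadata header followed by text) before repacking
node_text = "filename: report.pdf\npage: 12\n" + "word " * 500

# repack() joins all chunk texts and re-splits them to fit the prompt;
# the metadata header survives only in the first chunk
chunks = prompt_helper.repack(qa_prompt, text_chunks=[node_text])
for i, chunk in enumerate(chunks):
    print(f"chunk {i} starts with: {chunk[:40]!r}")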

I think split_text_metadata_aware() at https://github.com/run-llama/llama_index/blob/d3abf789800f4366fec7f607be15804a4a72ee52/llama-index-core/llama_index/core/node_parser/text/token.py#L109-L110 might be intended for this purpose. However, this method is not actually used anywhere in the code outside of the tests.
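
For reference, that method can be called directly on a text splitter; here is a minimal sketch (the chunk size and strings are illustrative):

from llama_index.core.node_parser import TokenTextSplitter

splitter = TokenTextSplitter(chunk_size=64, chunk_overlap=0)

metadata_str = "filename: report.pdf\npage: 12"
text = "word " * 500

# Reserves room for the metadata string in each chunk's token budget,
# so that the metadata can be re-attached to every chunk without
# exceeding the chunk size
chunks = splitter.split_text_metadata_aware(text, metadata_str)
print(len(chunks))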

Therefore, what would be the best way to preserve node metadata when a node is split across multiple chunks, and how might CompactAndRefine be improved to support this?

dosubot[bot] commented 1 month ago

To help preserve node metadata after splitting a node across multiple chunks, you can use the CompactAndRefine response synthesizer in combination with the QueryFusionRetriever. This approach helps keep metadata available during the retrieval and synthesis process.

Here is an example of how you can set this up:

from llama_index.core.response_synthesizers import CompactAndRefine
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

# Create vector and text retrievers
# (`hybrid_index` is assumed to be an existing index that supports both
# the default dense and the sparse query modes)
vector_retriever = hybrid_index.as_retriever(
    vector_store_query_mode="default",
    similarity_top_k=5,
)
text_retriever = hybrid_index.as_retriever(
    vector_store_query_mode="sparse",
    similarity_top_k=5,  # interchangeable with sparse_top_k in this context
)

# Use QueryFusionRetriever to combine vector and text retrievers
retriever = QueryFusionRetriever(
    [vector_retriever, text_retriever],
    similarity_top_k=5,
    num_queries=1,  # set this to 1 to disable query generation
    mode="relative_score",
    use_async=False,
)

# Use CompactAndRefine to synthesize responses
response_synthesizer = CompactAndRefine()
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# Query the engine
response = query_engine.query(
    "Who does Paul Graham think of with the word schtick, and why?"
)
print(response)

The CompactAndRefine class refines responses across compacted text chunks. Its _make_compact_text_chunks method repacks the text chunks to fit within the prompt's token limit, keeping as much context as possible in each LLM call [1].

To further help preserve metadata, you can use the SemanticDoubleMergingSplitterNodeParser class from LlamaIndex. This node parser splits text into semantically coherent chunks, and when it is used through the standard node-parser interface, each resulting node inherits the source document's metadata.

Here is an example of how to use the SemanticDoubleMergingSplitterNodeParser:

from llama_index.core import Document
from llama_index.core.node_parser import (
    SemanticDoubleMergingSplitterNodeParser,
    LanguageConfig,
)

# Initialize the SemanticDoubleMergingSplitterNodeParser with appropriate parameters
language_config = LanguageConfig(language="english", spacy_model="en_core_web_md")
node_parser = SemanticDoubleMergingSplitterNodeParser(
    language_config=language_config,
    initial_threshold=0.6,
    appending_threshold=0.8,
    merging_threshold=0.8,
    max_chunk_size=1000,
    merging_range=1,
)

# Wrap the text in a Document carrying the metadata; each node produced
# by the parser inherits that metadata
doc = Document(
    text="Your long text here...",
    metadata={"info": "Metadata information here..."},
)
nodes = node_parser.get_nodes_from_documents([doc])

# Process the nodes as needed
for node in nodes:
    print(node.metadata, node.get_content())

By using the SemanticDoubleMergingSplitterNodeParser together with the CompactAndRefine response synthesizer, you can help ensure that metadata is preserved and used during retrieval and synthesis, producing more accurate and contextually relevant responses [2].

To continue talking to Dosu, mention @dosu.

logan-markewich commented 1 month ago

Since by the time it splits, all it has is raw text, there's no easy way to ensure metadata doesn't get split. Overall, though, I think this is mostly a lower-priority edge case, especially as LLM context windows get larger and larger.

theta-lin commented 1 month ago

Yes, I agree that this is of lower priority when the LLM has a larger context window. However, I think it is still possible to take metadata into account when splitting, but this would require either having CompactAndRefine use split_text_metadata_aware() directly or adding metadata handling to PromptHelper. I would work on a solution if the need arises.
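
For illustration, here is a rough sketch of the idea (repack_metadata_aware() is a hypothetical helper, not an existing API):

from typing import List

from llama_index.core.node_parser import TokenTextSplitter
from llama_index.core.schema import MetadataMode, NodeWithScore


def repack_metadata_aware(
    nodes: List[NodeWithScore], chunk_size: int
) -> List[str]:
    """Hypothetical helper: split each node metadata-aware and re-attach
    the metadata string to every resulting chunk."""
    splitter = TokenTextSplitter(chunk_size=chunk_size, chunk_overlap=0)
    chunks: List[str] = []
    for n in nodes:
        metadata_str = n.node.get_metadata_str(mode=MetadataMode.LLM)
        text = n.node.get_content(metadata_mode=MetadataMode.NONE)
        for piece in splitter.split_text_metadata_aware(text, metadata_str):
            # Prepend the metadata so that every LLM call sees it,
            # not just the call that receives the first chunk
            chunks.append(f"{metadata_str}\n\n{piece}")
    return chunks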