run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
36.47k stars 5.21k forks source link

[Question]: Is there a workaround to the fixed num_chilidren param in TreeIndex? #15611

Open c-jyoung opened 2 months ago

c-jyoung commented 2 months ago

Question Validation

Question

I am using a TreeIndex to consolidate information in my dataset and query it. However I noticed that a parent node will not be created if there are not exactly num_children children available. Is there a way for this parameter to be a maximum, i.e at most num_children nodes are present at each node.

This hard limit makes chunking a much more meticulous process in order for one document to be represented by one connected component, otherwise there are some "islands" present. Let me know if there are any other work arounds, or if I might need to alter my local copy of the source code to achieve this.

Is main reason this is in place so that one chunk is not overrepresented in higher summaries than the others? Could applying weights to parent nodes based on the amount of children they have help when creating a balanced grandparent node?

Here is an example of a graph with num_children = 3,and the parents are not consolidated into a grandparent node since they are 1 node short. tree_index

dosubot[bot] commented 2 months ago

Yes, there is a workaround to the fixed num_children parameter in TreeIndex to allow a parent node to be created with at most num_children children, rather than exactly num_children. The current implementation already supports this behavior. The key part of the code is:

# if under num_children limit, then we're fine
if len(self.index_graph.get_children(parent_node)) <= self.num_children:
    return
else:
    # perform consolidation
    ...

This condition checks if the number of children is less than or equal to num_children. If it is, no further action is taken, allowing the parent node to have at most num_children children [1].

Regarding applying weights to parent nodes based on the number of children they have, there are no existing methods or utilities in the TreeIndex class that support this functionality [2].

To continue talking to Dosu, mention @dosu.

c-jyoung commented 2 months ago

@dosu explain why in this example I provided in the image, there are only two parents with their 3 children, and a grandparent node isn't created. These nodes are all from the same document. I would want a single root node to serve as a parent to the two parents, and summarize the entirety of the document. If the code was like you said and the parents don't need exactly num_children nodes to be created, the grandparent would be created with the two parents even though num_children is 3.

Here is another example where the tree index fails to create one connected component for an entire document: In this case the num_children is 2 and the number of nodes is 11. image

node to children dictionary: {'628bdf3f-9c34-4f0b-9cdd-826be4840952': [], 'e60abd39-75a2-4fe6-af25-7b4d80ed598f': [], '327e571e-4744-421d-99c1-e4d9d21cfe27': [], '0dcc1f0a-6c0d-4e0f-88d6-3564fc92cff8': [], 'f96740e9-4a08-422d-a7e9-c8e0a37ebe20': [], '75602725-02f9-4ffe-adaf-2599a30ba76b': [], '6ff8f831-7d00-4404-8656-3af60ed7f0d7': ['628bdf3f-9c34-4f0b-9cdd-826be4840952', 'e60abd39-75a2-4fe6-af25-7b4d80ed598f'], 'e78efe55-edaa-41c4-845d-90c2d24c40b7': ['327e571e-4744-421d-99c1-e4d9d21cfe27', '0dcc1f0a-6c0d-4e0f-88d6-3564fc92cff8'], '86ea5f13-4116-421c-92c9-3153aa3032d6': ['f96740e9-4a08-422d-a7e9-c8e0a37ebe20', '75602725-02f9-4ffe-adaf-2599a30ba76b'], '968758e4-0ee0-483e-9a27-7d6cd6d4d2d5': ['6ff8f831-7d00-4404-8656-3af60ed7f0d7', 'e78efe55-edaa-41c4-845d-90c2d24c40b7'], '8c8d8941-40bb-405c-8cca-d52c1a18e2d1': ['86ea5f13-4116-421c-92c9-3153aa3032d6']}

In this case a grandparent is again not made for one side of the tree since there is only 1 parent node, thus there is no great-grandparent node that can be created to summarize the entire document. Explain how I can change my code to allow these parts of the tree to merge, so I have on connected component representing my document.

dosubot[bot] commented 2 months ago

The issue you're encountering arises because the TreeIndex class does not dynamically adjust the num_children parameter during the tree-building process. This parameter is fixed and set during the initialization of the TreeIndex instance, which can lead to the creation of "islands" or disconnected components if the number of nodes does not perfectly align with the num_children value.

To ensure a single connected component represents the entire document, you need to explicitly create and link grandparent nodes, even if the number of children does not exactly match the num_children parameter. Here is a modified version of the code that ensures a single connected component by explicitly creating and linking grandparent nodes:

# Create a list to store ingredient tuples
ingredient_tuple = []

# Create indices for each ingredient
for ingredient in ingredients:
    doc = Document(text=ingredient_name + ingredient_description, doc_id=ingredient_name)
    index = GPTSimpleVectorIndex([doc])
    ingredient_tuple.append((ingredient, doc, index))

# Create a list to store product tuples
product_tuple = []

# Create indices for each product using the ingredient indices
for product in products:
    ingredients = get_product_ingredients(product)
    product_index = TreeIndex([x[2] for x in ingredient_tuple if x[0] in ingredients])
    product_tuple.append((product, product_index))

# Create a grandparent node (portfolio index) that includes all product indices
portfolio_index = TreeIndex([x[1] for x in product_tuple])

# Now, portfolio_index represents a single connected component for the entire document

This code ensures that the portfolio_index is a single connected component that includes all product indices, which in turn include all ingredient indices. This approach aligns with the hierarchical structure of products containing ingredients, as demonstrated in the test_build_tree function from the LlamaIndex core tests, where a tree structure is built and verified to contain all nodes and their contents [1][2][3][4].

By explicitly creating and linking grandparent nodes, you can ensure that the entire document is represented by a single connected component, even if the number of children does not perfectly match the num_children parameter.

To continue talking to Dosu, mention @dosu.

c-jyoung commented 2 months ago

Would love to hear a staff member's input, if this is intentional it seems like a very large drawback to using TreeIndex, and that the num-children needs to be adjusted manually each time to ensure documents/data can be represented as one tree. Information that is semantically related, such as being in the same document would not be connected in this index which is worrying.