parthsarthi03 / raptor

The official implementation of RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
https://arxiv.org/abs/2401.18059
MIT License
904 stars 129 forks

The reason for stopping clustering (num_layers) #53

Open SCULX opened 1 month ago

SCULX commented 1 month ago

Hello, thank you for your excellent work; I appreciate the idea in the paper. While reading the paper and checking this code, I still couldn't resolve one question. In the paper:

...and summarization continues until further clustering becomes infeasible...

I want to know how you handled "until further clustering becomes infeasible", because I think deciding when to stop clustering is a difficult problem. I carefully checked and debugged the code and found the following key lines:

if len(node_list_current_layer) <= self.reduction_dimension + 1:
    self.num_layers = layer
    logging.info(
        f"Stopping Layer construction: Cannot Create More Layers. Total Layers in tree: {layer}"
    )
    break

So why stop building new layers under the condition len(node_list_current_layer) <= self.reduction_dimension + 1? I didn't find the reason in the paper, nor the answer I was looking for in the issues. It would be great if there were some reference materials.
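For reference, the guard in question is a simple count check. A minimal sketch of how it can be read (the function name and the default value of 10 for `reduction_dimension` are illustrative assumptions, not the repo's API):

```python
def should_stop_layering(num_nodes: int, reduction_dimension: int) -> bool:
    """Return True when too few nodes remain to build another tree layer.

    After the embeddings are reduced to `reduction_dimension` dimensions,
    clustering needs strictly more than reduction_dimension + 1 points,
    so the layer-building loop halts below that count.
    """
    return num_nodes <= reduction_dimension + 1

# Assuming reduction_dimension = 10: 11 remaining nodes stop the loop,
# while 12 allow one more layer to be built.
```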

parthsarthi03 commented 1 month ago

Thank you for your question. The stopping condition len(node_list_current_layer) <= self.reduction_dimension + 1 is used because clustering becomes challenging and unreliable when the number of points is less than or very close to the dimensionality of the data. I believe the specific library we used also throws an error when this happens. Hope that explains it.
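The geometric intuition here can be checked numerically: with n points in d dimensions where n ≤ d, the sample covariance matrix is rank-deficient (singular), so Gaussian-mixture-style clustering of the reduced embeddings becomes ill-conditioned. A small numpy sketch illustrating this (not the repo's code; d = 10 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10  # target dimensionality after reduction

# Too few points: n = d, so after mean-centering the covariance
# has rank at most n - 1 = d - 1 and is singular.
few = rng.standard_normal((d, d))
cov_few = np.cov(few, rowvar=False)  # d x d sample covariance

# Enough points: n > d + 1, covariance is generically full rank.
many = rng.standard_normal((5 * d, d))
cov_many = np.cov(many, rowvar=False)

rank_few = np.linalg.matrix_rank(cov_few)
rank_many = np.linalg.matrix_rank(cov_many)
```

With the singular covariance, a Gaussian component cannot be fit reliably, which matches the "unreliable near the dimensionality" explanation above.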

SCULX commented 1 month ago

> Thank you for your question. The stopping condition len(node_list_current_layer) <= self.reduction_dimension + 1 is used because clustering becomes challenging and unreliable when the number of points is less than or very close to the dimensionality of the data. I believe the specific library we used also throws an error when this happens. Hope that explains it.

Thank you for the reply, which helps me a lot.

So, can I understand it this way: the condition len(node_list_current_layer) <= self.reduction_dimension + 1 stops clustering because of a technical limitation of the implementation, not because of the document clustering itself. That is, when the loop stops under this condition, some of the current summary nodes may in fact still be similar to each other, so theoretically clustering could continue; it is just that another round of clustering is not feasible in the implementation.
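The distinction in the question can be made concrete: the guard fires based purely on the node count, regardless of how similar the remaining summary nodes still are. A toy illustration (embedding size, `reduction_dimension = 10`, and the near-duplicate construction are all hypothetical):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
reduction_dimension = 10

# Five summary-node embeddings that are near-duplicates: semantically
# they could plausibly be merged by another round of clustering.
base = rng.standard_normal(64)
nodes = [base + 0.01 * rng.standard_normal(64) for _ in range(5)]
still_similar = all(cosine(nodes[0], v) > 0.99 for v in nodes[1:])

# ...but the count-based guard halts layer construction anyway.
stops = len(nodes) <= reduction_dimension + 1
```

So the stop is a technical guard on feasibility, not a judgment that the nodes have become dissimilar.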