sdd / kiddo

Kiddo
Apache License 2.0
87 stars 14 forks source link

Deterministic Topology #40

Closed cavemanloverboy closed 1 year ago

cavemanloverboy commented 1 year ago

I have thought of a wonderful use case that could help some of us working on large simulation suites in cosmology. I'm not sure how this may affect performance, but this use case would require deterministic builds.

In theory, the topology of a kdtree (i.e. number of stems, leaves, number of items in each leaf, and the connectivity of all of these) can be identical between any point set which contains the same number of points. This is simply due to the fact that there will always exist a median that can split items into two groups of equal size +/-1, even if the split_val is identical for all points in a leaf.

In its current state, i've noticed that kiddo tree builds are not deterministic in the sense that two point sets of equal size do not produce the same topology, i.e. this test fails

    #[test]
    fn test_deterministic_topology() {
        const TEST_SIZE: usize = 50_000;
        const NUM_TESTS: usize = 100;

        // Populate a base test tree
        const K: usize = 2;
        let mut base = KdTree::<f32, usize, K, 16, u32>::with_capacity(TEST_SIZE);
        for i in 0..TEST_SIZE {
            base.add(&rand::random::<[f32; 2]>(), i);
        }

        for _ in 0..NUM_TESTS {
            // Populate a new test tree
            let mut tree = KdTree::<f32, usize, K, 16, u32>::with_capacity(TEST_SIZE);
            for i in 0..TEST_SIZE {
                tree.add(&rand::random::<[f32; 2]>(), i);
            }

            // Check topology
            assert_eq!(base.stems.len(), tree.stems.len());
            assert_eq!(base.leaves.len(), tree.leaves.len());
            for (bleaf, tleaf) in base.leaves.iter().zip(tree.leaves) {
                assert_eq!(bleaf.size, tleaf.size);
            }
        }
    }

Additionally, I've noticed that panic ensues when a bucket contains too many points with identical cartesian components. However, it should definitely be possible to still choose a median that splits things roughly down the middle. As an extreme example, although useless, it should be possible to construct a well-balanced tree with 1m points stacked on top of each other.

I need this feature. If you'd like to bring this upstream and benchmark it to make sure it doesn't affect performance let me know. Otherwise, I'll make my own fork for just this use case once V3 is out.

If you have any thoughts on how this could most easily be done, let me know.

sdd commented 1 year ago

Hi! Regarding the split for too many points with identical cartesian components. I think that many k-d tree implementations suffer from this issue. I've got a partial fix for this on another branch, where I move the split point so that, if the initial pivot point lands inside a group of points that have the same value on the splitting dimension, then I move the pivot. First I try to move it lower down the bucket, and if I get to the start of the bucket without getting to the end of the group of points with the same value then I move from the original pivot upwards through the bucket, only panicking if the bucket is effectively full of points with the same value in the splitting dimension.

I'm not sure of a way to handle that situation where all points in a bucket have the same value in the splitting dimension, since the buckets in Kiddo 2 and 3 are arrays and so can't be resized in this scenario. The only way to deal with it currently is to increase the bucket size until this doesn't happen for your dataset.

In theory you may be able to address this if this was not your first split by moving the split plane of an ancestor node and rebalancing but that would be very difficult.

sdd commented 1 year ago

This is simply due to the fact that there will always exist a median that can split items into two groups of equal size +/-1, even if the split_val is identical for all points in a leaf.

I'm not sure that this is the case? What if you have, say a bucket of 16 items that needs to be split, and 15 of them have the same value in the dimension that you'll be splitting on? How can you split that into two groups of equal size +/- 1?

sdd commented 1 year ago

I think it could be feasible to pre-set all of the stem values to create a balanced tree. You'd start with a vec of all of your points. You then sort that vec by their value in dimension zero and take the median as the value for the first split plane. You then sort the left half of the array by dimension 1, taking the median value as the split plane value for the left child of the root stem node. You sort the right half of the array by dimension 1, taking its median as the split plane value for the right child of the root stem node. You continue recursively until all of your split plane values are set for the desired number of stem / leaf nodes. Then you insert all of the items in the array into the tree.

Of course all this up-front sorting will be slower. You may potentially end up with some leaves needing to be split due to edge cases if you have groups of points with the same value in one dimension that are the median at a split. But, you could also choose the number of leaves so that you could end up with much higher average leaf fill levels as well, increasing efficiency.

I like the idea of this, I'll give it a go to see how it turns out.

cavemanloverboy commented 1 year ago

This is simply due to the fact that there will always exist a median that can split items into two groups of equal size +/-1, even if the split_val is identical for all points in a leaf.

I'm not sure that this is the case? What if you have, say a bucket of 16 items that needs to be split, and 15 of them have the same value in the dimension that you'll be splitting on? How can you split that into two groups of equal size +/- 1?

you put any given one of the 15 equal ones in the center, split into two groups of 8 and call it a day

sdd commented 1 year ago

But that leaves (no pun intended) half of those points on the wrong side of a split plane? Am I missing something? If your new split plane has a value of say x=15.0f32, and your traversal rule is that if your query val is >= 15 then you traverse to the right child, but some of the points with x=15.0f32 were in the left child, then you can never retrieve them by navigating the tree the normal way?

cavemanloverboy commented 1 year ago

But that leaves (no pun intended) half of those points on the wrong side of a split plane? Am I missing something? If your new split plane has a value of say x=15.0f32, and your traversal rule is that if your query val is >= 15 then you traverse to the right child, but some of the points with x=15.0f32 were in the left child, then you can never retrieve them by navigating the tree the normal way?

the logic for that will reduce to (pseudocode)

check_right(&mut best_dist);
if best_dist > dx {
    check_stem(&mut best_dist);
    check_left(&mut best_dist);
}

If your query point is at x=22.0f32, and a best_dist=1.0f32 is obtained after check_right, it does't need to check the stem or any of the points on that plane.

(In my impl of kdtrees, the stems are set on points. I think for kiddo this is just check_left)

sdd commented 1 year ago

Both v1 and v2 of Kiddo also check both sides, using similar logic. I think that second phase that walks back up the tree, checking the unvisited side if it could possibly contain a point closer to the query than the closest found so far, is what differentiates approx NN (which doesn't do this) from exact NN. So I suppose from a query standpoint, we'd get the expected result as we aren't doing approx NN. It just feels a bit wrong to me to put points in the wrong bucket I guess :-) I think I'd be ok with changing the behaviour when this happens from the current panic! to logging a warning instead and putting the points on the wrong side of the split.

cavemanloverboy commented 1 year ago

If I have some [1, 2, 2, 2, 5] and I use a binary tree (1D kdtree) with left = [1, 2], stem = 2, right = [2, 5], that feels right in my head lol. With a kiddo-type code, [1, 2, 2], [2, 5] would feel a little odd but not that odd.

cavemanloverboy commented 1 year ago

Just wanted to give you a heads up I wrote a special purpose library for this. Needed it now because I am giving a talk next week and another in August on the subject and wanted a proof-of-concept. I also needed zero memory overhead in addition to zero copy deserialization. If you don't need this or want this feature in kiddo, feel free to close the issue.

sdd commented 1 year ago

Nice, good luck with the talk. I will be adding a feature similar to this that I've been rolling around my head for a few days, based on the comment I made a few days ago.

This will be incorporated into kiddo v3, based on the work I've been doing in this branch.