ingowald / cudaKDTree


performance affected by distribution of points #2

Closed: bricerebsamen closed this issue 1 year ago

bricerebsamen commented 1 year ago

I played with your code and found that the query speed (not the build speed) is affected by the distribution of points. This can be reproduced simply by modifying generatePoints in the test code (where p_ij is the j-th dimension of the i-th point and r is a random number between 0 and 1 obtained from drand48). I tested with 3M points and 25k query points.
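The exact scaling used above isn't reproduced here; as a purely hypothetical illustration of that kind of change, generatePoints could stretch each dimension by a different (made-up) factor, so the cloud is no longer uniform over the unit cube:

    // Hypothetical variant of generatePoints, for illustration only: same
    // drand48()-based generation, but each dimension j is stretched by a
    // different (made-up) scale s_j, so the distribution is anisotropic.
    #include <cstdlib>  // drand48
    #include <vector>

    struct float4_ { float x, y, z, w; };

    std::vector<float4_> generatePoints(int N)
    {
      const float s[4] = { 1.f, 10.f, 100.f, 1.f };  // assumed scales
      std::vector<float4_> pts(N);
      for (auto &p : pts) {
        p.x = s[0] * (float)drand48();
        p.y = s[1] * (float)drand48();
        p.z = s[2] * (float)drand48();
        p.w = s[3] * (float)drand48();
      }
      return pts;
    }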

For the time being I will try to scale the points first in my application, but I was wondering: is this a limitation of the algorithm, or can it be fixed?

ingowald commented 1 year ago

I'll have a look. Did you modify both query and data points or only one of them?

bricerebsamen commented 1 year ago

thanks

In that initial experiment I modified both of them in the same way (just by hacking generatePoints()).

However I did some further experiments on scaling and got some other interesting results.

I started with the idea of scaling the points in the kdtree and in the query so that each dimension would roughly span [0, 1]: generate points as above for the kdtree, compute the 4-dim bounding box, derive a scale and offset for each dimension, then whiten the kdtree points and whiten the query points with the same scale and offset. Obviously, in the test this can be done without having to scan the kdtree points to get their BB...
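A minimal sketch of that whitening step (the float4_ type and function names are assumptions for illustration, not the project's actual API):

    // Sketch: compute the per-dimension bounding box of the data points,
    // then map data and query points with the *same* scale and offset so
    // every data dimension spans roughly [0,1]. Assumes a non-degenerate
    // extent in each dimension. Names/types are illustrative only.
    #include <algorithm>
    #include <vector>

    struct float4_ { float x, y, z, w; };

    void whiten(std::vector<float4_> &data, std::vector<float4_> &queries)
    {
      float lo[4] = {  1e30f,  1e30f,  1e30f,  1e30f };
      float hi[4] = { -1e30f, -1e30f, -1e30f, -1e30f };
      for (const auto &p : data) {
        const float v[4] = { p.x, p.y, p.z, p.w };
        for (int j = 0; j < 4; ++j) {
          lo[j] = std::min(lo[j], v[j]);
          hi[j] = std::max(hi[j], v[j]);
        }
      }
      auto apply = [&](std::vector<float4_> &pts) {
        for (auto &p : pts) {
          float v[4] = { p.x, p.y, p.z, p.w };
          for (int j = 0; j < 4; ++j)
            v[j] = (v[j] - lo[j]) / (hi[j] - lo[j]);  // offset, then scale
          p = { v[0], v[1], v[2], v[3] };
        }
      };
      apply(data);     // whiten the kd-tree points ...
      apply(queries);  // ... and the queries with the same transform
    }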

I did that under several conditions where the query points may have a slightly different offset and scale. In some cases some query points fall outside the BB of the kdtree, and in those cases I found that the query was very slow. However, if the query points are all within the BB of the kdtree then it's very fast.

In my actual project I'm doing ICP from a partial scan to a model. The partial scan may be entirely contained in the model but it may also be partially outside of the model (i.e. on the border).

I believe that's the common denominator here: the kdtree query (FCP btw) doesn't like points outside of the tree...

bricerebsamen commented 1 year ago

for instance, with the following

    for(auto& p : h_queries) {
      p.x = p.x * 0.8 + 0.1;
      p.y = p.y * 0.2 + 0.7;
      p.z = p.z * 0.5 + 0.3;
    }

(i.e. all the points are well within the tree), the query is almost twice as fast as without this loop (400us vs 750us)

with

    for(auto& p : h_queries) {
      p.x = p.x * 1.2 - 0.1;
      p.y = p.y * 1.2 - 0.1;
      p.z = p.z * 1.2 - 0.1;
    }

it's 5 times slower: 4.5ms

and with

    for(auto& p : h_queries) {
      p.x = p.x * 2 - 1;
      p.y = p.y * 2 - 1;
      p.z = p.z * 2 - 1;
    }

it takes 1.5s!

(note that I'm working with 3D points, not 4D)

ingowald commented 1 year ago

That does all sound fishy indeed... I'll have a look the moment I get home; with all the samples you sent it should be easy to find/reproduce. Thanks, those are very good pointers!

bricerebsamen commented 1 year ago

For completeness: I'm working with 3D points, but the kdtree is built with float6 points (the normals are included too), taking into account only the first 3 dims for the tree itself (i.e. the last 3 dimensions are silent payload).

btw, your fcp function doesn't handle that correctly: queryPoint is hardcoded to a float4, and you need to add a dimension extractor to be able to compute the distance and such. For instance, in distance(queryPoint, d_nodes[curr]), in my case queryPoint is of dimension 3 and d_nodes[curr] is of dimension 6, but we only care about the first 3, so we need to make a float3 out of d_nodes[curr]: something like *((query_point_t*)&d_nodes[curr].x), or using the TrivialPointInterface, or, even better, following what you did in buildTree and providing a template argument for the point interface for each type of point...
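A minimal sketch of the kind of point-interface template being suggested (the float6 struct, trait names, and signatures here are assumptions for illustration, not cudaKDTree's actual API):

    // Sketch: a traits-based "point interface" so the distance computation
    // only looks at the first numDims coordinates of a node, even when the
    // stored type carries extra payload such as normals. Assumes the spatial
    // coordinates are laid out contiguously starting at member .x.
    struct float6 { float x, y, z, nx, ny, nz; };

    template<typename node_t>
    struct PointInterface;                 // primary template, left undefined

    template<>
    struct PointInterface<float6> {
      enum { numDims = 3 };                // only x,y,z are spatial
      static __host__ __device__
      float getCoord(const float6 &p, int d) { return (&p.x)[d]; }
    };

    template<typename query_t, typename node_t,
             typename NI = PointInterface<node_t>>
    __host__ __device__
    float sqrDistance(const query_t &q, const node_t &n)
    {
      float sum = 0.f;
      for (int d = 0; d < NI::numDims; ++d) {
        const float diff = (&q.x)[d] - NI::getCoord(n, d);
        sum += diff * diff;
      }
      return sum;
    }

With something like that in place, a float3 query against float6 nodes would only ever compare the first three dimensions.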

ingowald commented 1 year ago

I'll eventually check both, but for now: which kernel did you see this on? fcp or knn?

ingowald commented 1 year ago

Ignore that, I just saw you mentioned it above. My bad. fcp it is.

ingowald commented 1 year ago

Hmmmm. Actually this may all be "perfectly correct" after all. The "problem" is that kd-trees aren't particularly good at cases where the query points are further away. Say you had 2D data, uniform random [0,1]^2 data points, and a query at (.99,.95). In that case you'll quickly traverse down to a leaf in the tree and find, say, (.98,.96) as a first candidate point. From that point on you know you already have a candidate that's only length((.01,.01)) away, and that every other subtree of the tree can be culled right away. So very quickly your traversal will run out of subtrees that need traversing, and it will terminate.

Now take a query at (3,3). It may still find the same data point at (.98,.96) (after all, it's very close to the top-right corner), but the distance between query point and data point is now about length((2,2)), i.e. roughly 2.8. Since the typical kd-tree query code (stackless or not) only considers the projected distance along the given node's split dimension, almost none of the subtrees will actually get culled: a split plane at coordinate .5, say, has a projected distance of only 2.5 from the query in either dimension, less than the 2.8 we would need to cull its far side!
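In code, the textbook culling test looks something like the sketch below (illustrative only, not the library's exact kernel): the far child of a node can be skipped only when the projected distance to the split plane already exceeds the best distance found so far.

    // Sketch of the textbook kd-tree culling test (illustrative only).
    // For a query at (3,3) over data in [0,1]^2, splitPos is always in
    // [0,1], so projDist stays between 2 and 3 while the best distance is
    // already ~2.8: this test almost never fires, and nearly every
    // subtree ends up being visited.
    inline bool canCullFarChild(const float *query, int splitDim,
                                float splitPos, float bestDist2)
    {
      const float projDist = query[splitDim] - splitPos;
      return projDist * projDist >= bestDist2;
    }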

For cases such as that you need to track the actual bounding box of each subtree during traversal, but textbook kd-tree traversal code (and thus my sample code) doesn't do that (and if you do, it's pretty nasty on the stack). The much better solution here is to use a BVH instead of a kd-tree: with a BVH you know the actual (not projected) distance to each subtree at any point in time, which makes it much better for such cases.
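For contrast, a BVH node stores an actual box, so traversal can cull against the true point-to-box distance; a minimal sketch of that computation using the standard clamp trick (names are illustrative):

    // Sketch: squared distance from a query point to an axis-aligned box,
    // the quantity a BVH traversal can cull against. Unlike the projected
    // plane test, this is a true lower bound on the distance to anything
    // in the subtree, so any box farther away than the current best
    // candidate is culled immediately, however far outside the data the
    // query lies.
    #include <algorithm>

    inline float sqrDistToBox(const float *q, const float *boxLo,
                              const float *boxHi, int numDims)
    {
      float sum = 0.f;
      for (int d = 0; d < numDims; ++d) {
        const float c    = std::min(std::max(q[d], boxLo[d]), boxHi[d]);
        const float diff = q[d] - c;   // 0 when q[d] lies inside the slab
        sum += diff * diff;
      }
      return sum;
    }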

ingowald commented 1 year ago

btw: if you want, I'd be more than happy to continue this discussion over email or slack - send me an email please!

bricerebsamen commented 1 year ago

@ingowald and I had a chat about this. In short, this behavior is expected with a kd-tree: the search for the nearest neighbor takes a long time because it has to explore many side branches. I created a branch with some options to allow for approximate nearest-neighbor search instead, which can be 2 orders of magnitude faster in some cases: #4. Closing this issue.