imglib / imglib2

A generic next-generation Java library for image processing
http://imglib2.net/

Revise KDTree implementation to store coordinates in dense arrays #333

Closed tpietzsch closed 3 months ago

tpietzsch commented 1 year ago

This PR revises the implementation of KDTree.

The old implementation created KDTreeNode objects for each point, with references to children to represent the tree structure.

The new implementation does not store explicit links. Instead, coordinates of all points are stored in a flat double[] array (or alternatively one double[] array for each dimension). The tree structure is represented purely by the order of points in the array.

The nodes in the tree are arranged in Eytzinger layout (with 0-based indexing, the children of node i are at 2i+1 and 2i+2). Additionally, pivot indices are chosen such that "leaf layers" are filled from the left. For example, 10 nodes will always be arranged like this:

             0
          /     \
        1         2
      /   \     /   \
     3     4   5     6
    / \   /
   7   8 9

never like this:

             0
          /     \
        1         2
      /   \     /   \
     3     4   5     6
    /         /     /
   7         8     9

By choosing pivots in this way, the tree structure is fully determined. For every node index, the child indices can be calculated without dependent reads. And a child exists iff its calculated index is less than the number of nodes.
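As a sketch of the index arithmetic described above (method names are illustrative, not the actual imglib2 API):

```java
// Sketch of the Eytzinger index arithmetic (0-based indexing).
// Names are hypothetical, not the actual imglib2 API.
public class EytzingerIndex {

    // Children of node i in Eytzinger layout.
    static int leftChildIndex(int i) {
        return 2 * i + 1;
    }

    static int rightChildIndex(int i) {
        return 2 * i + 2;
    }

    // A child exists iff its computed index is within the number of nodes.
    static boolean exists(int childIndex, int numPoints) {
        return childIndex < numPoints;
    }

    public static void main(String[] args) {
        // For the 10-node example above: node 3 has children 7 and 8,
        // node 4 has left child 9 but no right child (index 10 >= 10).
        System.out.println(leftChildIndex(3));              // 7
        System.out.println(rightChildIndex(3));             // 8
        System.out.println(leftChildIndex(4));              // 9
        System.out.println(exists(rightChildIndex(4), 10)); // false
    }
}
```

Because no pointers are followed, the child index is available immediately; the only data-dependent read is the coordinate lookup itself.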

Less memory overhead

This has considerably less memory overhead. For example, storing 1,000,000 3D RealPoints into a KDTree: the list of points (the ArrayList and the RealPoint objects) requires ~82MB of memory. The old KDTree added ~90MB on top of that, just for storing the tree structure. Despite copying the point coordinates, the new KDTree only adds ~22MB (essentially just the double[3_000_000] array required to store the coordinates).

Better performance

Building trees is about 2x faster than with the old implementation, and neighbor searches are slightly faster. By using a different kthElement implementation, this PR also resolves #326: building the tree remains fast even when many points have identical coordinates in a particular dimension.

Easy Serialization

Note that these KDTrees can be serialized very efficiently (for example into N5), by just storing/loading the flattened coordinate array as is (avoiding the overhead of serializing the object structure of the interlinked KDTreeNodes).
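Since the tree structure is implicit in the point order, serializing a FLAT tree reduces to dumping the coordinate array verbatim. A minimal round-trip sketch (class and method names are made up for illustration; the actual N5 serialization code lives elsewhere):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Sketch: a FLAT KDTree can be (de)serialized by writing the flattened
// coordinate array as-is. Names are hypothetical, not the imglib2/N5 API.
public class FlatSerialization {

    static byte[] serialize(double[] flatPositions) {
        ByteBuffer buf = ByteBuffer.allocate(flatPositions.length * Double.BYTES);
        buf.asDoubleBuffer().put(flatPositions);
        return buf.array();
    }

    static double[] deserialize(byte[] bytes) {
        double[] positions = new double[bytes.length / Double.BYTES];
        ByteBuffer.wrap(bytes).asDoubleBuffer().get(positions);
        return positions;
    }

    public static void main(String[] args) {
        double[] flat = { 1, 2, 3, 4, 5, 6 }; // two 3D points, flattened
        double[] roundTrip = deserialize(serialize(flat));
        System.out.println(Arrays.equals(flat, roundTrip)); // true
    }
}
```

No per-node object graph needs to be walked or reconstructed; deserialization is a single bulk read.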

StephanPreibisch commented 1 year ago

Just tested this using BigStitcher's interest point-based registration, works!

StephanPreibisch commented 1 year ago

check this out @minnerbe ... we can make the math how many points we can maximally support

tpietzsch commented 12 months ago

check this out @minnerbe ... we can make the math how many points we can maximally support

You should be able to use 2^31-8 points (2,147,483,640).

StephanPreibisch commented 12 months ago

Would it be too hard to back it with a RandomAccessibleInterval instead? ☺️

tpietzsch commented 12 months ago

Let's talk about it when you have used it for 2^31-8 points

minnerbe commented 12 months ago

Looks great! The number of points that can be stored depends on the storage layout; for FLAT, this is (2^31-8) / numDimensions (~700 million in 3D). Is there any (noticeable) performance difference between FLAT and NESTED, @tpietzsch? It seems like the improvement factors you provided were measured with the FLAT layout.

Would it be too hard to back it with a RandomAccessibleInterval instead? ☺️

Do you mean positions, values, or both? This doesn't look too hard, though, since there are dedicated wrappers for accessing both of those data structures.

Also, I noticed two things while glancing over the code:

  • Why does KDTreeData store the position array? It looks to me as if all access was done in KDTreeImpl, which also stores the positions. Imo, the conditional null assignment to positions or flatPositions in KDTreeData looks like the position arrays don't really belong there.

  • The methods in KDTreeImpl suggest to me that they return a KDTreeNode, when they only return an int. It would be more consistent to me to name them leftIndex, rightIndex and so on.

tpietzsch commented 12 months ago

Looks great! The number of points that can be stored depends on the storage layout; for FLAT, this is (2^31-8) / numDimensions (~700 million in 3D). Is there any (noticeable) performance difference between FLAT and NESTED, @tpietzsch? It seems like the improvement factors you provided were measured with the FLAT layout.

Yes, the improvements are with the FLAT layout. (You can experiment a bit for yourself with KDTreeBenchmark. The options for forcing NESTED layout are there for building KDTreeData somewhere, but not pulled all the way through to the KDTree constructor; it shouldn't be too hard to tweak.)

I did benchmarks for FLAT vs NESTED. Searches are a bit faster for FLAT, probably because it's better packed for caching.

I found this in some old notes. I think it could be branch kdtree-benchmark-variations in imglib2-ij repo:

Result "net.imglib2.kdtree.KDTreeBenchmark.nearestNeighborSearch":
Benchmark                              (numDataVertices)  Mode  Cnt     Score    Error  Units
KDTreeBenchmark.nearestNeighborSearch              10000  avgt    8     2,952 ±  0,008  ms/op
KDTreeBenchmark.nearestNeighborSearch             100000  avgt    8    12,747 ±  0,036  ms/op
KDTreeBenchmark.nearestNeighborSearch           10000000  avgt    8   521,957 ±  1,689  ms/op
KDTreeBenchmark.nearestNeighborSearch          100000000  avgt    8  2680,278 ± 13,609  ms/op

Result "net.imglib2.kdtree.KDTreeBenchmark.nearestNeighborSearchFlat":
Benchmark                                  (numDataVertices)  Mode  Cnt     Score    Error  Units
KDTreeBenchmark.nearestNeighborSearchFlat              10000  avgt    8     1,636 ±  0,048  ms/op
KDTreeBenchmark.nearestNeighborSearchFlat             100000  avgt    8     6,904 ±  0,112  ms/op
KDTreeBenchmark.nearestNeighborSearchFlat           10000000  avgt    8   291,768 ±  1,341  ms/op
KDTreeBenchmark.nearestNeighborSearchFlat          100000000  avgt    8  2437,730 ± 73,116  ms/op

But I don't remember what state this was (I would be surprised if the difference is still that much).

Would it be too hard to back it with a RandomAccessibleInterval instead? ☺️

Do you mean positions, values, or both? This doesn't look too hard, though, since there are dedicated wrappers for accessing both of those data structures.

Yes, it should be easy to do. I took the subtext of @StephanPreibisch's comment to be: "... because then we could use CellImg and have as many points as we want". And then the subtext of my reply would be: "... because, try to build a tree with 2G points and see how long that takes; it may take a prohibitively long time."

For using >2G points, besides backing the positions with a CellImg, we would need to change int indices to long in many places, and that would negatively impact performance for "normal" scenarios, so I would not do it just "because we can".

My preferred solution for >2G points would be a combination of an oct-tree with kd-trees in the leaves. This is more efficient to build (no sorting of the whole array of points, and quite easy to parallelize). And it should also be possible to speed up search for common scenarios (for example, for rendering we often query nearest neighbors for many nearby query points).

Also, I noticed two things while glancing over the code:

  • Why does KDTreeData store the position array? It looks to me as if all access was done in KDTreeImpl, which also stores the positions. Imo, the conditional null assignment to positions or flatPositions in KDTreeData looks like the position arrays don't really belong there.

KDTreeData has all data (positions arrays and values List/RandomAccessibleInterval). KDTreeData is what you would serialize/deserialize.

KDTreeImpl is the "coordinate part" of KDTree<T>, that is, no associated values, not typed on T. But all the "kdtree logic" happens on this level; NearestNeighborSearchOnKDTreeImpl is at the same level, for example. The parts above (KDTree<T>, NearestNeighborSearchOnKDTree<T>) are pretty shallow wrappers.

I think this split makes sense.

  • The methods in KDTreeImpl suggest to me that they return a KDTreeNode, when they only return an int. It would be more consistent to me to name them leftIndex, rightIndex and so on.

Yes, good point

minnerbe commented 12 months ago

Thanks for the (very) detailed answer!

I ran the benchmarks locally with 1M points:

Benchmark (Nested)                      Mode  Cnt    Score    Error  Units
KDTreeBenchmark.createKDTree            avgt    8  401.968 ± 24.802  ms/op
KDTreeBenchmark.kNearestNeighborSearch  avgt    8   98.354 ±  7.822  ms/op
KDTreeBenchmark.nearestNeighborSearch   avgt    8   67.467 ±  1.867  ms/op
KDTreeBenchmark.radiusNeighborSearch    avgt    8   70.195 ± 17.052  ms/op
Benchmark (Flat)                        Mode  Cnt    Score    Error  Units
KDTreeBenchmark.createKDTree            avgt    8  267.109 ±  7.922  ms/op
KDTreeBenchmark.kNearestNeighborSearch  avgt    8   82.138 ± 27.582  ms/op
KDTreeBenchmark.nearestNeighborSearch   avgt    8   62.839 ±  2.145  ms/op
KDTreeBenchmark.radiusNeighborSearch    avgt    8   64.165 ±  3.959  ms/op

You're absolutely right, the difference is not as stark anymore. However, it's still worth noting that the nested layout (almost) doubles the memory consumption in 3D (3×8 = 24 bytes vs 4 or 8 (reference, on systems with <32G / ≥32G, respectively) + 16 (array header) + 3×8 = 44 or 48 bytes per point).

For using >2G points besides backing the positions with a CellImg, we would need to change int indices to long in many places, and that would negatively impact performance for "normal" scenarios, so I would not just do it "because we can".

Thanks, that was the part that I overlooked! In my head, backing the positions with a CellImg didn't interfere with the existing implementations.

[...] I think this split makes sense.

I do see your point. Still, my feeling is that a split into an object holding the positions and one holding the values could also be beneficial, especially when adding a backing option for positions and/or values. Right now, the number of constructors of KDTreeData seems to be the product of the number of backing options for positions and values, respectively; however, all non-creation methods in KDTreeData only use positions or values, never both. This is, of course, just an outside perspective, so there might be arguments against my suggestion that I'm overlooking.

tpietzsch commented 12 months ago

I ran the benchmarks locally with 1M points:

Benchmark (Nested)                      Mode  Cnt    Score    Error  Units
KDTreeBenchmark.createKDTree            avgt    8  401.968 ± 24.802  ms/op
KDTreeBenchmark.kNearestNeighborSearch  avgt    8   98.354 ±  7.822  ms/op
KDTreeBenchmark.nearestNeighborSearch   avgt    8   67.467 ±  1.867  ms/op
KDTreeBenchmark.radiusNeighborSearch    avgt    8   70.195 ± 17.052  ms/op
Benchmark (Flat)                        Mode  Cnt    Score    Error  Units
KDTreeBenchmark.createKDTree            avgt    8  267.109 ±  7.922  ms/op
KDTreeBenchmark.kNearestNeighborSearch  avgt    8   82.138 ± 27.582  ms/op
KDTreeBenchmark.nearestNeighborSearch   avgt    8   62.839 ±  2.145  ms/op
KDTreeBenchmark.radiusNeighborSearch    avgt    8   64.165 ±  3.959  ms/op

Yes, cool! That looks as I would have expected. Which Java version did you benchmark it with?

You're absolutely right, the difference is not as stark anymore. However, it's still worth noting that the nested layout (almost) doubles the memory consumption in 3D (3×8 = 24 bytes vs 4 or 8 (reference, on systems with <32G / ≥32G, respectively) + 16 (array header) + 3×8 = 44 or 48 bytes per point).

I think you have the nesting order wrong. For 1000 3D points you would have a double[3][1000], i.e., one double[3][] array with 3 elements, each element being a double[1000] array. So the memory overhead is not much; it's more about cache locality. (And perhaps one more indirect lookup for getting the double[] for dimension d and then looking up the value, vs directly looking up the value in the flattened array. But I don't know how much of that is "speculated away".)
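The dimension-major nesting can be sketched as follows (indexing conventions are assumptions for illustration, not the exact imglib2 KDTreeData layout):

```java
// Sketch of FLAT vs NESTED coordinate storage for n-dimensional points.
// Indexing conventions here are illustrative assumptions, not the exact
// imglib2 KDTreeData layout.
public class CoordinateLayouts {

    // FLAT: one double[numDimensions * numPoints] array, point-major.
    static double getFlat(double[] flat, int numDimensions, int point, int d) {
        return flat[point * numDimensions + d];
    }

    // NESTED: one double[numPoints] array per dimension, i.e.
    // double[numDimensions][numPoints] -- only numDimensions small
    // inner arrays of overhead, not one array per point.
    static double getNested(double[][] nested, int point, int d) {
        return nested[d][point];
    }

    public static void main(String[] args) {
        // Two 3D points: (1,2,3) and (4,5,6).
        double[] flat = { 1, 2, 3, 4, 5, 6 };
        double[][] nested = { { 1, 4 }, { 2, 5 }, { 3, 6 } };
        System.out.println(getFlat(flat, 3, 1, 2));  // 6.0
        System.out.println(getNested(nested, 1, 2)); // 6.0
    }
}
```

With double[3][N] there are only 3 inner arrays total, so the per-point overhead is negligible; the double[N][3] layout assumed above would instead pay one object header and one reference per point.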

[...] I think this split makes sense.

I do see your point. Still, my feeling is that a split into an object holding the positions and one holding the values could also be beneficial, especially when adding a backing option for positions and/or values. Right now, the number of constructors of KDTreeData seems to be the product of the number of backing options for positions and values, respectively; however, all non-creation methods in KDTreeData only use positions or values, never both. This is, of course, just an outside perspective, so there might be arguments against my suggestion that I'm overlooking.

That is a good point. Splitting it into an object holding the positions and one holding the values makes sense. Probably having KDTreeData as a container for the two is still useful, but I'm not sure. Looking at the code again, it is also not ideal that KDTreeData does the lazy bounding box computation; that goes a bit against the "pure data container" idea, but maybe it's pragmatically okay.

If you want to play around with splitting up KDTreeData, that would be cool. There is more related work in https://github.com/imglib/imglib2-algorithm/tree/kdtree (more searches) and https://github.com/tpietzsch/imglib2-pointcloud (N5 serialization experiments). It would be cool if you pull these along for refactoring.

minnerbe commented 12 months ago

Yes, cool! That looks as I would have expected. Which Java version did you benchmark it with?

Azul Zulu JDK 17.

I think you have the nesting order wrong.

Jep, my bad. So, the memory overhead is negligible and the runtime is also only slightly worse. That's great!

Looking at the code again, it is also not ideal that KDTreeData does the lazy bounding box computation, that goes a bit against the "pure data container" idea, but also maybe it's pragmatically okay.

The "correct" thing (modeling-wise) would probably be to eagerly compute the bounding box and pass it to the constructor. But from a performance perspective, this is strictly worse than the current lazy computation, which is definitely "pragmatically okay". Also, KDTreeData seems like worth keeping for serialization?

If you want to play around with splitting up KDTreeData, that would be cool. There is more related work in https://github.com/imglib/imglib2-algorithm/tree/kdtree (more searches) and https://github.com/tpietzsch/imglib2-pointcloud (N5 serialization experiments). It would be cool if you pull these along for refactoring.

Sounds good, I'll give it a try in the next few days.

tpietzsch commented 3 months ago

I incorporated changes from #334