Closed: tpietzsch closed this 3 months ago
Just tested this using BigStitcher's interest point-based registration, works!
check this out @minnerbe ... we can do the math on how many points we can maximally support
You should be able to use 2^31-8 points (2,147,483,640).
Would it be too hard to back it with a RandomAccessibleInterval instead? ☺️
Let's talk about it when you have used it for 2^31-8 points
Looks great! The number of points that can be stored depends on the storage layout; for `FLAT`, this is (2^31-8) / numDimensions (~700M in 3D). Is there any (noticeable) performance difference between `FLAT` and `NESTED`, @tpietzsch? It seems like the improvement factors you provided were done with the `FLAT` layout.
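The limit arithmetic above can be sketched in plain Java. This is illustrative only (class and variable names are made up for the example): Java arrays are indexed by `int`, and most JVMs cap array length slightly below `Integer.MAX_VALUE`, with 2^31-8 being a common practical limit.

```java
// Illustrative sketch of the point-count limit: in the FLAT layout all
// coordinates live in one double[] of length n * numDimensions, so the
// maximum number of points is the array-length limit divided by the
// dimensionality. Names here are hypothetical, not imglib2 API.
public class MaxPointsDemo {
    public static void main(String[] args) {
        final long maxArrayLength = (1L << 31) - 8; // common JVM array-length cap: 2_147_483_640
        for (int numDimensions = 1; numDimensions <= 3; ++numDimensions) {
            final long maxPoints = maxArrayLength / numDimensions;
            System.out.println(numDimensions + "D: " + maxPoints + " points");
        }
        // For 3D this works out to roughly 715 million points, i.e. the "~700M" above.
    }
}
```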
> Would it be too hard to back it with a RandomAccessibleInterval instead? ☺️

Do you mean positions, values, or both? This doesn't look too hard, though, since there are dedicated wrappers for accessing both of those data structures.
Also, I noticed two things while glancing over the code:

- Why does `KDTreeData` store the position array? It looks to me as if all access was done in `KDTreeImpl`, which also stores the positions. Imo, the conditional `null` assignment to `positions` or `flatPositions` in `KDTreeData` looks like the position arrays don't really belong there.
- The methods in `KDTreeImpl` suggest to me that they return a `KDTreeNode`, when they only return an `int`. It would be more consistent to me to name them `leftIndex`, `rightIndex` and so on.

> Looks great! The number of points that can be stored depends on the storage layout; for `FLAT`, this is (2^31-8) / numDimensions (~700M in 3D). Is there any (noticeable) performance difference between `FLAT` and `NESTED`, @tpietzsch? It seems like the improvement factors you provided were done with the `FLAT` layout.
Yes, improvements are with the FLAT layout. (You can experiment for yourself a bit with KDTreeBenchmark. The options for forcing the NESTED layout are there for building KDTreeData somewhere, but not pulled all the way through to the KDTree constructor; it shouldn't be too hard to tweak.)
I did benchmarks for `FLAT` vs `NESTED`. Searches are a bit faster for `FLAT`, probably because it's better packed for caching.

I found this in some old notes. I think it could be branch `kdtree-benchmark-variations` in the `imglib2-ij` repo:
Result "net.imglib2.kdtree.KDTreeBenchmark.nearestNeighborSearch":

Benchmark                              (numDataVertices)  Mode  Cnt     Score     Error  Units
KDTreeBenchmark.nearestNeighborSearch              10000  avgt    8     2,952 ±   0,008  ms/op
KDTreeBenchmark.nearestNeighborSearch             100000  avgt    8    12,747 ±   0,036  ms/op
KDTreeBenchmark.nearestNeighborSearch           10000000  avgt    8   521,957 ±   1,689  ms/op
KDTreeBenchmark.nearestNeighborSearch          100000000  avgt    8  2680,278 ±  13,609  ms/op

Result "net.imglib2.kdtree.KDTreeBenchmark.nearestNeighborSearchFlat":

Benchmark                                  (numDataVertices)  Mode  Cnt     Score     Error  Units
KDTreeBenchmark.nearestNeighborSearchFlat              10000  avgt    8     1,636 ±   0,048  ms/op
KDTreeBenchmark.nearestNeighborSearchFlat             100000  avgt    8     6,904 ±   0,112  ms/op
KDTreeBenchmark.nearestNeighborSearchFlat           10000000  avgt    8   291,768 ±   1,341  ms/op
KDTreeBenchmark.nearestNeighborSearchFlat          100000000  avgt    8  2437,730 ±  73,116  ms/op
But I don't remember what state this was in (I would be surprised if the difference is still that much).
> Would it be too hard to back it with a RandomAccessibleInterval instead? ☺️
>
> Do you mean positions, values, or both? This doesn't look too hard, though, since there are dedicated wrappers for accessing both of those data structures.

Yes, it should be easy to do. I took the subtext of @StephanPreibisch's comment to be: "... because then we could use CellImg and have as many points as we want". And then the subtext of my reply would be "... because, try to build a tree with 2G points and see how long that takes; it may take a prohibitively long time."
For using >2G points, besides backing the positions with a CellImg, we would need to change `int` indices to `long` in many places, and that would negatively impact performance for "normal" scenarios, so I would not just do it "because we can".
My preferred solution for >2G points would be a combination of an oct-tree with kd-trees in the leaves. This is more efficient to build (no sorting of the whole array of points, quite easy to parallelize). And it should also be possible to speed up search for common scenarios (for example for rendering, we often query nearest neighbors for many nearby query points).
> Also, I noticed two things while glancing over the code:
>
> - Why does `KDTreeData` store the position array? It looks to me as if all access was done in `KDTreeImpl`, which also stores the positions. Imo, the conditional `null` assignment to `positions` or `flatPositions` in `KDTreeData` looks like the position arrays don't really belong there.
`KDTreeData` has all data (position arrays and values List/RandomAccessibleInterval). `KDTreeData` is what you would serialize/deserialize. `KDTreeImpl` is the "coordinate part" of `KDTree<T>`, that is, no values associated, not typed on `T`. But all the "kdtree logic" happens on this level; `NearestNeighborSearchOnKDTreeImpl` is at the same level, for example. The part above (`KDTree<T>`, `NearestNeighborSearchOnKDTree<T>`) consists of pretty shallow wrappers. I think this split makes sense.
> - The methods in `KDTreeImpl` suggest to me that they return a `KDTreeNode`, when they only return an `int`. It would be more consistent to me to name them `leftIndex`, `rightIndex` and so on.
Yes, good point
Thanks for the (very) detailed answer!
I ran the benchmarks locally with 1M points:

Benchmark (Nested)                      Mode  Cnt    Score    Error  Units
KDTreeBenchmark.createKDTree            avgt    8  401.968 ± 24.802  ms/op
KDTreeBenchmark.kNearestNeighborSearch  avgt    8   98.354 ±  7.822  ms/op
KDTreeBenchmark.nearestNeighborSearch   avgt    8   67.467 ±  1.867  ms/op
KDTreeBenchmark.radiusNeighborSearch    avgt    8   70.195 ± 17.052  ms/op

Benchmark (Flat)                        Mode  Cnt    Score    Error  Units
KDTreeBenchmark.createKDTree            avgt    8  267.109 ±  7.922  ms/op
KDTreeBenchmark.kNearestNeighborSearch  avgt    8   82.138 ± 27.582  ms/op
KDTreeBenchmark.nearestNeighborSearch   avgt    8   62.839 ±  2.145  ms/op
KDTreeBenchmark.radiusNeighborSearch    avgt    8   64.165 ±  3.959  ms/op
You're absolutely right, the difference is not as stark anymore. However, it's still worth noting that the nested layout (almost) doubles the memory consumption in 3D: 3×8 = 24 bytes vs 4 or 8 (reference, on systems with <32G / >32G heap, respectively) + 16 (array header) + 3×8 = 44 or 48 bytes per point.
> For using >2G points, besides backing the positions with a CellImg, we would need to change `int` indices to `long` in many places, and that would negatively impact performance for "normal" scenarios, so I would not just do it "because we can".

Thanks, that was the part that I overlooked! In my head, backing the positions with a CellImg didn't interfere with the existing implementations.
> [...] I think this split makes sense.

I do see your point. Still, my feeling is that a split into an object holding the positions and one holding the values could also be beneficial, especially when adding a backing option for positions and/or values. Right now, the number of constructors of `KDTreeData` seems to be a product of the number of backing options for positions and values, respectively; however, all non-creation methods in `KDTreeData` only use positions or values, never both. This is, of course, just from an outside perspective, so there might be arguments against my suggestions that I just overlook.
> I ran the benchmarks locally with 1M points:
>
> Benchmark (Nested)                      Mode  Cnt    Score    Error  Units
> KDTreeBenchmark.createKDTree            avgt    8  401.968 ± 24.802  ms/op
> KDTreeBenchmark.kNearestNeighborSearch  avgt    8   98.354 ±  7.822  ms/op
> KDTreeBenchmark.nearestNeighborSearch   avgt    8   67.467 ±  1.867  ms/op
> KDTreeBenchmark.radiusNeighborSearch    avgt    8   70.195 ± 17.052  ms/op
>
> Benchmark (Flat)                        Mode  Cnt    Score    Error  Units
> KDTreeBenchmark.createKDTree            avgt    8  267.109 ±  7.922  ms/op
> KDTreeBenchmark.kNearestNeighborSearch  avgt    8   82.138 ± 27.582  ms/op
> KDTreeBenchmark.nearestNeighborSearch   avgt    8   62.839 ±  2.145  ms/op
> KDTreeBenchmark.radiusNeighborSearch    avgt    8   64.165 ±  3.959  ms/op
Yes, cool! That looks as I would have expected. Which Java version did you benchmark it with?
> You're absolutely right, the difference is not as stark anymore. However, it's still worth noting that the nested layout (almost) doubles the memory consumption in 3D: 3×8 = 24 bytes vs 4 or 8 (reference, on systems with <32G / >32G heap, respectively) + 16 (array header) + 3×8 = 44 or 48 bytes per point.
I think you have the nesting order wrong. For 1000 3D points you would have a `double[3][1000]`, so one `double[3][]` array with 3 elements, and each element a `double[1000]` array. So the memory overhead is not much. It's more about cache locality. (And perhaps one more indirect lookup for getting the `double[]` for dimension `d` and then looking up the value, vs directly looking up the value in the flattened array. But I don't know how much of that is "speculated away".)
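To make the two layouts concrete, here is a minimal standalone sketch (not the imglib2 API; class and variable names are invented) contrasting the per-dimension NESTED layout with a packed FLAT layout:

```java
// Minimal sketch: NESTED keeps one double[] per dimension (double[d][n]),
// FLAT packs all coordinates of all points into a single double[].
// Names are illustrative only, not taken from imglib2.
public class LayoutDemo {
    public static void main(String[] args) {
        final int n = 1000, d = 3;
        final double[][] nested = new double[d][n]; // d arrays of length n: small fixed overhead
        final double[] flat = new double[d * n];    // one packed array

        // store point i = (i, 2i, 3i) in both layouts
        for (int i = 0; i < n; ++i) {
            for (int k = 0; k < d; ++k) {
                nested[k][i] = (k + 1) * i;    // two lookups: nested[k], then element i
                flat[i * d + k] = (k + 1) * i; // one lookup at a computed offset
            }
        }

        // both layouts hold the same coordinate, e.g. dimension 2 of point 10
        System.out.println(nested[2][10] + " == " + flat[10 * d + 2]);
    }
}
```

Whether the FLAT layout interleaves per point (`i * d + k`, as assumed here) or concatenates per dimension is an implementation detail; the point of the sketch is only the one-indirection vs two-indirection access pattern discussed above.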
> [...] I think this split makes sense.
>
> I do see your point. Still, my feeling is that a split into an object holding the positions and one holding the values could also be beneficial, especially when adding a backing option for positions and/or values. Right now, the number of constructors of `KDTreeData` seems to be a product of the number of backing options for positions and values, respectively; however, all non-creation methods in `KDTreeData` only use positions or values, never both.
That is a good point. Splitting into an object holding the positions and one holding the values makes sense. Probably having `KDTreeData` as a container for the two is still useful, but I'm not sure. Looking at the code again, it is also not ideal that `KDTreeData` does the lazy bounding box computation; that goes a bit against the "pure data container" idea, but maybe it's pragmatically okay.

If you want to play around with splitting up `KDTreeData`, that would be cool. There is more related work in https://github.com/imglib/imglib2-algorithm/tree/kdtree (more searches) and https://github.com/tpietzsch/imglib2-pointcloud (N5 serialization experiments). It would be cool if you pull these along for refactoring.
> Yes, cool! That looks as I would have expected. Which Java version did you benchmark it with?

Azul Zulu JDK 17.
> I think you have the nesting order wrong.

Jep, my bad. So, the memory overhead is negligible and the runtime is also only slightly worse. That's great!
> Looking at the code again, it is also not ideal that KDTreeData does the lazy bounding box computation, that goes a bit against the "pure data container" idea, but also maybe it's pragmatically okay.

The "correct" thing (modeling-wise) would probably be to eagerly compute the bounding box and pass it to the constructor. But from a performance perspective, this is strictly worse than the current lazy computation, which is definitely "pragmatically okay". Also, `KDTreeData` seems worth keeping for serialization?
> If you want to play around with splitting up KDTreeData, that would be cool. There is more related work in https://github.com/imglib/imglib2-algorithm/tree/kdtree (more searches) and https://github.com/tpietzsch/imglib2-pointcloud (N5 serialization experiments). It would be cool if you pull these along for refactoring.

Sounds good, I'll give it a try in the next few days.
I incorporated changes from #334
This PR revises the implementation of KDTree.

The old implementation created `KDTreeNode` objects for each point, with references to children to represent the tree structure.

The new implementation does not store explicit links. Instead, coordinates of all points are stored in a flat `double[]` array (or alternatively one `double[]` array for each dimension). The tree structure is represented purely by the order of points in the array.

The nodes in the tree are arranged in Eytzinger layout (children of `i` are at `2i` and `2i+1`). Additionally, pivot indices are chosen such that "leaf layers" are filled from the left. For example, 10 nodes will always be arranged like this:

[figure: example arrangement of 10 nodes with the bottom layer filled from the left]

never like this:

[figure: counter-example arrangement of 10 nodes]

By choosing pivots in this way, the tree structure is fully determined. For every node index, the child indices can be calculated without dependent reads. And iff the calculated child index is less than the number of nodes, the child exists.
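The implicit child computation described above can be sketched in a few lines. This is a simplified illustration, not the imglib2 code: it uses 0-based indices (children of `i` at `2i+1` and `2i+2`, the 0-based equivalent of the 1-based `2i`/`2i+1` convention in the PR description), and all names are invented for the example.

```java
// Sketch of an implicit Eytzinger-layout tree: no child pointers are stored;
// child indices are computed, and a child exists iff its index is in range.
public class EytzingerDemo {
    static int left(int i)  { return 2 * i + 1; } // 0-based convention
    static int right(int i) { return 2 * i + 2; }

    public static void main(String[] args) {
        final int numNodes = 10;
        for (int i = 0; i < numNodes; ++i) {
            // Existence test needs no memory access into the tree at all,
            // only a comparison against the node count.
            final boolean hasLeft = left(i) < numNodes;
            final boolean hasRight = right(i) < numNodes;
            System.out.println("node " + i + ": left=" + hasLeft + ", right=" + hasRight);
        }
    }
}
```

With 10 nodes, this yields exactly one partially filled node (the parent of the last leaf), which is what "leaf layers filled from the left" guarantees.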
**Less memory overhead**

This has considerably less memory overhead. For example, we store 1,000,000 3D `RealPoint`s into a KDTree. The list of points (the ArrayList object and the RealPoint objects) requires ~82MB of memory. The old KDTree added ~90MB on top of that, just for storing the tree structure. Despite copying the point coordinates, the new KDTree only adds ~22MB (which is basically just the `double[3_000_000]` array required for storing the coordinates).

**Better performance**

Performance of building trees is 2x better than the old implementation. Neighbor searches are slightly faster. By using a different kthElement implementation, this PR also resolves #326 (building the tree is fast, also in the case that many points have identical coordinates in a particular dimension).
**Easy Serialization**

Note that these `KDTree`s can be serialized very efficiently (for example into N5), by just storing/loading the flattened coordinate array as is (avoiding the overhead of serializing the object structure of the interlinked `KDTreeNode`s).
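To illustrate why serialization becomes trivial: since the tree structure is fully implied by array order, persisting the coordinates amounts to writing the flat `double[]` verbatim. The sketch below uses plain `java.io` rather than N5, and all names are illustrative, not imglib2 or N5 API.

```java
import java.io.*;
import java.util.Arrays;

// Hedged sketch: round-trip a flat coordinate array through a file.
// No node objects, no object graph; the byte stream is just length + doubles.
public class FlatSerializationDemo {
    static void write(double[] flat, File f) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)))) {
            out.writeInt(flat.length);
            for (double v : flat) out.writeDouble(v);
        }
    }

    static double[] read(File f) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(f)))) {
            final double[] flat = new double[in.readInt()];
            for (int i = 0; i < flat.length; ++i) flat[i] = in.readDouble();
            return flat;
        }
    }

    public static void main(String[] args) throws IOException {
        final double[] coords = { 1, 2, 3, 4, 5, 6 }; // e.g. two 3D points
        final File f = File.createTempFile("kdtree", ".bin");
        write(coords, f);
        System.out.println(Arrays.equals(coords, read(f)));
        f.delete();
    }
}
```

An N5 container would store the same flat array as a chunked dataset, but the principle is identical: the coordinate array is the complete serialized form of the tree.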