Feature request: allow other predicates for determining leaf nodes than node capacity

asinghvi17 commented 4 days ago

I want to partition a dataset of geometries into, finally, leaf nodes whose areas are approximately the same. I don't really care about the number of geometries per node, since my problem is I/O on the raster dataset I'm using.

My understanding is that this would need a custom predicate that decides when to split, and it could receive a vector of the extents of the geometries as input.

Would there be interest in a PR that adds this feature? I could code it up, but curious to hear your thoughts @maxfreu.

maxfreu commented 3 days ago

Hi! :) Why do you think this would increase the io perf on your ds? Can you explain a bit more where the bottleneck is?

asinghvi17 commented 3 days ago

I did a bit more investigating, it seems my problem is that the extents of my STRtree's leaf nodes look like this:

download-9

(note the large y-extents when it doesn't really seem necessary)

asinghvi17 commented 3 days ago

I have a problem where I need to compute a statistic across a large, lazy, very slow to load raster (a 30m height map) over 250,000 polygons. I want to partition these polygons in space such that I can load sufficiently small segments of the raster, process that over the polygons in that segment, and then move on to the next segment. This drastically increases my processing times and can be made generic to essentially any dataset.

This means that I basically do a depth-first search across a tree, and stop the search for a particular path down the tree if I either encounter a leaf node, or a node with small enough extent that it would fit into memory.

The problem is that if the leaf node is too large, I can't fit my data into memory, and so I end up needing to process each polygon individually and load data within its bounding box from the raster. This is a time-consuming and painful operation.

For reference, if I manually map the computation across 2x2 degree segments of the raster, I can get this done in an hour. Using the tiling approach, it takes me around 11 hours for the same thing.

maxfreu commented 2 days ago

Ok, I see. The striping is bad indeed. Maybe an R* tree would perform better, might be worth to look into https://infolab.usc.edu/csci599/Fall2001/paper/rstar-tree.pdf

R* trees seem to look less striped (best viewed with white background):

Actually, I'm not an expert in spatial trees, just coded this up in an afternoon cause I needed some minimal solution. So there could also be a bug in the splitting logic, but I'm pretty sure I nailed https://apps.dtic.mil/sti/tr/pdf/ADA324493.pdf Feel free to check!

I don't know what the best way forward is:

Implement and try the R* tree
Implement custom, area-based splitting heuristic
Put things in separate package or here?

The easiest way to achieve what you need would probably be to implement the custom predicate you mentioned. Maybe like so: 1) Ingest all objects and their extents 2) Sort them by x center 3) Aggregate objects until they cover at most a set maximum x range 4) Sort these objects by y center 5) Make new leaf node with these objects and aggregate again, until they cover at most a y range that is computed based on the chosen x split. Then start the next leaf node until all objects for this x split are consumed.

Or how did you imagine it?

If you can come up with a clean idea how different splitting logics could be implemented, I wouldn't hold up the merge ;-) However, this sounds like it's rather something for a more abstract package like an imaginary "SpatialTrees".

maxfreu / SortTileRecursiveTree.jl

Feature request: allow other predicates for determining leaf nodes than node capacity #6