jbuckmccready / static_aabb2d_index

Fast static 2D axis aligned bounding box index
Apache License 2.0
29 stars 1 forks source link

Zero-copy compatibility with the JS implementation? #3

Open kylebarron opened 1 year ago

kylebarron commented 1 year ago

I assume the answer will be no, in which case feel free to close this issue.

I'm interested in use cases where I can share the index data with other implementations, such as flatbush in the browser or a python flatbush implementation. In the original JS implementation, the entire index is self-contained in a single buffer, which allows it to be shared both across web workers in the browser, but also with e.g. a rust-wasm implementation.

I see in your implementation you use a sensible rust approach that uses rust-native types. There would probably need to be significant changes to the implementation to use a single backing buffer, which you understandably might not be interested in, for only the prospect of future FFI potentials. Any thoughts?

jbuckmccready commented 1 year ago

Hmm, I don't have a use case for this but I understand the utility if trying to interop with other languages using the same library in the same memory space. I think it could be done in Rust using the bytemuck crate (https://docs.rs/bytemuck/latest/bytemuck/), create a buffer of bytes and then cast subslices of the buffer to the different types (number type and index type) as needed. It would require additional constraints on the generic traits and add a dependency on bytemuck.

I am open to pull requests that add this functionality as an additional module behind a feature flag (to avoid bytemuck dependency if not needed). The module would contain new spatial index types which utilize a single byte buffer, have different generic number constraints to work with bytemuck (or possibly not generic at all?), and share the same core algorithm (spatial index/math functions could be shared between the the modules where possible).

If you are wanting to share the data using the same format but don't need to have it zero copy in the same memory space then writing a serializer/deserializer to/from a byte buffer in the same format would be simpler.

kylebarron commented 1 year ago

Thanks for the reply! If you don't have an FFI-related use case, it might be simpler for me to prototype the library from scratch instead of trying to shoehorn it into how you've written the library so far. And then maybe at some point in the future once my version is implemented, we can see whether it makes sense to keep them as two libraries or combine them.

If you are wanting to share the data using the same format but don't need to have it zero copy in the same memory space then writing a serializer/deserializer to/from a byte buffer in the same format would be simpler.

Yeah definitely, but I'm really focused on applications that share the same memory space (e.g. for rust-python), so even though zero-copy will be more work, I'm inclined towards that.

jbuckmccready commented 1 year ago

EDIT: Removed most of this reply because it was already answered!

I am curious what your use case is that causes the non-zero copy solution to be noticeable/the slow step, and if you do pursue the zero copy byte reinterpretation approach I'd be curious to see how it works out, cool project.

kylebarron commented 1 year ago

I'm trying to build out an open ecosystem for extensible, modular geospatial analytical data processing. I'm excited about Rust to speed up Python and JS via compiled extension modules. It's true that you can create Python bindings to a Rust library, have Rust manage the memory, and never need to worry about zero-copy. But when someone else writes a C library that would like to interface with your data, if you don't have an ABI-stable way to share the data, you need to serialize it and they need to deserialize it.

For example, In Python, Shapely (and by extension the C library GEOS) is used for most geospatial data storage. But separate Python libraries with C extensions can't use the same GEOS memory because the underlying storage isn't ABI-stable. So there has to be a serde step in between.

Apache Arrow solves this problem for generic tabular data, because it defines a language-independent, ABI-stable memory layout. So you can move memory between Python/Rust/C just by changing ownership of the pointer (e.g. Polars does this from Rust). GeoArrow, a WIP spec I'm a part of, builds on top of Arrow and solves that problem for geometries, so you can share an array of geometries between Python/Rust/C for free. (I've been working on a rust implementation of this).

But it's very useful to be able to share large spatial data, declare that the data is already spatially ordered, and share a spatial index for free. Of the RTree implementations I'm aware of, the only one that's fast, compact, and has an ABI-stable memory layout is the Flatbush algorithm.

My time horizon is pretty long... I'm planning on 2-5 years to maturity of GeoArrow. But watching how much innovation in the non-spatial realm is happening from Arrow, I think there's massive potential in a geospatial zero-copy ecosystem. And given that spatial indexes are absolutely core to geospatial algorithms, it's worth investing in a zero-copy rtree implementation.