rapidsai / cuspatial

CUDA-accelerated GIS and spatiotemporal algorithms
https://docs.rapids.ai/api/cuspatial/stable/
Apache License 2.0

[FEA] Thrust, Cython, and Python to convert `wkb` format to GeoArrow #601

Open thomcom opened 2 years ago

thomcom commented 2 years ago

Is your feature request related to a problem? Please describe.

I've identified an opportunity for a massive speedup in cuspatial I/O using the WKB format. It'll also be more language/API agnostic than the existing GeoPandas and Shapely I/O that we support.

Currently we can convert shapefiles of polygons into GeoArrow format on GPU, and we can convert GeoPandas dataframes into GeoArrow. GeoPandas dataframes have to pass through the host first, and the serialization of those dataframes can be pretty slow.

Aside from these two methods, getting any FeatureSet into GeoArrow format for use with our algorithms can be challenging for the user.

Describe the solution you'd like

WKB is a nearly universal geometry representation in spatial databases. It doesn't lend itself trivially to parallelization because each Feature (Point, MultiPoint, MultiPolygon, etc.) consists of a variable-length header followed by 64-bit floating-point coordinate data.

We could write a GPU-based WKB parser that takes individual WKB features, constructs the correct GeoArrow offset buffers, and packs the coordinate bytes into the correct position in each coordinate buffer.

It would need to take a list of WKB buffers, iterate over them to create the offsets buffers via a scan algorithm, and then pass each object to a kernel along with its known offset values. The kernel would copy the remaining coordinates into the right place in the GeoArrow buffers.
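To make the scan-then-copy idea concrete, here is a minimal host-side sketch in plain Python. It is not cuspatial code: the function names (`read_header`, `wkb_to_offsets_and_coords`, `make_linestring_wkb`) are hypothetical, it only handles 2D Point and LineString features, and it assumes an interleaved `xy` coordinate buffer. The header pass and the copy loop stand in for what would be a scan and a per-feature kernel on the GPU.

```python
import struct
from itertools import accumulate

WKB_POINT, WKB_LINESTRING = 1, 2  # geometry type codes from the WKB spec


def make_linestring_wkb(coords):
    """Demo helper (hypothetical): encode [(x, y), ...] as little-endian WKB."""
    out = struct.pack("<BII", 1, WKB_LINESTRING, len(coords))
    for x, y in coords:
        out += struct.pack("<dd", x, y)
    return out


def read_header(buf):
    """Return (endian_prefix, point_count, coords_start) for one WKB feature."""
    endian = "<" if buf[0] == 1 else ">"  # byte 0: 1 = little endian, 0 = big
    (geom_type,) = struct.unpack_from(endian + "I", buf, 1)
    if geom_type == WKB_POINT:
        return endian, 1, 5  # a Point has no count field
    if geom_type == WKB_LINESTRING:
        (n,) = struct.unpack_from(endian + "I", buf, 5)
        return endian, n, 9
    raise NotImplementedError(f"geometry type {geom_type} not handled in this sketch")


def wkb_to_offsets_and_coords(buffers):
    """Build a GeoArrow-style offsets buffer and an interleaved xy buffer."""
    headers = [read_header(b) for b in buffers]
    counts = [n for _, n, _ in headers]
    # Scan over the per-feature point counts gives each feature's start offset.
    offsets = [0] + list(accumulate(counts))
    xy = [0.0] * (2 * offsets[-1])
    # Each iteration writes a disjoint slice, so on a GPU this loop could be
    # one thread (or block) per feature.
    for buf, (endian, n, start), off in zip(buffers, headers, offsets):
        xy[2 * off : 2 * (off + n)] = struct.unpack_from(f"{endian}{2 * n}d", buf, start)
    return offsets, xy
```

For example, passing two LineStrings with 2 and 3 points yields offsets `[0, 2, 5]` and a 10-element `xy` buffer. Polygons and Multi* types would need one extra level of offsets per nesting level, which this sketch leaves out.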

We'd also like the reverse: taking a GeoArrow object composed of an Arrow DenseUnion object with input_types, offsets, points, mpoints, linestrings, and polygons buffers and converting it into WKB format, returning a set of buffers that can be quickly converted into any geometry database format.
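The reverse direction has the same shape: the output size of every feature is known from the offsets, so a scan over byte sizes tells each feature where to write. A minimal sketch, again in plain Python with a hypothetical function name (`linestrings_to_wkb`), covering only the linestrings buffer of the union and assuming interleaved `xy` coordinates and little-endian output:

```python
import struct
from itertools import accumulate

WKB_LINESTRING = 2  # geometry type code from the WKB spec


def linestrings_to_wkb(offsets, xy):
    """Serialize GeoArrow-style linestrings into one packed WKB byte buffer.

    Returns (wkb_bytes, byte_offsets); feature i occupies
    wkb_bytes[byte_offsets[i]:byte_offsets[i + 1]].
    """
    counts = [offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)]
    # 1 byte order + 4 type + 4 point count, then 16 bytes per (x, y) pair.
    sizes = [9 + 16 * n for n in counts]
    byte_offsets = [0] + list(accumulate(sizes))  # scan over output sizes
    out = bytearray(byte_offsets[-1])
    # Each feature writes a disjoint byte range, so this loop could run as
    # one thread (or block) per feature on a GPU.
    for i, n in enumerate(counts):
        start = offsets[i]
        struct.pack_into(
            f"<BII{2 * n}d", out, byte_offsets[i],
            1, WKB_LINESTRING, n, *xy[2 * start : 2 * (start + n)],
        )
    return bytes(out), byte_offsets
```

A full implementation would dispatch on the union's type buffer and emit the nested headers for polygons and Multi* types; that part is not shown here.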

zhangjianting commented 2 years ago

Is the idea also applicable to WKT and/or GeoJSON? Since cuDF already has a JSON parser on GPUs, could it be easier to begin with GeoJSON?

thomcom commented 2 years ago

We're working on the JSON parser now; hopefully it matures soon. For the present, we don't have GPU-based geometry parsing outside of a particular method of parsing very consistent files, like those described in the FeatureCollection example on Wikipedia. Parsing those files can be implemented by carefully slicing the input file on GPU and then parsing it using the fromJsonObject methods. It isn't a one-shot solution by any means.

A WKB parser would make a WKT round-trip easier. However, parsing WKT would involve substantially more complicated string processing and would likely be a lot slower than using the offsets and buffer sizes that are fundamental to WKB and GeoArrow.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.