rapidsai / cuspatial

CUDA-accelerated GIS and spatiotemporal algorithms
https://docs.rapids.ai/api/cuspatial/stable/
Apache License 2.0

[FEA]: Improve pyarrow integration/IO performance using geoarrow-python #1288

Open paleolimbot opened 1 year ago

paleolimbot commented 1 year ago

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem you would like to solve.

Now that geoarrow-pyarrow ( https://github.com/geoarrow/geoarrow-python ) is available and the GeoArrow specification has an initial 0.1 release, there are potential synergies we may be able to leverage given the common memory layout! Basically, geoarrow-pyarrow implements a pyarrow.DataType subclass for geometry with a type-level place to store the coordinate reference system. It would be very cool if cudf.Series.from_arrow() could handle these (or whatever the best interface is from your end).
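For readers unfamiliar with the "common memory layout" mentioned above, here is a minimal pure-Python sketch of how a linestring column can be stored as one interleaved xy buffer plus an offsets buffer. The helper names `linestrings_to_buffers` and `buffers_to_linestrings` are hypothetical and exist only for illustration; real implementations use contiguous Arrow or device buffers rather than Python lists.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def linestrings_to_buffers(lines: List[List[Point]]) -> Tuple[List[float], List[int]]:
    """Flatten linestrings into an interleaved xy buffer plus coordinate offsets."""
    xy: List[float] = []
    offsets: List[int] = [0]
    for line in lines:
        for x, y in line:
            xy.extend((x, y))
        # Offsets count coordinates (xy pairs), not individual floats.
        offsets.append(offsets[-1] + len(line))
    return xy, offsets


def buffers_to_linestrings(xy: List[float], offsets: List[int]) -> List[List[Point]]:
    """Rebuild the linestrings from the flat buffers."""
    lines = []
    for start, end in zip(offsets[:-1], offsets[1:]):
        lines.append([(xy[2 * i], xy[2 * i + 1]) for i in range(start, end)])
    return lines


lines = [[(0.0, 0.0), (1.0, 1.0)], [(2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]]
xy, offsets = linestrings_to_buffers(lines)
```

Because both sides agree on this flat representation, handing geometry to the GPU can in principle be a buffer copy rather than a per-geometry conversion.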

I also think it has the potential to significantly speed up IO compared with the current geopandas.read_file() + cuspatial.GeoSeries.from_geopandas() path (as a rough estimate from the musings below, assembling linestrings from a large-ish FlatGeobuf file was about 20x faster).

Happy to implement anything in geoarrow-c or geoarrow-python that makes this easier! We're slowly working on getting both on conda-forge (they're on pip already).

Describe any alternatives you have considered

The closest thing that currently provides this functionality is from_geopandas(), with Shapely's to_ragged_array and from_ragged_array also providing similar buffer building/parsing capability.

Additional context

Some musings with a large-ish linestring dataset (with apologies if I'm missing some obvious usage I should be aware of):

# Get the data in .fgb form
# ! curl -L https://github.com/geoarrow/geoarrow-data/releases/download/v0.1.0/ns-water-water_line.fgb.zip \
#     -o ns-water-water_line.fgb.zip
# ! unzip ns-water-water_line.fgb.zip
# pip install geoarrow-pyarrow
import cudf
import cuspatial
import geopandas
import geoarrow.pyarrow as ga
from geoarrow.pyarrow import io
import pyarrow as pa

host_table = io.read_pyogrio_table("ns-water-water_line.fgb")
#> 0.4 sec
#> Would be great if this worked!
#> cudf.Series.from_arrow(host_table["wkb_geometry"])
#> CUDF failure at:/opt/conda/conda-bld/work/cpp/src/interop/from_arrow.cu:87: Unsupported type_id conversion to cudf

# Workaround
def geoarrow_to_cuspatial(arr):
    arr = ga.as_geoarrow(arr, coord_type=ga.CoordType.INTERLEAVED)
    validity, part_offset, geometry_offset, xy = arr.geobuffers()
    assert validity is None # null geometries not supported?
    assert arr.offset == 0 # slices not reflected in geobuffers currently
    return cuspatial.GeoSeries.from_linestrings_xy(xy, geometry_offset, part_offset)

chunks = [geoarrow_to_cuspatial(chunk) for chunk in host_table["wkb_geometry"].chunks]
#> 0.7 sec
#> Can't seem to concatenate to get a contiguous array for direct comparison
#> gpu_geom2 = cudf.concat(chunks)
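On the concatenation problem noted above: joining offset-based chunks means appending the coordinate buffers and shifting each chunk's offsets by the number of coordinates already written. The following is a host-side, pure-Python sketch of that idea only. `concat_ragged` is a hypothetical helper, not a cudf or cuspatial API, and it assumes every chunk's offsets start at 0 (the same assumption as the `arr.offset == 0` assertion in the workaround).

```python
from typing import Iterable, List, Tuple


def concat_ragged(chunks: Iterable[Tuple[List[float], List[int]]]) -> Tuple[List[float], List[int]]:
    """Concatenate (xy, offsets) chunks into one contiguous (xy, offsets) pair."""
    out_xy: List[float] = []
    out_offsets: List[int] = [0]
    for xy, offsets in chunks:
        base = out_offsets[-1]  # number of coordinates written so far
        out_xy.extend(xy)
        # Drop each chunk's leading 0 and shift the rest by the running base.
        out_offsets.extend(base + o for o in offsets[1:])
    return out_xy, out_offsets


chunk_a = ([0.0, 0.0, 1.0, 1.0], [0, 2])
chunk_b = ([2.0, 2.0, 3.0, 3.0, 4.0, 4.0], [0, 1, 3])
all_xy, all_offsets = concat_ragged([chunk_a, chunk_b])
```

With array libraries the same rebasing would be a vectorized add on each offsets buffer before concatenation. A two-level layout (part offsets plus geometry offsets) needs the shift applied at each level.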

There are more example datasets at https://geoarrow.org/data as well (although I'm sure you have many internally as well).

GPUtester commented 1 year ago

Hi @paleolimbot!

Thanks for submitting this issue - our team has been notified and we'll get back to you as soon as we can! In the meantime, feel free to add any relevant information to this issue.

harrism commented 1 year ago

Thanks for the feature request. @paleolimbot where is the CRS in the example?

paleolimbot commented 1 year ago

It's a property of the (Arrow) type!

from geoarrow.pyarrow import io

tbl = io.read_pyogrio_table("/vsizip/vsicurl/https://github.com/geoarrow/geoarrow-data/releases/download/v0.1.0/ns-water-basin_point.fgb.zip")
tbl["wkb_geometry"].type.crs
#> '{"$schema":"https://proj.org/schemas/v0.7/projjson.schema.json","type":"Projected...

The full serialization of the type is described in the 'extension types' section ( https://github.com/geoarrow/geoarrow/blob/main/extension-types.md ), and you can access it using type.__arrow_ext_serialize__() (e.g., tbl["wkb_geometry"].type.__arrow_ext_serialize__() above). The CRS is the main thing that's in the serialization.
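As a rough sketch of what a consumer could do with those serialized bytes: the GeoArrow extension metadata is a UTF-8 JSON object, so the CRS can be pulled out with the standard library alone. The function name `crs_from_extension_metadata` and the example payload are hypothetical; the real bytes would come from `type.__arrow_ext_serialize__()`.

```python
import json
from typing import Any, Optional


def crs_from_extension_metadata(serialized: bytes) -> Optional[Any]:
    """Return the "crs" entry from GeoArrow extension metadata, if present."""
    if not serialized:
        return None
    metadata = json.loads(serialized.decode("utf-8"))
    crs = metadata.get("crs")
    # The CRS may be an embedded JSON object or a string; if the string
    # itself looks like a JSON object (e.g. PROJJSON), decode it as well.
    if isinstance(crs, str) and crs.lstrip().startswith("{"):
        return json.loads(crs)
    return crs


# Hypothetical payload standing in for type.__arrow_ext_serialize__()
example = json.dumps({"crs": {"type": "ProjectedCRS", "name": "hypothetical"}}).encode("utf-8")
parsed = crs_from_extension_metadata(example)
```

Keeping the CRS on the type rather than in a side channel means it travels with the column through any Arrow-aware IO path.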

thomcom commented 1 year ago

Hey @paleolimbot! Thanks for the update. I've been following your geoarrow work for a long while and am pretty excited to integrate it. A few months ago, before geoarrow.pyarrow existed, I wrote a simple wrapper that pulled the offset buffers and was able to construct cuspatial data from them easily and quickly. We will definitely be integrating your work. Is it available as a dependency in pip, yet?

paleolimbot commented 1 year ago

Is it available as a dependency in pip, yet?

Yes! pip install geoarrow-pyarrow should do it. I have the lower-level geoarrow-c on conda-forge and will submit the PR to add geoarrow-pyarrow in the next few days.