rerun-io / rerun

Visualize streams of multimodal data. Free, fast, easy to use, and simple to integrate. Built in Rust.
https://rerun.io/
Apache License 2.0
6.66k stars 336 forks source link

Improve and document our GeoArrow compatibility story #8123

Open abey79 opened 1 week ago

abey79 commented 1 week ago

Fundamentally, the new map-related archetypes are interoperable with GeoArrow in an efficient way. The way to achieve this should be documented, and helpers should be introduced.

Here is a mostly self-explanatory script showing ways to efficiently log geoarrow data, interoperate with GeoPandas, and a hint of what future helpers could look like:

from __future__ import annotations

from typing import Iterable

import geoarrow.pyarrow as ga
from geopandas import GeoSeries
import pyarrow as pa

import rerun as rr

# Let's assume we have some GeoArrow data coming from somewhere
# IMPORTANT: This is only points. Things get a bit more complicated if we also introduce lines and polygons (the latter
# of which we don't support yet).
data = ga.as_geoarrow(
    ["MULTIPOINT((37.7749 -122.4194), (34.0522 -118.2437), (32.7157 -117.1611))"],
    coord_type=ga.CoordType.INTERLEAVED,  # Rerun needs interleaved coordinates for zero copy
)

rr.init("rerun_example_raw_geoarrow")
rr.connect_tcp()

# ======================================================================================================================
# Scenario 1: zero-copy logging of raw data in GeoArrow format to

# convert the data to a numpy array of the right shape without copy
data_as_np_array = data.storage.values.values.to_numpy(zero_copy_only=True).reshape(-1, 2)
rr.log("points", rr.GeoPoints(lat_lon=data_as_np_array))

# ======================================================================================================================
# Scenario 2: GeoPandas interoperability

# Load the GeoArrow data into a GeoPandas series (there are many other ways to load data to a GeoSeries)
series = GeoSeries.from_arrow(data)

rr.log("geopandas_points", rr.GeoPoints(lat_lon=series.get_coordinates().to_numpy(copy=False)))

# ======================================================================================================================
# Scenario 3: logging wrapper for GeoArrow data (we should definitely provide these in the SDK eventually)

class GeoArrowLatLonBatch(rr.ComponentBatchLike):
    def __init__(self, lat_lon: pa.Array) -> None:
        # TODO: validate input data format
        self.lat_lon = lat_lon

    def component_name(self) -> str:
        return "rerun.components.LatLon"

    def as_arrow_array(self) -> pa.Array:
        return self.lat_lon.storage.values.cast(rr.components.LatLon.arrow_type())

class GeoPointsWrapper:
    lat_lon: pa.Array

    def __init__(self, lat_lon: pa.Array) -> None:
        self.lat_lon = lat_lon

    def as_component_batches(self) -> Iterable[rr.ComponentBatchLike]:
        return [rr.GeoPoints.indicator(), GeoArrowLatLonBatch(lat_lon=self.lat_lon)]

rr.log("wrapper_points", GeoPointsWrapper(data))
kylebarron commented 1 day ago

👋 I've vaguely followed rerun for a while and it's really cool to see you adding spatial support!

The general structure of your data does indeed match the structure of GeoArrow points, but your data does not currently match the GeoArrow specification because the GeoArrow specification explicitly disallows latitude-longitude ordering. From the spec:

Note that regardless of the axis order specified by the CRS, axis order will be interpreted according to the wording in the GeoPackage WKB binary encoding: axis order is always (longitude, latitude) and (easting, northing) regardless of the the axis order encoded in the CRS specification.

Confusion around coordinate axis order is almost a trope in the community. (I'd encourage you to switch to lon-lat if at all possible)

Separately, technically GeoArrow currently requires float64 coordinates, and you use float32. The GeoArrow spec is pretty complex as it is, and it's unclear whether we'd want to include float32 as part of the spec, but I don't think it would be controversial to recommend that tools support both float32 or float64 input, and then float32 and float64 would at least be interoperable.

(For context, I'm developing the Rust GeoArrow implementation, which also has Python bindings. Your use of geoarrow.pyarrow is built on the alternate C source.)

jleibs commented 1 day ago

Hi @kylebarron -- excited to see you here and thanks for bringing this to our attention. Hopefully we can get ahead of this both on the format and documentation front and save our users some pain and confusion.

For what it's worth I actually tried to look this up in the https://geoarrow.org/format.html spec and not finding any mention of latitude or longitude, but knowing that geoarrow allowed providing the CRS, I naively jumped to the conclusion that the CRS-specified order would be respected.

I had not even imagined that a spec might both provide a means of including CRS data and also explicitly disregard the axis-order the CRS specifies. Though the fact that the chain of transitively-cited links ultimately terminates in the gem of OGC 08-038r7 Case 4 and the final clause "The OGC does not recommend implementing this use case" at least makes me feel better about my incorrect assumption :sweat_smile:

Although I can very much relate to the pragmatic reasons geoarrow went down this path, I'm somewhat reluctant to fully propagate these Case-4 style clauses myself in our own final spec, and so I suspect that once we include support for explicit CRS we will generally respect it (as suggested by Case 1 in the linked recommendation doc), but this whole topic has made me keenly aware that specifying a CRS alone is insufficient for all use-cases and we will likely need to also allow for an additional axis-order metadata that overrides the CRS (as suggested by Case 2).

Regardless, in any context where we find ourselves reading / providing data where we find or apply the geoarrow extension metadata, we'll definitely make sure to track the swapped order and/or shuffle the data if necessary, as required per the geoarrow spec.

Also, until we start allowing CRS data to be specified, I'm generally open to swapping the default order on our stored component data to align with geoarrow since this is likely to be the more common representation among folks already working with arrow natively. Needless to say, the choice shall be thoroughly documented and specification-linked to make sure there's no ambiguity.

paleolimbot commented 1 day ago

geoarrow-c/-python developer here...very cool!

Just a note that our axis order assumption policy is a purely pragmatic move...this is ultimately something that needs to be solved at the CRS level (although unfortunately, the people developing CRS specs are the ones pointing at implementations saying "you're doing it wrong!", and implementors mostly keep ignoring them because following their advice leads to angry users whose data is constantly misinterpreted).

I've seen this handled as a boolean ("authority_compliant: true") and as an integer array ("permutation: [0, 1]"). If you're using the "separated"/"struct" coordinate representation, it is almost instantaneous to reorder your axes on import/export (if you're using an interleaved representation, you'll have to make a copy).

We might allow something like that in the GeoArrow specification, although I would prefer that the CRS specification people got it together and realized that requiring every single real-world implementation to have two fields to encode a CRS is not realistic (they either need to provide a CRS database that has CRSes with axis orders that match real-world data OR allow the axis order as part of some official CRS spec definition).

jorisvandenbossche commented 1 day ago

And then also a GeoPandas developer chiming in .. :)

# Scenario 2: GeoPandas interoperability

# Load the GeoArrow data into a GeoPandas series (there are many other ways to load data to a GeoSeries)
series = GeoSeries.from_arrow(data)

rr.log("geopandas_points", rr.GeoPoints(lat_lon=series.get_coordinates().to_numpy(copy=False)))

For supporting geopandas input, you could actually also reuse your general geoarrow support (at least when requiring geopandas >= 1.0, and you are fine with requiring pyarrow for that use case), because you can ask the geopandas object to convert itself to geoarrow using the new series.to_arrow(geometry_encoding="geoarrow", interleaved=True) method. And at that point you can reuse your code to ingest geoarrow data, instead of writing a custom conversion for geopandas (like the series.get_coordinates().to_numpy(copy=False) you have above, which is easy for points, but gets more complicated for other geometry types).

jleibs commented 1 day ago

@paleolimbot

I've seen this handled as a boolean ("authority_compliant: true") and as an integer array ("permutation: [0, 1]").

I was already kind of mentally weighing these two options. Do you recall which other systems/libraries do this. I would love to point at greater ecosystem alignment if possible when making a decision.

I would prefer that the CRS specification people got it together and realized that requiring every single real-world implementation to have two fields to encode a CRS is not realistic (they either need to provide a CRS database that has CRSes with axis orders that match real-world data OR allow the axis order as part of some official CRS spec definition).

This is a great summary.

I'm still quite new to this geo community -- what is the mechanism of communication with the "CRS specification people"? I'm assuming this is more involved than upvoting a GitHub issue somewhere or it would have been solved years ago, but is there some individual/committee we can email and complain at?

abey79 commented 1 day ago

Separately, technically GeoArrow currently requires float64 coordinates, and you use float32.

This is not accurate. Although we do use float32 extensively throughout the rerun data model, the geospatial primitive actually use float64. For example the, LatLon component delegates to the DVec2 datatype, which is float64. This can be verified e.g. in this code-generated arrow datatype definition.

kylebarron commented 1 day ago

Separately, technically GeoArrow currently requires float64 coordinates, and you use float32.

This is not accurate.

Ah, apologies; I must've been looking at the wrong place in the flatbuffers.


Aside from the CRS interoperability discussion above (I generally defer to @paleolimbot and @jorisvandenbossche on that front), I'd also encourage you to implement support for the Arrow PyCapsule Interface in your Python bindings. It enables safe Arrow data interchange without requiring the use of pyarrow. Any library implementing this protocol can exchange data via PyCapsules (safe wrappers of C pointers), where data producer and consumer don't need prior knowledge of each other.

There are a growing number of Python libraries that support this protocol. So for example, you could exchange data zero-copy with Polars or DuckDB without needing to go through pyarrow.

This applies to all Arrow data, not just GeoArrow data. Perhaps a separate issue would be better for this if you're interested in discussing it more?

jleibs commented 1 day ago

I'd also encourage you to implement support for the Arrow PyCapsule Interface in your Python bindings

Yes -- I was looking into this recently and it looked very promising. Glad to hear an endorsement.

paleolimbot commented 23 hours ago

Do you recall which other systems/libraries do this. I would love to point at greater ecosystem alignment if possible when making a decision.

The "global boolean flag" mechanism is used by sf: https://r-spatial.github.io/sf/reference/st_crs.html#arg-authority-compliant . This basically works by invoking proj_normalize_for_visualization() based on the flag: https://github.com/r-spatial/sf/blob/7dd4173f69db6ae6291d529ff333330c6ef18f75/src/proj.cpp#L120-L124 . The flag defaults to "false". This is also what GeoServer does (there's a startup flag you can use to force the authority compliant case: https://docs.geoserver.org/main/en/user/services/wfs/axis_order.html ), and GDAL does so as well ( OSR_DEFAULT_AXIS_MAPPING_STRATEGY=[TRADITIONAL_GIS_ORDER​/​AUTHORITY_COMPLIANT], https://gdal.org/en/latest/user/configoptions.html ).

I checked for any existing usage of permutation...I had assumed there was one because somebody had suggested it in some of our GeoParquet discussions, but it turns out nobody actually does this (despite it possibly being a good idea).

what is the mechanism of communication with the "CRS specification people"?

I'm actually not sure...probably "the OGC" but I don't know which mailing list would be relevant. I take the emperical/pragmatic view that there is no "right" or "wrong", just users with data that it's my job not to misinterpret.

paleolimbot commented 23 hours ago

Oh, just kidding, I just found: https://gdal.org/en/latest/doxygen/classOGRSpatialReference.html#a1db6f8f544502740b218ba4526cebfd4 which implements the "permutation").

(Sorry for the overabundance of detail here...I'm writing about this in an unrelated document and so I've been collecting these links 🙂 )

jleibs commented 8 hours ago

Sorry for the overabundance of detail here...

@paleolimbot no apology necessary. Thank you for the detailed pointers. Knowledge about what the community expects for things like this is so valuable to us.