Support GeoPandas GeoDataFrames

jorisvandenbossche commented 3 years ago

Context:

Datashader requires plain arrays of coordinates ("ragged arrays" in case of lines/polygons, with additional offset/indices arrays) to efficiently visualize geometries (and this is also the representation that SpatialPandas uses under the hood).
GeoPandas stores geometries as "opaque" Shapely objects (wrapping GEOS C++ object), and converting this to an array of coordinates is currently not always very efficient (although it's certainly possible, it's also what spatialpandas does in the from_geopandas conversion code eg here)

With the latest release of PyGEOS, the conversion from geometries to (ragged) coordinate arrays can be done much more efficiently, though.

Function using PyGEOS to convert array of GEOS geometries to arrays of coordinates / offsets (+ putting those in a spatialpandas array)

```python def get_flat_coords_offset_arrays(arr): """ Version for MultiPolygon data """ # explode/flatten the MultiPolygons arr_flat, part_indices = pygeos.get_parts(arr, return_index=True) # the offsets into the multipolygon parts offsets1 = np.insert(np.bincount(part_indices).cumsum(), 0, 0) # explode/flatten the Polygons into Rings arr_flat2, ring_indices = pygeos.geometry.get_rings(arr_flat, return_index=True) # the offsets into the exterior/interior rings of the multipolygon parts offsets2 = np.insert(np.bincount(ring_indices).cumsum(), 0, 0) # the coords and offsets into the coordinates of the rings coords, indices = pygeos.get_coordinates(arr_flat2, return_index=True) offsets3 = np.insert(np.bincount(indices).cumsum(), 0, 0) return coords, offsets1, offsets2, offsets3 def spatialpandas_from_pygeos(arr): """ Create the actual spatialpandas MultiPolygonArray by putting the individual arrays into a pyarrow ListArray """ coords, offsets1, offsets2, offsets3 = get_flat_coords_offset_arrays(arr) coords_flat = coords.ravel() offsets3 *= 2 # create a pyarrow array from this _parr3 = pa.ListArray.from_arrays(pa.array(offsets3), pa.array(coords_flat)) _parr2 = pa.ListArray.from_arrays(pa.array(offsets2), _parr3) parr = pa.ListArray.from_arrays(pa.array(offsets1), _parr2) return spatialpandas.geometry.MultiPolygonArray(parr) ```

With such a faster conversion available, it becomes more interesting for Datashader to directly support geopandas.GeoDataFrame, instead of requiring an up-front conversion to spatialpandas.GeoDataFrame. Currently, the spatialpandas requirement is hardcoded here (for polygons()):

https://github.com/holoviz/datashader/blob/1ae52b65ec8a79920e5db9c6c04487f254428553/datashader/core.py#L694-L701

Adding support for GeoPandas can be done, using the function I defined above, with something like (leaving aside imports of geopandas/pygeos):

    from spatialpandas import GeoDataFrame
    from spatialpandas.dask import DaskGeoDataFrame
    if isinstance(source, DaskGeoDataFrame):
        # Downselect partitions to those that may contain polygons in viewport
        x_range = self.x_range if self.x_range is not None else (None, None)
        y_range = self.y_range if self.y_range is not None else (None, None)
        source = source.cx_partitions[slice(*x_range), slice(*y_range)]
+   elif isinstance(source, geopandas.GeoDataFrame):
+      # Downselect actual rows to those for which the polygon is in viewport
+      x_range = self.x_range if self.x_range is not None else (None, None)
+      y_range = self.y_range if self.y_range is not None else (None, None)
+      source = source.cx[slice(*x_range), slice(*y_range)]
+      # Convert the subset to ragged array format of spatialpandas
+      geometries = spatialpandas_from_pygeos(source.geometry.array.data)
+      source = pd.DataFrame(source)
+      source["geometry"] = geometries
    elif not isinstance(source, GeoDataFrame):
        raise ValueError(
            "source must be an instance of spatialpandas.GeoDataFrame or \n"

This patch is what I tried in the following notebook, first using a smaller countries/provinces dataset from NaturalEarth, and then with a larger NYC building footprints dataset (similar to https://examples.pyviz.org/nyc_buildings/nyc_buildings.html).

Notebook: https://nbviewer.jupyter.org/gist/jorisvandenbossche/3e7ce14cb5118daa0f6097d686981c9f

Some observations:

This actually works nicely!
Initial rendering with datashader is a bit slower when directly using geopandas.GeoDataFrame because of the extra conversion step. But, the conversion takes less time than the actual rendering, so it's only a relatively small slowdown.
Zooming into small areas is really fast with the large dataset. And actually faster as using spatialpandas.GeoDataFrame (because I added a .cx spatial subsetting step in my patch above, filtering the data before rendering). For spatialpandas, such subsetting is only added for the dask version.

Gif of the notebook in action (the buildings dataset is fully loaded in memory, and not pararellized with dask, unlike the PyViz gallery example), interactively zooming into a GeoPandas dataframe with Datashader and Holoviews:

Peek 2021-06-08 13-45

(note this was done a bit manually with Holoviews DynamicMap and a callback with Datashader code, because the integrated datashade functionality of Holoviews/HvPlot wouldn't preserve the geopandas.GeoDataFrame with the current versions)

So, what's the way forward here? I think I showed that it can be useful for Datashader to directly support GeoPandas, and that it can also be done with a relatively small change to datashader. The big question, though, is about the "GEOS -> ragged coordinate arrays -> spatialpandas array" conversion. Where should this live / how should DataShader and GeoPandas interact?

Some initial thoughts about this:

The quickest way to get this working is to do this conversion in DataShader (the above functions only rely on pygeos (which the user will already have when using GeoPandas) and pyarrow/spatialpandas (which are already requirements for this part of datashader)). But, long term, is this code that Datashader wants to maintain? Or is there a more logical place for this code?
It could also live in SpatialPandas, since they already have code for such conversion of GeoPandas <-> SpatialPandas (and it would optimize its current implementation of that). But, should a user need to have SpatialPandas to plot GeoPandas with Datashader? (see also last bullet point)
Alternatively, GeoPandas could add a function or method that converts its geometries into this required format, and then Datashader can call that method to get the data it needs. Long term, this might be the better solution (since other projects interacting with geopandas might also want to get the geometries in this format).
How to communicate this data? In the current POC version included above, I first get the raw coordinate and offset arrays, and then convert them into a pyarrow.ListArray to then convert it to a spatialpandas MultiPolygonArray. But in the end, what Datashader needs is only the raw coordinates and offsets arrays.
For example, for rendering polygons, you access .buffer_values and .buffer_offsets of the MultiPolygonArray, which gives back the raw coordinate and offset arrays. So in theory, this roundtrip through pyarrow and spatialpandas is not needed, and some method could convert GeoPandas geometries into coordinate/offset arrays, which could be directly handled by datashader as is. This would however require a bit more changes in datashader in the way that data gets passed down from Canvas.polygons() into the glyph rendering (as currently that uses the spatiapandas array as container for the coordinates/offsets).

One possible idea (relating to the third bullet point) is to standardize on some kind of __geo_arrow_arrays__ interface (returning the coordinate + offset arrays), similarly to the existing __geo_interface__ that returns the geometries in GeoJSON-like dictionary (and which can be used now for accepting any "geometry-like" object even from libraries you don't know).

ablythed commented 3 years ago

@jorisvandenbossche Can you fill in some details?

jorisvandenbossche commented 3 years ago

@ablythed thanks for the reminder. I started a draft at the time, but now finished it up. I updated the top post.

martinfleis commented 3 years ago

I think that this can be useful elsewhere and should live in GeoPandas. It may also be easier for us to maintain it than for datashader.

One possible idea (relating to the third bullet point) is to standardize on some kind of geo_arrow_arrays interface

+1 for this idea.

I would also say that we should find a way of a direct interface between GeoPandas and Datashader, one which does not depend on spatialpandas. From what I understood from @jbednar, if there'll be an efficient interface of this kind they'll be more than happy to retire spatialpandas (I guess once we'll manage to have the similar one in dask-geopandas).

jbednar commented 3 years ago

Very cool! Thanks for all this; I'd love to see direct Datashader support for GeoPandas data! As a secondary priority, I'd also love to see all of the SpatialPandas functionality disappear into other suitable libraries; we want to keep the ecosystem/landscape small and manageable for users except when deep and genuine differences in requirements dictate.

Putting the "GEOS -> ragged coordinate arrays" conversion code into GeoPandas seems to me like it would make the most sense, with Datashader and potentially Dask-GeoPandas working directly with that output. As Martin indicates, the raw format could be useful for other algorithms as well. I agree that standardizing on "some kind of __geo_arrow_arrays__ interface" would be necessary for this to work, and I am happy for Datashader to follow GeoPandas's lead on this. I.e., if GeoPandas could define such an interface and demonstrate that it works with Datashader, we could remove the trip through SpatialPandas and support GeoPandas directly.

If we do that, can we further remove the underlying need for SpatialPandas to exist at all? As background, I can recall four reasons we created SpatialPandas instead of using or extending existing libraries like GeoPandas and now Dask-GeoPandas:

Datashader already has fairly heavyweight dependencies because of Numba and Dask that make it difficult to install even on its own, and we don't want to make it unusable by add anying additional difficult-to-install required dependencies for rendering lines or polygons. In particular, we can't require that Datashader users install fiona or gdal as previously required by GeoPandas, particularly given that most Datashader users are not geoscientists and are not accustomed to wrestling with those difficult library dependencies.
Datashader is intended for use in both geospatial and non-geo applications, and thus needs to be fully usable for both geo-referenced and non-geo-referenced data. Datashader should be just as useful for rendering printed circuit boards and Voronoi diagrams as for rendering states and counties.
Datashader needs access to the raw coordinate arrays in large blocks if it is to be able to efficiently render large datasets, and (as mentioned above) GeoPandas arrays/Series of Shapely objects do not provide data in suitable chunks for Datashader to blaze through.
The above 3 reasons would all suggest building functionality into Datashader itself, which indeed is what we did initially, but given that the resulting data structures were also useful for spatial indexing and for fast vectorized non-viz spatial algorithms in general, such as point-in-polygon testing and distance metrics, we split SpatialPandas out into its own package that's about spatial processing rather than viz.

I'm very happy to revisit these four considerations now and think about where we've ended up a few years later. The situation has definitely improved, largely through the hard work of people on the GeoPandas team:

GeoPandas is now installable as geopandas-base, with no heavyweight dependencies that I can see! That came just in the nick of time to fix our SpatialPandas builds; thanks so much! Specifically, geopandas-base allows installations without gdal/fiona, which makes it feasible for non-geo people to install both Datashader and GeoPandas without wrestling with environments.
Apart from installation issues, relying on GeoPandas alone as our representation for lines and shapes is problematic, because GeoPandas docs makes it clear it's intended only for geographic applications, not spatial applications in general. The geopandas.org home page says explicitly that the goal is to "make working with geospatial data in python easier", and the about page says that "GeoPandas is an open source project to add support for geographic data to pandas objects", so it seems clear that non-geographic applications should not be expected to be supported by GeoPandas.
Joris's code above seems to address reason 3 quite well, providing a way for Datashader to "get at" the underlying coordinates. I can't tell if the result is a copy of all the data or just a view of chunks. If it's a copy, that should still be useful, but shouldn't be the only way we can render things, as it will limit how large an array we can work with.
While Datashader doesn't seem like a good home for spatial algorithms, can they live in GeoPandas or Dask-GeoPandas, using Joris's proposed representation? I'm not sure, given that https://github.com/holoviz/spatialpandas/issues/1 doesn't seem to list any that are geographically specialized; they all seem just like spatial tasks independent of the Earth's surface. E.g. we'd like to use point-in-polygon testing for interactive apps regardless of the type of data.

So, where does that leave us? Seems like GeoPandas has addressed reason 1 and there's a good plan to address reason 3. What about reasons 2 and 4? Is it reasonable for code that does not assume it's used with geographic shapes to live in GeoPandas? Would GeoPandas be ok with stating on the home page that "Geo" here means both "Geographic" and "Geometric", and to say that while the algorithms in GeoPandas are largely inspired by geographic applications, they should also be fully usable for 2D geometry in general? If so I don't think we'd need to continue with SpatialPandas at all, and can coalesce around GeoPandas as the data structure and spatial algorithms while supporting Datashader for bulk rendering.

martinfleis commented 3 years ago

@jbednar GeoPandas talks about geographic data while in reality supports any planar geometry no matter the origin.

Would GeoPandas be ok with stating on the home page that "Geo" here means both "Geographic" and "Geometry", and to say that while the algorithms in GeoPandas are largely inspired by geographic applications, they should also be fully usable for 2D geometry in general?

Yes. While we don't mention it in the docs at the moment, GeoPandas data structures and functionality is fully usable for 2D geometry in general in the same way shapely/pygeos is. We just add projection support on top if one needs it. I'll open an issue in the GeoPandas repo to clarify the documentation in this sense. -> https://github.com/geopandas/geopandas/issues/1971

jbednar commented 3 years ago

Perfect, thanks! If I can tell people to use GeoPandas for all their 2D planar shapes regardless of what they are, then I am very happy for Datashader to work directly with whatever the rawest form of coordinate access GeoPandas can provide as the way to work with ragged shapes using Numba and Dask. (Non-ragged shapes like dense n-D arrays of same-length lines can already be supported by xarray and numpy.) Excellent!

NickLucche commented 2 years ago

Hi, any news or follow-up on this feature? Is there any further optimization we could benefit from with the on-going transition to Shapely 2.0..?

ianthomas23 commented 11 months ago

I have started looking at this, work-in-progress PR is #1285. I am happy to talk about it there or here.

holoviz / datashader

Support GeoPandas GeoDataFrames #1006