holoviz / spatialpandas

Pandas extension arrays for spatial/geometric operations
BSD 2-Clause "Simplified" License

SpatialPandas design and features #1

Open jonmmease opened 5 years ago

jonmmease commented 5 years ago

spatialpandas

Pandas and Dask extensions for vectorized spatial and geometric operations.

This proposal is a plan towards extracting the functionality of the spatial/geometric utilities developed in Datashader into this separate, general-purpose library. Some of the functionality here has now (as of mid-2020) been implemented as marked below, but much remains to be done!

Goals

This project has several goals:

  1. [x] Provide a set of pandas ExtensionArrays for storing columns of discrete geometric objects (Points, Polygons, etc.). Unlike the GeoPandas GeoSeries, there will be a separate extension array for each geometry type (PointsArray, PolygonsArray, etc.), and the underlying representation for the entire array will be stored in a single contiguous ragged array that is suitable for fast processing using numba.
  2. [ ] (partially done; see below) Provide a collection of vectorized geometric operations on these extension arrays, including most of the same operations that are included in shapely/geopandas, but implemented in numba rather than relying on the GEOS and libspatialindex C++ libraries. These vectorized Numba implementations will support CPU thread-based parallelization, and will also provide the foundation for future GPU support.
  3. [x] Provide Dask DataFrame extensions for spatially partitioning DataFrames containing geometry columns. Also provide Dask DataFrame/Series implementations of the same geometric operations, but that also take advantage of spatial partitioning. This would effectively replace the Datashader SpatialPointsFrame and would support all geometry types, not only points.
  4. [x] Provide round-trip serialization of pandas/dask data structures containing geometry columns to/from parquet. Writing a Dask DataFrame to a partitioned parquet file would optionally include Hilbert R-tree packing to optimize the partitions for later spatial access on read.
  5. [x] Fast import/export to/from shapely and geopandas. This may rely on the pygeos library to interface directly with the GEOS objects, rather than calling shapely methods.
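
As an illustration of the contiguous ragged layout described in goal 1 (the names and layout details below are hypothetical, not the actual spatialpandas internals), all coordinates for every geometry can live in one flat buffer with an offsets array marking where each geometry starts:

```python
import numpy as np

# Hypothetical sketch of a ragged geometry array: one contiguous float64
# coordinate buffer plus an int64 offsets array. This layout is what makes
# fast numba processing possible (no Python objects per geometry).
flat = np.array([0.0, 0.0,  1.0, 0.0,  1.0, 1.0,    # geometry 0 (3 vertices)
                 2.0, 2.0,  3.0, 2.0], dtype=np.float64)  # geometry 1 (2 vertices)
offsets = np.array([0, 6, 10], dtype=np.int64)  # start index of each geometry in `flat`

def geometry_coords(flat, offsets, i):
    """Return the i-th geometry's vertices as an (n, 2) view, with no copy."""
    return flat[offsets[i]:offsets[i + 1]].reshape(-1, 2)

print(geometry_coords(flat, offsets, 0).shape)  # (3, 2)
```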

These features will make it very efficient for Datashader to process large geometry collections. They will also make it more efficient for HoloViews to perform linked selection on large spatially distributed datasets.

Non-goals

spatialpandas will focus on geometry only, not geography.

Features

Extension Arrays

The spatialpandas.geometry package will contain geometry classes and pandas extension arrays for each geometry type.

Spatial Index

Extension Array Geometric Operations

The extension arrays above will have methods/properties for shapely-compatible geometric operations. These are implemented as parallel vectorized numba functions. Supportable operations include:

Only a minimal subset of these will be implemented by the core developers, but others can be added relatively easily by users and other contributors by starting with one of the implemented methods as an example, then adding code from other published libraries (but Numba-ized and Dask-ified if possible!).
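
As a sketch of what one such vectorized operation could look like over the ragged layout (plain numpy/Python here; under numba one would decorate it with `@njit(parallel=True)` and use `prange` for the outer loop), here is a hypothetical shoelace-area kernel:

```python
import numpy as np

def ragged_polygon_area(flat, offsets):
    """Shoelace area for each polygon stored in the ragged buffer.

    Written loop-style on purpose: this shape of code compiles cleanly
    in numba's nopython mode, with the outer loop parallelizable.
    """
    n_geoms = len(offsets) - 1
    out = np.empty(n_geoms)
    for i in range(n_geoms):
        xy = flat[offsets[i]:offsets[i + 1]].reshape(-1, 2)
        n = xy.shape[0]
        s = 0.0
        for j in range(n):
            x0, y0 = xy[j]
            x1, y1 = xy[(j + 1) % n]  # wrap around to close the ring
            s += x0 * y1 - x1 * y0
        out[i] = 0.5 * abs(s)
    return out

# one unit square stored as a ragged array
flat = np.array([0., 0., 1., 0., 1., 1., 0., 1.])
offsets = np.array([0, 8])
print(ragged_polygon_area(flat, offsets))  # [1.]
```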

Pandas accessors

A custom pandas Series accessor is included to expose these geometric operations at the Series level.

A custom pandas DataFrame accessor is included to track the current "active" geometry column of a DataFrame and to provide DataFrame-level operations.
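
Such accessors can be built on pandas' public accessor-registration API; the accessor name `geom` and the `length` property below are purely illustrative, not the actual spatialpandas API:

```python
import pandas as pd

# Illustrative sketch of a Series-level accessor; the name "geom" and the
# "length" property are hypothetical placeholders for real geometric ops.
@pd.api.extensions.register_series_accessor("geom")
class GeomAccessor:
    def __init__(self, series):
        self._s = series

    @property
    def length(self):
        # Placeholder: a real implementation would dispatch to a vectorized
        # numba kernel over the ragged coordinate buffer.
        return pd.Series([len(v) for v in self._s], index=self._s.index)

s = pd.Series([[(0, 0), (1, 1)], [(0, 0), (1, 0), (1, 1)]])
print(s.geom.length.tolist())  # [2, 3]
```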

Dask accessors

A custom Dask Series accessor is included to expose geometric operations on a Dask geometry Series.

Conversions

Parquet support

Compatibility

For example, geodf.cx becomes spdf.spatial.cx.

Testing

jonmmease commented 5 years ago

@jbednar @jorisvandenbossche @philippjfr @jsignell

brendancol commented 5 years ago

@jonmmease this sounds awesome, and while it's a serious amount of work, at least GEOS already exists to be used as reference while implementing.

brendancol commented 5 years ago

Also, one anecdotal comment...

Two or three years ago, I attempted to implement an RTree using Numba and had trouble getting numba to optimize trees with a varying number of nodes in each level. It kept falling back to python mode, and I gave up after the RTree couldn't beat a "brute force" query of the rectangles using numba in nopython mode.

3 thoughts:

  1. My issue could have been user error (i.e. my error).
  2. Numba may now be able to handle a varying number of nodes in each level without falling back to python mode.
  3. Numba may still have this limitation and @sklam may be a good person to comment on this thread.
sklam commented 5 years ago

@brendancol, can you point me to the old RTree implementation, or any current code that needs to be numba-ified?

brendancol commented 5 years ago

> @brendancol, can you point me to the old RTree implementation, or any current code that needs to be numba-ified?

So it's been a long time since I looked at this... and I certainly don't want to waste your time. The associated code is here: https://github.com/parietal-io/py-rbush

We were probably running tests, profiling, and then just stopped mid-stream...

To boil down the approach, the RTree was using numba classes with deferred types. A simple question you may be able to help answer is:

"Since 2017, have there been significant developments in numba JITed classes and deferred types?"

@sklam Maybe we could think up a simple example (like the BTree example) which uses a varying number of nodes per level to help reproduce this issue and/or demonstrate an approach.

@jonmmease sorry to take over this thread...we can move this numba discussion somewhere else...

jonmmease commented 5 years ago

@brendancol, no problem. I'm actually playing around with a numba RTree implementation right now.

For the use cases I have in mind, I don't think there's a need to iteratively update the RTree after creation. In this case, I think I could get away with an array-based implementation of the binary tree (something like https://www.geeksforgeeks.org/binary-tree-array-implementation/). At least that's what I'm trying at the moment. No performance to report yet :slightly_smiling_face:
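
The array-based layout mentioned above can be sketched as follows: the tree is stored level by level in one flat array, so no pointers or jitted classes are needed, which is what makes it friendly to numba's nopython mode (index arithmetic is illustrative of the general technique, not the spatialpandas code):

```python
import numpy as np

# Pointer-free complete binary tree in a flat array:
# node i's children sit at 2*i + 1 and 2*i + 2, its parent at (i - 1) // 2.
def left(i):
    return 2 * i + 1

def right(i):
    return 2 * i + 2

def parent(i):
    return (i - 1) // 2

# a complete 7-node tree stored level by level: root 0, leaves 3..6
n_nodes = 7
tree = np.arange(n_nodes)  # node payloads (e.g. bounding boxes in an R-tree)
leaves = [i for i in range(n_nodes) if left(i) >= n_nodes]
print(leaves)  # [3, 4, 5, 6]
```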

brendancol commented 5 years ago

@jonmmease Awesome! I think part of my problem was probably "thinking by analogy"... @chbrandt and I were doing a port of https://github.com/mourner/rbush

sklam commented 5 years ago

Numba has support for typed versions of list, dict, and heapq now. It would make things easier.

jbednar commented 5 years ago

Jon, this sounds like a fabulous proposal to me! Some musings:

jonmmease commented 5 years ago

@brendancol, here's the numba R-tree implementation I came up with today https://github.com/jonmmease/spatialpandas/blob/master/spatialpandas/spatialindex/rtree.py

I'll make up some benchmarks at some point, but what I'm seeing is that it matches brute force up to around 1000 elements, and then does significantly better the larger the input size gets: several hundred times faster at a million elements. It also matches or is a bit faster than the rtree package, and it's much faster than rtree at creating the initial index. All of that is for a single lookup. What's really exciting is that these intersection calculations can be called from within numba functions. So I think it should be possible to write a really fast numba implementation of a spatial join on top of this.

brendancol commented 5 years ago

> @brendancol, here's the numba R-tree implementation I came up with today https://github.com/jonmmease/spatialpandas/blob/master/spatialpandas/spatialindex/rtree.py
>
> I'll make up some benchmarks at some point, but what I'm seeing is that it matches brute force up to around 1000 elements, and then does significantly better the larger the input size gets: several hundred times faster at a million elements. It also matches or is a bit faster than the rtree package, and it's much faster than rtree at creating the initial index. All of that is for a single lookup. What's really exciting is that these intersection calculations can be called from within numba functions. So I think it should be possible to write a really fast numba implementation of a spatial join on top of this.

@jonmmease Sick!!! So excited to use this! Also another advantage to consider...libspatialindex isn't threadsafe and I've seen a bunch of seg faults related to rtree and multiprocessing. This will be super helpful for the Python GIS community!

Question: This has value outside of pandas. New standalone RTree library?

jonmmease commented 5 years ago

> Question: This has value outside of pandas. New standalone RTree library?

Maybe eventually, sure. I'll probably want to wait to make sure it actually handles the use cases here before splitting it off, though.

jorisvandenbossche commented 5 years ago

@jonmmease thanks a lot for writing up this proposal!

To repeat what I said in the datashader issue: I think there is certainly value in having a good, reusable implementation of geometry data support with a ragged array representation. Below are already some comments on the goals (from a geopandas developer, so somewhat critical :-)); I will try to write down more comments on the rest of the discussion later.

About the goals:

Goals 1 (ragged-array based extension arrays) and 2 (numba-based algorithms) are IMO distinct goals for a project like this. But goals 3 (a dask extension for spatially partitioned DataFrames) and 4 (round-trip parquet serialization making use of the spatial index) are perfectly valid (and realistic) goals for GeoPandas as well (it's also what was written in the SBIR project, and more or less what the POC in https://github.com/mrocklin/dask-geopandas/ is doing, partially). Moreover, I think implementations of those could be rather ignorant of the actual representation of the geometries under the hood.

So I would love to see or explore if we can build common infrastructure for this, instead of each of the projects doing this themselves.

One possible way (but I need to think this through more) is to make the dask extensions, IO functionality, etc. work with different kinds of GeometryArrays (if they can rely on the public methods of those, that might be possible). But I will think about this aspect some more.

jbednar commented 5 years ago

Just to clarify, goal 4 is only in service of the other goals -- anything we work on needs to have some sort of persistence/serialization format, so that it's practical to work with, and the goal is not to provide one (if at all possible), just for there to be one. So it would be ideal if there is such a format available that both GeoPandas and this non-GeoPandas code can both use.

In any case, I'm very excited about the potential for having algorithms, persistence formats, etc. that can work with both GEOS and a Python-native, Numba-izable object, which would be amazing if we can achieve that.

jonmmease commented 5 years ago

> But goals 3 (dask extension to have spatially partitioned DataFrames) and 4 (round-trip serialization with parquet making use of spatial index) are perfectly valid (and realistic) goals for GeoPandas as well...

Thanks, yeah that certainly makes sense. Is there consensus toward what would make sense as a parquet serialization encoding for geopandas?

> (it's also what was written in the SBIR project, and more or less what the POC in https://github.com/mrocklin/dask-geopandas/ is doing (partially)). And moreover, I think implementations of those could be rather ignorant of the actual representation of the geometries under the hood... So I would love to see or explore if we can build common infrastructure for this, instead of each of the projects doing this themselves.

ok, great. I'll take a look at dask-geopandas to get a sense of how things are structured there. At the Dask level, I think the main thing that is needed is for the Dask DataFrame geometry columns to have metadata containing the bounding box of each partition.

It might be nice if there were some general concept of ExtensionArray metadata that the Dask-level ExtensionArray would have access to for each of its partitions. Ideally, there would be a standard way to store this metadata in the parquet partitions so that it would automatically be persisted when using the standard dask parquet functions.
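
The per-partition bounding-box idea can be sketched as follows, simulating dask partitions with plain pandas slices so the example stays self-contained (all names here are illustrative):

```python
import numpy as np
import pandas as pd

# Simulated "partitions" of a point DataFrame; in dask these would be the
# DataFrame's actual partitions rather than manual slices.
df = pd.DataFrame({"x": np.arange(8.0), "y": np.arange(8.0) * 2})
partitions = [df.iloc[:4], df.iloc[4:]]

def partition_bounds(parts):
    """(xmin, ymin, xmax, ymax) of the geometry column per partition."""
    return [(p.x.min(), p.y.min(), p.x.max(), p.y.max()) for p in parts]

bounds = partition_bounds(partitions)
print(bounds[0])  # (0.0, 0.0, 3.0, 6.0)

# a query box then only needs to touch partitions whose bounds intersect it
def hits(bounds, x0, y0, x1, y1):
    return [i for i, (bx0, by0, bx1, by1) in enumerate(bounds)
            if not (bx0 > x1 or bx1 < x0 or by0 > y1 or by1 < y0)]

print(hits(bounds, 0, 0, 2, 2))  # [0]
```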

jorisvandenbossche commented 5 years ago

> Is there consensus toward what would make sense as a parquet serialization encoding for geopandas?

For geopandas we are currently working on parquet support, using WKB (https://github.com/geopandas/geopandas/pull/1180/), which is different from nested lists, but I think with good metadata it should be possible to support different formats.
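
For reference, WKB is a compact binary layout; here is a minimal hand-rolled illustration for a 2-D point (real code would rely on shapely's `wkb` module rather than packing bytes by hand):

```python
import struct

# A WKB 2-D point is: one byte-order byte (1 = little endian),
# a uint32 geometry type code (1 = Point), then two doubles.
def point_to_wkb(x, y):
    return struct.pack("<BIdd", 1, 1, x, y)

def wkb_to_point(buf):
    _, geom_type, x, y = struct.unpack("<BIdd", buf)
    assert geom_type == 1  # only Point handled in this sketch
    return x, y

buf = point_to_wkb(2.5, -3.0)
print(len(buf), wkb_to_point(buf))  # 21 (2.5, -3.0)
```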

> ok, great. I'll take a look at dask-geopandas to get a sense of how things are structured there. At the Dask level, I think the main thing that is needed is for the Dask DataFrame geometry columns to have metadata containing the bounding box of each partition.

Know that the dask-geopandas repo is outdated (so it probably won't work with current dask) and was only a small proof of concept. But I think the main question here is how to extend dask with the spatial partitioning information. I was under the assumption that that would require a subclass (see also https://github.com/dask/dask/issues/5438). Currently dask dataframes use the index for partition bounds, but a subclass could override that with spatial bounds. Whether it is possible to add this as metadata only to the geometry columns and have this information available in the algorithms (which would avoid the need for a subclass), that I don't know.

jbednar commented 4 years ago

@jonmmease, I've updated your initial comment on this issue to put checkboxes on all of the features that I think have now been implemented, and unticked checkboxes on all those not yet implemented. If you get a chance, it would be great if you could skim through that list and check my guesses, to see if I got any wrong that you can update.

jonmmease commented 4 years ago

Looks good, I updated a couple check boxes:

I didn't make any changes for this above, but I didn't end up using the DataFrame accessor paradigm. It turned out to be cleaner to use subclasses the way GeoPandas does. In the beginning, I didn't realize how much support pandas and dask.dataframe already had for custom subclasses. So this ended up being easier, and resulted in a more GeoPandas-compatible API.

benbovy commented 4 years ago

This project looks very interesting!

On a related topic, @willirath and I started working on xoak, which extends xarray with spatial indexes (currently based on scikit-learn's BallTree, as a proof-of-concept).

From what I've read here, we have many goals in common, i.e., domain-agnostic features and package dependencies (only PyData core libraries like Numba and Dask), support parallel/distributed indexing with Dask (see https://github.com/ESM-VFC/xoak/issues/1#issuecomment-650084718 for a concrete use case).

The rtree index implemented in spatialpandas thus certainly has a lot of value outside of Pandas, and I'd be very excited to contribute to a standalone library that could be reused by xarray as well. Supporting such indexes is part of xarray's development roadmap, which we are going to tackle very soon. More specifically, it would be great to have a more flexible implementation like in scikit-learn, where BallTree and KDTree are built on top of a common, static binary tree, but here using Numba rather than Cython (although I don't know if that would be easy to do with Numba).

jonmmease commented 4 years ago

Hi @benbovy, I'm not so involved with this project any more, but I think it would be great to pull out the Hilbert RTree implementation into a standalone package if you would find it useful! Have you looked through the implementation, and does the approach make sense to you? We don't have any specific documentation for it at this point, but it's well tested against RTree and I think the API and code are fairly well documented through docstrings and inline comments.

I won't be able to help out much with actually standing up a new package for it, but I'm happy to answer any questions that come up.

jbednar commented 4 years ago

I'm the de facto owner of spatialpandas right now, but only reluctantly, in that it fills a key unmet need for our HoloViz.org visualization toolchains but I think it really should have a life outside of that purpose and I'd love someone else to run with it from here. In the meantime, I would be very happy to see any of its code disappear into any well-maintained library that we could then simply import, assuming that such a library only depends on the core PyData libraries (xarray, pandas, numpy, Numba, Dask, etc.).

benbovy commented 4 years ago

I tend to agree with @jorisvandenbossche's comment

> Goals 1 (ragged-array based extension arrays) and 2 (numba-based algorithms) are IMO distinct goals for a project like this.

I'm wondering if something like Awkward wouldn't help eventually. I've been watching the SciPy 2020 presentation and was pretty amazed, especially when seeing how it integrates with numba.

Awkward is still in heavy development and I don't know how it will evolve in the future, but IMO it has much potential to be reused for goal 1 without the need to maintain independent implementations of numpy/numba-compatible ragged arrays. I think it could even provide a common basis for supporting spatial structures in both (geo)pandas and xarray (see, e.g., https://github.com/scikit-hep/awkward-1.0/issues/350 and https://github.com/scikit-hep/awkward-1.0/issues/27#issuecomment-665526222).

Regarding goal 2, I'd love to see a flexible numba implementation of spatial indexes in a well-maintained, separate library. I think that some people coming from the xarray side (including me) would be willing to contribute on a regular basis. Here too, Awkward's numba integration might help if we need to deal with more complex (dynamic) data structures than simple arrays.

benbovy commented 4 years ago

> Have you looked through the implementation, and does the approach make sense to you?

I had a brief look at it. I haven't worked extensively on the implementation of such algorithms yet, but overall the code seems already pretty clear and well documented to me, thanks for that @jonmmease!

> I won't be able to help out much with actually standing up a new package for it, but I'm happy to answer any questions that come up.

Will do if needed, thanks!

brendancol commented 4 years ago

@jonmmease @benbovy I really like the roadmap here! I'm a datashader contributor and the maintainer of xarray-spatial.

Xarray-Spatial is looking to add a numba-ized version of GDAL's polygonize, and we are targeting spatialpandas.PolygonArray as the return type. My hope is that I'll get up to speed with spatialpandas internals and can then actually help with contributions.

Keep me in mind as a contributor; happy to chat features / do zooms.

jbednar commented 4 years ago

@benbovy , Awkward is really cool! We looked closely into it when developing spatialpandas, including discussions with Jim Pivarski in January. @jonmmease probably remembers the details better, but my own (sketchy) recollection was that it was very promising but not quite ready to do what we needed at the time. Note that spatialpandas already does rely on Numba, so I don't think there's necessarily any special benefit to using Awkward for that particular purpose.

jonmmease commented 4 years ago

> We looked closely into it when developing spatialpandas, including discussions with Jim Pivarski in January. @jonmmease probably remembers the details better...

I think it was mostly a matter of what was already supported in Awkward as of last fall, when I started working on spatialpandas. In particular, we were looking for native Arrow/Parquet serialization, which it looks like was recently added to Awkward (https://github.com/scikit-hep/awkward-1.0/pull/343).

I didn't get far enough into researching Awkward to understand whether it is possible to use a schema to define the number of allowed nesting levels. There would also be the question of how to distinguish geometry types that share identical representations (e.g. MultiLine and Polygon).

Most of the complexity in spatialpandas that isn't directly related to geometric algorithms is in the geometry extension array implementations, so it would definitely be a win in terms of reduced code complexity if Awkward could be adopted.

jorisvandenbossche commented 3 years ago

Sorry to hijack this thread, but thought to give a heads up here since several potentially interested people might follow this one (also cc @brl0).

During the upcoming Dask Summit, we have a 2-hour workshop scheduled about scaling geospatial vector data on Thursday May 20th at 11-13:00 UTC (https://summit.dask.org/schedule/presentation/22/scaling-geospatial-vector-data/).

First of all: you're of course all welcome to attend/participate! But my specific question: it would be interesting to have someone from spatialpandas (or someone more familiar with spatialpandas) give a brief presentation about it. There is quite some potential overlap with dask-geopandas (in goals and implementation), and I think it would be interesting to discuss potential synergies / better compatibility between both.

We can also follow-up on https://github.com/geopandas/community/issues/4 (general issue about the workshop).

jonmmease commented 3 years ago

@jorisvandenbossche, I don't know what @jbednar's current thinking is regarding spatialpandas, but I don't plan to personally work on it any further.

That said, I'd be happy to summarize what's been done in case there are any ideas worth learning from / stealing. I think the main idea that would be relevant to the Dask discussion would be the use of a space filling curve and R-tree to make it possible to use the regular Dask DataFrame partition index as a spatial index as well (rather than creating a whole new Dask data structure).
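
The space-filling-curve trick described here can be sketched as follows: sort rows by an interleaved-bit key before partitioning, so spatially nearby points land in the same partition and the ordinary partition index doubles as a coarse spatial index. (spatialpandas actually uses a Hilbert curve; a Morton / Z-order key is shown because it is simpler to implement.)

```python
# Morton (Z-order) key: interleave the bits of two integer grid coordinates.
# Sorting by this key groups spatially nearby points together, which is the
# essence of the spatial-partitioning approach described above.
def morton_key(ix, iy, bits=16):
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (2 * b)        # x bits -> even positions
        key |= ((iy >> b) & 1) << (2 * b + 1)    # y bits -> odd positions
    return key

pts = [(0, 0), (255, 255), (1, 0), (254, 255)]
order = sorted(pts, key=lambda p: morton_key(*p))
print(order)  # [(0, 0), (1, 0), (254, 255), (255, 255)]
```

Partition boundaries drawn over this sorted order then correspond to compact spatial regions, which is what lets the regular Dask partition index stand in for a spatial index.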

Does that sound good?

jbednar commented 3 years ago

I think spatialpandas is great, but at Anaconda we are only planning to take it forward if we get specific funding to do so or if it's required for a specific project. Otherwise we just plan to fix bugs in it when that's required. I'd hope @brl0 would be able to push it forward. We are also looking to Awkward for possible implementations of ragged arrays that Datashader can use.

I'd love for any elements of SpatialPandas that are useful to GeoPandas or Dask-GeoPandas to migrate to one of those projects, but note that for our purposes as maintainers of general-purpose software we cannot depend on fiona or gdal, and thus as far as I know cannot simply delete SpatialPandas in favor of Dask-GeoPandas even if Dask-GeoPandas subsumed SpatialPandas functionality.

I'll try to attend the workshop, but @jonmmease knows a lot more about the internals, so I would only be able to discuss scoping and roadmap issues.

jorisvandenbossche commented 3 years ago

@jonmmease indeed, the spatial (re)partitioning concepts of spatialpandas are most relevant I think for the discussion, so a brief overview of the items you list would be great. Thanks a lot in advance!

> note that for our purposes as maintainers of general-purpose software we cannot depend on fiona or gdal

Fiona/gdal is not a hard requirement for GeoPandas; you only need it if you want to use it for IO. This has always been the case, but it's only since the last release that we actually stopped importing fiona on geopandas import (which created this seemingly hard dependency) and instead only import it when it is used. The only non-Python required dependency is GEOS, which is a simple C++ library without any external dependencies (I don't think it's a harder dependency than e.g. numpy).

jbednar commented 3 years ago

For us, "dependency" is defined both in terms of "will the software be usable by us for what we want to do without that package?" and "does either the pip or the conda package depend on that?" If GeoPandas can make both of those dependencies go away, I'd love to see some reunification here!

jonmmease commented 3 years ago

One thing that @jorisvandenbossche and I talked about a while back is whether it would be practical for GeoPandas to define an interface for the "Geometry Column" so that something like the spatialpandas contiguous memory representation could be used as an alternative to the default array of GEOS/Shapely objects representation.

I believe this functionality, combined with dask-geopandas and the removal of fiona/gdal as hard dependencies, is everything that would be needed for Datashader to be able to switch from spatialpandas to GeoPandas. And maybe in that scenario, spatialpandas shrinks down to only consist of the implementation of the Geometry Column interface.

jbednar commented 3 years ago

And then we can sneak that alternative Geometry Column interface as an option in GeoPandas somewhere and our work here is done! Oh, did I say that out loud? :-)