jonmmease opened this issue 5 years ago
@jbednar @jorisvandenbossche @philippjfr @jsignell
@jonmmease this sounds awesome, and while it's a serious amount of work, at least GEOS already exists to be used as reference while implementing.
Also, one anecdotal comment...
Two or three years ago, I attempted to implement an RTree using Numba and had trouble with numba optimizing trees with varying numbers of nodes in each level. It kept falling back to python mode and I gave up after the RTree couldn't beat a "brute force" query of the rectangles using numba in nopython mode.
3 thoughts:
@brendancol, can you point me to the old RTree implementation, or any current code that needs to be numba-ified?
> @brendancol, can you point me to the old RTree implementation, or any current code that needs to be numba-ified?
So it's been a long time since I looked at this... and I certainly don't want to waste your time. The associated code is here: https://github.com/parietal-io/py-rbush
We were probably running tests, profiling, and then just stopped mid-stream...
To boil down the approach, the RTree was using numba classes with deferred types. A simple question you may be able to help answer is:
"Since 2017, have there been significant developments in numba JITed classes and deferred types?"
@sklam Maybe we could think up a simple example (like the BTree example) which uses a varying number of nodes per level to help reproduce this issue and / or demonstrate an approach.
@jonmmease sorry to take over this thread...we can move this numba discussion somewhere else...
@brendancol, no problem. I'm actually playing around with a numba RTree implementation right now.
For the use cases I have in mind, I don't think there's a need to iteratively update the RTree after creation. In this case, I think I could get away with an array-based implementation of the binary tree (something like https://www.geeksforgeeks.org/binary-tree-array-implementation/). At least that's what I'm trying at the moment. No performance to report yet :slightly_smiling_face:
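The array-based layout mentioned above can be sketched in a few lines. This is a hypothetical 1-D illustration (not the spatialpandas code): a static, array-backed bounding-interval tree where the children of node `i` live at `2*i + 1` and `2*i + 2`, so the whole tree is just flat arrays with no pointer-based node objects, which is the kind of data layout numba's nopython mode handles well.

```python
# Sketch: static array-backed bounding-interval tree (1-D analogue of an
# R-tree). Leaves hold the input intervals; each internal node holds the
# union of its children's bounds. Children of node i sit at 2*i+1, 2*i+2.

def build_tree(intervals):
    """intervals: list of (lo, hi). Returns (lo_arr, hi_arr, n_leaves)."""
    n = 1
    while n < len(intervals):
        n *= 2                                # pad leaf count to a power of two
    INF = float("inf")
    size = 2 * n - 1
    lo = [INF] * size
    hi = [-INF] * size                        # padding leaves intersect nothing
    for i, (a, b) in enumerate(intervals):
        lo[n - 1 + i], hi[n - 1 + i] = a, b
    for i in range(n - 2, -1, -1):            # fill internal nodes bottom-up
        lo[i] = min(lo[2 * i + 1], lo[2 * i + 2])
        hi[i] = max(hi[2 * i + 1], hi[2 * i + 2])
    return lo, hi, n

def query(lo, hi, n, a, b):
    """Indices of input intervals intersecting [a, b]."""
    out, stack = [], [0]
    while stack:
        i = stack.pop()
        if hi[i] < a or lo[i] > b:            # node bounds miss the query
            continue
        if i >= n - 1:                        # leaf node
            out.append(i - (n - 1))
        else:
            stack.extend((2 * i + 1, 2 * i + 2))
    return sorted(out)
```

Because the tree is built once and never mutated, the "varying number of nodes per level" problem from the earlier RTree attempt never arises: every level is a fixed slice of the same flat arrays.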
@jonmmease Awesome! I think part of my problem was probably "thinking by analogy"... @chbrandt and I were doing a port of https://github.com/mourner/rbush
Numba now has support for typed versions of `list` and `dict`, and supports `heapq`. It would make things easier.
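A minimal sketch of those typed containers (the function name `cumulative` is illustrative; the example is guarded so it also runs as plain Python where numba isn't installed):

```python
# Sketch: numba.typed.List is a homogeneous, nopython-safe container that
# avoids the old reflected-list fallbacks to object mode.
try:
    from numba import njit
    from numba.typed import List
except ImportError:              # fall back to plain Python without numba
    njit = lambda f: f
    List = list

@njit
def cumulative(values):
    out = List()                 # element type is inferred on first append
    total = 0.0
    for v in values:
        total += v
        out.append(total)
    return out

result = cumulative(List([1.0, 2.0, 3.0]))
```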
Jon, this sounds like a fabulous proposal to me! Some musings:
@brendancol, here's the numba R-tree implementation I came up with today: https://github.com/jonmmease/spatialpandas/blob/master/spatialpandas/spatialindex/rtree.py

I'll make up some benchmarks at some point, but what I'm seeing is that it matches brute force up to around 1000 elements, and then does significantly better the larger the input size gets: several hundred times faster at a million elements. It also matches or is a bit faster than the `rtree` package, and it's much faster to create the initial index than `rtree`. All of that is for a single lookup. What's really exciting is that these intersection calculations can be called from within numba functions, so I think it should be possible to write a really fast numba implementation of a spatial join on top of this.
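The core idea behind a packed Hilbert R-tree like the one linked above is to sort entries along a Hilbert curve before grouping them into leaves. A minimal sketch (the `xy2d` routine is the standard Hilbert xy-to-distance conversion; `pack_leaves` and its `capacity` parameter are illustrative, not spatialpandas API):

```python
# Sketch: Hilbert-curve packing. Sort rectangles by the Hilbert index of
# their centers, then cut the sorted order into fixed-size leaves, so
# nearby rectangles tend to land in the same leaf.

def xy2d(n, x, y):
    """Hilbert distance of point (x, y) on an n x n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                   # rotate quadrant so the curve connects
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

def pack_leaves(boxes, capacity, n=1 << 16):
    """boxes: list of (x0, y0, x1, y1) with coords scaled to [0, n).
    Returns leaves as lists of original box indices."""
    keyed = sorted(
        range(len(boxes)),
        key=lambda i: xy2d(n,
                           int((boxes[i][0] + boxes[i][2]) / 2),
                           int((boxes[i][1] + boxes[i][3]) / 2)))
    return [keyed[i:i + capacity] for i in range(0, len(keyed), capacity)]
```

Because the packing is just a sort plus a reshape, the resulting tree is fully static, which fits the array-based, no-incremental-updates approach discussed earlier in the thread.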
@jonmmease Sick!!! So excited to use this! Also another advantage to consider: libspatialindex isn't threadsafe, and I've seen a bunch of seg faults related to `rtree` and multiprocessing. This will be super helpful for the Python GIS community!
Question: This has value outside of pandas. New standalone RTree library?
> Question: This has value outside of pandas. New standalone RTree library?
Maybe eventually, sure. I'll probably want to wait to make sure it actually handles the use cases here before splitting it off, though.
@jonmmease thanks a lot for writing up this proposal!
To repeat what I said in the datashader issue: I think there is certainly value in having a good, reusable implementation of geometry data support with a ragged array representation. Below are some initial comments on the goals (from a GeoPandas developer, so somewhat critical :-)); I'll try to write down more comments on the rest of the discussion later.
About the goals:
Goals 1 (ragged-array based extension arrays) and 2 (numba-based algorithms) are IMO distinct goals for a project like this. But goals 3 (a dask extension for spatially partitioned DataFrames) and 4 (round-trip serialization with parquet making use of a spatial index) are perfectly valid (and realistic) goals for GeoPandas as well (it's also what was written in the SBIR project, and more or less what the POC in https://github.com/mrocklin/dask-geopandas/ is doing, partially). Moreover, I think implementations of those could be rather ignorant of the actual representation of the geometries under the hood.
So I would love to see or explore if we can build common infrastructure for this, instead of each of the projects doing this themselves.
One possible way (though I need to think this through more) is to make the dask extensions, IO functionality, etc. work with different kinds of GeometryArrays (if they can rely on the public methods of those, that might be possible). But I will think about this aspect some more.
Just to clarify, goal 4 is only in service of the other goals -- anything we work on needs to have some sort of persistence/serialization format, so that it's practical to work with, and the goal is not to provide one (if at all possible), just for there to be one. So it would be ideal if there is such a format available that both GeoPandas and this non-GeoPandas code can both use.
In any case, I'm very excited about the potential for having algorithms, persistence formats, etc. that can work with both GEOS and a Python-native, Numba-izable object, which would be amazing if we can achieve that.
> But goal 3 (dask extension to have spatially partitioned DataFrames) and 4 (round-trip serialization with parquet making use of spatial index) are perfectly valid (and realistic) goals for GeoPandas as well...
Thanks, yeah that certainly makes sense. Is there consensus toward what would make sense as a parquet serialization encoding for geopandas?
> (it's also what was written in the SBIR project, and more or less what the POC in https://github.com/mrocklin/dask-geopandas/ is doing (partially)). And moreover, I think implementations of those, could be rather ignorant of the actual representation of the geometries under the hood... So I would love to see or explore if we can build common infrastructure for this, instead of each of the projects doing this themselves.
ok, great. I'll take a look at dask-geopandas to get a sense of how things are structured there. At the Dask level, I think the main thing that is needed is for the Dask DataFrame geometry columns to have metadata containing the bounding box of each partition.
It might be nice if there were some general concept of ExtensionArray metadata that the Dask-level ExtensionArray would have access to for each of its partitions. Ideally, there would be a standard way to store this metadata in the parquet partitions so that it would automatically be persisted when using the standard dask parquet functions.
> Is there consensus toward what would make sense as a parquet serialization encoding for geopandas?
For geopandas we are currently working on parquet support, using WKB (https://github.com/geopandas/geopandas/pull/1180/). That is different from nested lists, but I think with good metadata it should be possible to support different formats.
> ok, great. I'll take a look at dask-geopandas to get a sense of how things are structured there. At the Dask level, I think the main thing that is needed is for the Dask DataFrame geometry columns to have metadata containing the bounding box of each partition.
Be aware that the dask-geopandas repo is outdated (so it probably won't work with current dask), and was only a small proof of concept. But I think the main question here is how to extend dask with the spatial partitioning information. I was under the assumption that that would require a subclass (see also https://github.com/dask/dask/issues/5438). Currently dask dataframes use the index for partition bounds, but a subclass could override that with spatial bounds. Whether it is possible to add this as metadata only to the geometry columns and have this information available in the algorithms (which would avoid the need for a subclass), that I don't know.
@jonmmease, I've updated your initial comment on this issue to put checkboxes on all of the features that I think have now been implemented, and unticked checkboxes on all those not yet implemented. If you get a chance, it would be great if you could skim through that list and check my guesses, to see if I got any wrong that you can update.
Looks good, I updated a couple check boxes:
- `MultiPoint`/`MultiPointArray` are supported
- `bounds` is supported

I didn't make any changes for this above, but I didn't end up using the data frame accessor paradigm. It turned out to be cleaner to use subclasses the way GeoPandas does. In the beginning, I didn't realize how much support pandas and dask.dataframe already had for custom subclasses. So this ended up being easier, and resulted in a more GeoPandas-compatible API.
This project looks very interesting!
On a related topic, @willirath and I started working on xoak, which extends xarray with spatial indexes (currently based on scikit-learn's `BallTree`, as a proof of concept).
From what I've read here, we have many goals in common, i.e., domain-agnostic features and package dependencies (only PyData core libraries like Numba and Dask), support parallel/distributed indexing with Dask (see https://github.com/ESM-VFC/xoak/issues/1#issuecomment-650084718 for a concrete use case).
The `rtree` index implemented in `spatialpandas` thus certainly has a lot of value outside of pandas, and I'd be very excited to contribute to a standalone library that could be reused by xarray as well. Supporting such indexes is part of xarray's development roadmap, which we are going to tackle very soon. More specifically, it would be great to have a more flexible implementation like in scikit-learn, where `BallTree` and `KDTree` are built on top of a common, static binary tree, but here using Numba rather than Cython (although I don't know whether it would be easy to do this with Numba).
Hi @benbovy, I'm not so involved with this project any more, but I think it would be great to pull out the Hilbert R-tree implementation into a standalone package if you would find it useful! Have you looked through the implementation, and does the approach make sense to you? We don't have any specific documentation for it at this point, but it's well tested against the `rtree` package, and I think the API and code are fairly well documented through docstrings and inline comments.
I won't be able to help out much with actually standing up a new package for it, but I'm happy to answer any questions that come up.
I'm the de facto owner of spatialpandas right now, but only reluctantly, in that it fills a key unmet need for our HoloViz.org visualization toolchains but I think it really should have a life outside of that purpose and I'd love someone else to run with it from here. In the meantime, I would be very happy to see any of its code disappear into any well-maintained library that we could then simply import, assuming that such a library only depends on the core PyData libraries (xarray, pandas, numpy, Numba, Dask, etc.).
I tend to agree with @jorisvandenbossche's comment
> Goals 1 (ragged-array based extension arrays) and 2 (numba-based algorithms) are IMO distinct goals for a project like this.
I'm wondering if something like Awkward wouldn't help eventually. I've been watching the SciPy 2020 presentation and was pretty amazed, especially when seeing how it integrates with numba.
Awkward is still in heavy development and I don't know how it will evolve in the future, but IMO it has much potential to be reused for goal 1 without the need to maintain independent implementations of numpy/numba-compatible ragged arrays. I think it could even provide a common basis for supporting spatial structures in both (geo)pandas and xarray (see, e.g., https://github.com/scikit-hep/awkward-1.0/issues/350 and https://github.com/scikit-hep/awkward-1.0/issues/27#issuecomment-665526222).
Regarding goal 2, I'd love to see a flexible numba implementation of spatial indexes in a well-maintained, separate library. I think that some people coming from the xarray side (including me) would be willing to contribute on a regular basis. Here too, Awkward's numba integration might help if we need to deal with more complex (dynamic) data structures than simple arrays.
> Have you looked through the implementation, and does the approach make sense to you?
I had a brief look at it. I haven't worked extensively on the implementation of such algorithms yet, but overall the code seems already pretty clear and well documented to me, thanks for that @jonmmease!
> I won't be able to help out much with actually standing up a new package for it, but I'm happy to answer any questions that come up.
Will do if needed, thanks!
@jonmmease @benbovy I really like the roadmap here! I'm a datashader contributor and maintainer of xarray-spatial. Xarray-Spatial is looking to add a numba-ized version of GDAL's `polygonize`, and we're targeting `spatialpandas.PolygonArray` as the return type. My hope is that I'll get up to speed with spatialpandas internals and then can actually help with contributions.

Keep me in mind as a contributor; happy to chat features / do zooms.
@benbovy , Awkward is really cool! We looked closely into it when developing spatialpandas, including discussions with Jim Pivarski in January. @jonmmease probably remembers the details better, but my own (sketchy) recollection was that it was very promising but not quite ready to do what we needed at the time. Note that spatialpandas already does rely on Numba, so I don't think there's necessarily any special benefit to using Awkward for that particular purpose.
> We looked closely into it when developing spatialpandas, including discussions with Jim Pivarski in January. @jonmmease probably remembers the details better...
I think it was mostly a matter of what was already supported in Awkward as of last fall, when I started working on spatialpandas. In particular, we were looking for native Arrow/Parquet serialization, which it looks like was recently added to Awkward (https://github.com/scikit-hep/awkward-1.0/pull/343).
I didn't get far enough into researching Awkward to understand whether it's possible to use a schema to define the number of allowed nesting levels. There would also be the question of how to distinguish geometry types that share identical representations (e.g. `MultiLine` and `Polygon`).
Most of the complexity in spatialpandas that isn't directly related to geometric algorithms is in the geometry extension array implementations, so it would definitely be a win in terms of reduced code complexity if Awkward could be adopted.
Sorry to hijack this thread, but thought to give a heads up here since several potentially interested people might follow this one (also cc @brl0).
During the upcoming Dask Summit, we have a 2-hour workshop scheduled about scaling geospatial vector data on Thursday May 20th at 11-13:00 UTC (https://summit.dask.org/schedule/presentation/22/scaling-geospatial-vector-data/).
First of all: you're of course all welcome to attend/participate! But my specific question: it would be interesting to have someone from spatialpandas (or someone familiar with spatialpandas) give a brief presentation about it. There is quite some potential overlap with dask-geopandas (in goals and implementation), and I think it would be interesting to discuss potential synergies and better compatibility between the two.
We can also follow-up on https://github.com/geopandas/community/issues/4 (general issue about the workshop).
@jorisvandenbossche, I don't know what @jbednar's current thinking is regarding spatialpandas, but I don't plan to personally work on it any further.
That said, I'd be happy to summarize what's been done in case there are any ideas worth learning from / stealing. I think the main idea that would be relevant to the Dask discussion would be the use of a space filling curve and R-tree to make it possible to use the regular Dask DataFrame partition index as a spatial index as well (rather than creating a whole new Dask data structure).
Does that sound good?
I think spatialpandas is great, but at Anaconda we are only planning to take it forward if we get specific funding to do so or if it's required for a specific project. Otherwise we just plan to fix bugs in it when that's required. I'd hope @brl0 would be able to push it forward. We are also looking to Awkward for possible implementations of ragged arrays that Datashader can use.
I'd love for any elements of SpatialPandas that are useful to GeoPandas or Dask-GeoPandas to migrate to one of those projects, but note that for our purposes as maintainers of general-purpose software we cannot depend on fiona or gdal, and thus as far as I know cannot simply delete SpatialPandas in favor of Dask-GeoPandas even if Dask-GeoPandas subsumed SpatialPandas functionality.
I'll try to attend the workshop, but @jonmmease knows a lot more about the internals, so I would just discuss such scoping and roadmap issues.
@jonmmease indeed, the spatial (re)partitioning concepts of spatialpandas are most relevant I think for the discussion, so a brief overview of the items you list would be great. Thanks a lot in advance!
> note that for our purposes as maintainers of general-purpose software we cannot depend on fiona or gdal
Fiona/gdal is not a hard requirement for GeoPandas; you only need it if you want to use it for IO. (This has always been the case, but it's only since the last release that we actually stopped importing fiona on geopandas import, which created this seemingly hard dependency, and now only import it when it is being used.) The only non-Python required dependency is GEOS, which is a simple C++ library without any external dependencies (I don't think it's a harder dependency than e.g. numpy).
For us "dependency" is defined both in terms of "will the software be usable by us for what we want to do, without that package" and "does either the pip or the conda package depend on that"? If GeoPandas can make both of those dependencies go away, I'd love to see some reunification here!
One thing that @jorisvandenbossche and I talked about a while back is whether it would be practical for GeoPandas to define an interface for the "Geometry Column" so that something like the spatialpandas contiguous memory representation could be used as an alternative to the default array of GEOS/Shapely objects representation.
I believe this functionality, combined with dask-geopandas and the removal of fiona/gdal as hard dependencies, is everything that would be needed for Datashader to be able to switch from spatialpandas to GeoPandas. And maybe in that scenario, spatialpandas shrinks down to only consist of the implementation of the Geometry Column interface.
And then we can sneak that alternative Geometry Column interface as an option in GeoPandas somewhere and our work here is done! Oh, did I say that out loud? :-)
spatialpandas
Pandas and Dask extensions for vectorized spatial and geometric operations.
This proposal is a plan towards extracting the functionality of the spatial/geometric utilities developed in Datashader into this separate, general-purpose library. Some of the functionality here has now (as of mid-2020) been implemented as marked below, but much remains to be done!
Goals

This project has several goals:

- Pandas extension arrays for each geometry type (`Points`, `Polygons`, etc.). Unlike the GeoPandas GeoSeries, there will be a separate extension array for each geometry type (`PointsArray`, `PolygonsArray`, etc.), and the underlying representation for the entire array will be stored in a single contiguous ragged array that is suitable for fast processing using numba.
- A spatially partitioned Dask DataFrame. This would take the place of Datashader's `SpatialPointsFrame` and would support all geometry types, not only points.

These features will make it very efficient for Datashader to process large geometry collections. They will also make it more efficient for HoloViews to perform linked selection on large spatially distributed datasets.
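The ragged representation in the first goal can be illustrated with a flat coordinate buffer plus an offsets array. This is a hypothetical sketch of the idea, not the actual spatialpandas internals (numpy assumed):

```python
import numpy as np

# Sketch: a ragged line array. Every geometry's coordinates live in one
# contiguous buffer; `offsets` marks where each geometry starts and ends.
# Kernels over (flat, offsets) touch only flat arrays, which is exactly
# the shape of data numba's nopython mode compiles well.

# Two lines, [(0,0),(3,4)] and [(0,0),(1,0),(1,1)], as interleaved x,y coords:
flat = np.array([0., 0., 3., 4., 0., 0., 1., 0., 1., 1.])
offsets = np.array([0, 4, 10])  # geometry i is flat[offsets[i]:offsets[i+1]]

def line_lengths(flat, offsets):
    """Total polyline length of each geometry in the ragged array."""
    out = np.zeros(len(offsets) - 1)
    for i in range(len(offsets) - 1):
        start, stop = offsets[i], offsets[i + 1]
        for j in range(start + 2, stop, 2):
            dx = flat[j] - flat[j - 2]
            dy = flat[j + 1] - flat[j - 1]
            out[i] += np.hypot(dx, dy)
    return out
```

A function like `line_lengths` could be decorated with `@numba.njit` unchanged, since it only indexes into contiguous numpy arrays.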
Non-goals
spatialpandas will be focused on geometry only, not geography. As such:
Features

Extension Arrays

The `spatialpandas.geometry` package will contain geometry classes and pandas extension arrays for each geometry type:

- [x] `Point` class.
- [x] `Points` class.
- [x] `LineString` and `MultiLineString` classes.
- [x] `LinearRing` class.
- [x] `Polygon`/`MultiPolygon` classes.

Spatial Index

- [x] The `spatialpandas.spatialindex` module will contain a vectorized and parallel numba implementation of a Hilbert R-tree.
- [x] Each extension array has an `sindex` property that holds a lazily generated spatial index.

Extension Array Geometric Operations
The extension arrays above will have methods/properties for shapely-compatible geometric operations. These are implemented as parallel vectorized numba functions. Supportable operations include:
Only a minimal subset of these will be implemented by the core developers, but others can be added relatively easily by users and other contributors by starting with one of the implemented methods as an example, then adding code from other published libraries (but Numba-ized and Dask-ified if possible!).
Pandas accessors

- A custom pandas Series accessor is included to expose these geometric operations at the Series level. E.g. `df.column.spatial.area` returns a pandas Series with the same index as `df`, containing area values; `df.column.spatial.cx` is a geopandas-style spatial indexer for filtering a series spatially using a spatial index.
- A custom pandas DataFrame accessor is included to track the current "active" geometry column for a DataFrame, and to provide DataFrame-level operations. E.g. `df.spatial.cx` will filter a dataframe spatially based on the current active geometry column.

Dask accessors

- A custom Dask Series accessor is included to expose geometric operations on a Dask geometry Series. E.g. `ddf.column.spatial.area` returns a dask Series with the same index as `ddf`, containing area values.
- [x] The accessor also holds a spatial index of the bounding box of the shapes in each partition. This allows spatial operations (e.g. `cx`, spatial join) to skip processing entire partitions that will not contain relevant geometries.
- [x] A custom dask DataFrame accessor is included that is exactly analogous to the pandas version.
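The partition-skipping step described above boils down to comparing the query rectangle against each partition's bounding box before touching any data. A standalone sketch of that pruning logic (names like `partitions_to_scan` are illustrative, not spatialpandas API):

```python
# Sketch: partition pruning. Each Dask partition carries the bounding box
# of its geometries; a spatial query only loads partitions whose bounds
# intersect the query rectangle.

def boxes_intersect(a, b):
    """Boxes are (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def partitions_to_scan(partition_bounds, query_box):
    """Indices of partitions that may contain geometries inside query_box."""
    return [i for i, pb in enumerate(partition_bounds)
            if boxes_intersect(pb, query_box)]
```

For example, with three partition bounding boxes and a small query rectangle, only the partitions whose boxes overlap the query need to be read at all; the rest are skipped without any IO.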
Conversions
Parquet support

- [x] read/to parquet for pandas DataFrames will be able to rely on the standard pandas parquet functions with extension array support.
- [x] Special read/to parquet functions will be provided for Dask DataFrames.
- [x] `to_parquet` will add extra metadata to the parquet file to specify the geometric bounds of each partition. There will also be the option for `to_parquet` to use Hilbert R-tree packing to optimize the partitions for later spatial access on read.
- [x] `read_parquet` will read this partition bounding-box metadata and use it to pre-populate the spatial accessor for each geometry column with a partition-level spatial index.

Compatibility
For example, `geodf.cx` becomes `spdf.spatial.cx`.
.Testing