pysal / tobler

Spatial interpolation, Dasymetric Mapping, & Change of Support
https://pysal.org/tobler
BSD 3-Clause "New" or "Revised" License

`numba`/parallel implementation of `_area_tables_binning` #104

Closed by darribas 3 years ago

darribas commented 3 years ago

I have done a first pass at trying to accelerate areal interpolation with numba and multi-core implementation. Results are on a branch of my fork, the gist is here:

https://github.com/darribas/tobler/blob/numba_binning/numba_tests.ipynb

If you look over the timings, performance really depends on the characteristics of the transfer: the number of source vs. target polygons and, perhaps, the number of intersections to be performed.

Here are some thoughts to be further explored/discussed:

I think the results are encouraging, but I also think we need to explore them further to come up with the improvement that'll give us the most bang for the buck.

Any sort of feedback, comments, or collaboration most welcome, thanks!

cc @sjsrey @knaaptime @martinfleis

sjsrey commented 3 years ago

This is a fantastic start!

A few thoughts:

darribas commented 3 years ago

Alright, quick update on this front: @martinfleis figured out that relying on vanilla geopandas, which now pulls the super fast STRtree spatial index from pygeos, is a lot faster. After a few tests and comparisons, this is ready to merge in #110 as far as Martin and I are concerned.

Now, this speeds things up on a purely single-core strategy. We are working on figuring out the logic and the machinery to parallelise some of the steps in the allocation, which are very doable. To document where we are right now:

  1. There are two key steps that are "embarrassingly parallel":
    1. Once a spatial index is built on one table, "bulk query" the other table
    2. Once you know which source-target polygons intersect from 1., compute the area of each intersection
  2. Relying on the master versions of pygeos and geopandas, I have a working implementation of 1.ii in the branch linked above that uses the same logic we built for esda.crand and seems to work very well. The master versions are important because pygeos did not allow pickling of GeometryArray objects until very recently.
  3. For 1.i, @martinfleis has some ideas, but they rely on the STRtree object being picklable, which it currently is not. Martin is inquiring about this with the geopandas/pygeos teams (geopandas#1746) to see if it's an option and to move forward in that direction.
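The two "embarrassingly parallel" steps in 1. can be sketched single-core with shapely 2's `STRtree` (the same index geopandas exposes via `.sindex`); the geometries here are a hypothetical toy grid standing in for the real source/target tables:

```python
import numpy as np
from shapely import STRtree, area, box, intersection

# Toy stand-ins for the source and target tables.
sources = np.array([box(0, 0, 2, 2), box(2, 0, 4, 2)])       # two 2x2 squares
targets = np.array([box(i, 0, i + 1, 2) for i in range(4)])  # four 1x2 strips

# Step 1.i: build a spatial index on one table, then bulk-query the other.
tree = STRtree(targets)
src_idx, tgt_idx = tree.query(sources, predicate="intersects")

# Step 1.ii: for each intersecting pair, compute the area of the intersection.
areas = area(intersection(sources[src_idx], targets[tgt_idx]))
```

Both steps operate on independent (source, target) pairs, which is why each can in principle be split into chunks and farmed out to workers.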

This leaves us with what to do for the next PySAL release. A few options:

  1. We merge #110 and release only single core Jan'21
  2. We merge #110 and work on an extension of that patch that uses multi-core for 1.ii, both released in Jan'21
  3. We merge #110 and work on an extension of that patch that uses multi-core for 1.i and 1.ii, both released in Jan'21
  4. We leave 3. (and maybe 2.) for July'21

Given that 2. and 3. rely on bleeding-edge versions, my suggestion would be to push them as "experimental", with good documentation (default always to single core and make the requirements for multi-core explicit). It is only a matter of time until what 1.ii needs is officially released, so going forward things will only get easier. At the same time, pushing them out allows us to test the features (Martin and I could really use them for a current project) and figure out the best way to get them out of the "experimental" stage in July'21.

Now, 1.i is not as straightforward, as we don't know yet whether it'll be possible to pickle spatial indices. If it is, however, the resulting version of area interpolation would be very performant and able to take advantage of modern hardware.
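Whether 1.i can be parallelised hinges on exactly that pickling question; a minimal probe (using shapely's `STRtree` as a stand-in for the index geopandas wraps) might look like:

```python
import pickle
from shapely import STRtree, box

def is_picklable(obj):
    """Round-trip obj through pickle and report whether it survives."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:
        return False

# A tiny index; if this returns False, workers cannot receive a pre-built
# tree and each process has to rebuild the index from the raw geometries.
tree_ok = is_picklable(STRtree([box(0, 0, 1, 1)]))
```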

What do you think?

knaaptime commented 3 years ago

if #110 is good to go, i say we merge it and cut another release. If 1.i and 1.ii are ready by the end of Jan, let's do another release so they can 'officially' make it into the meta release. With the way the metapackage is structured now, it doesn't actually matter a ton: I'm happy to keep cutting new releases of tobler as these features mature, and any fresh install of the metapackage will have the latest version of tobler.

darribas commented 3 years ago

Just to centralise discussion: #110 is merged, and the parallel version is being worked on in #112.

darribas commented 3 years ago

UPDATE: #112 is ready to go as far as @martinfleis and I are concerned. The parallel implementation does not scale linearly, but it does show an important speed-up (roughly 5x faster on 16 cores). There are tests, and it's hooked into the interpolation method via an n_jobs parameter that dispatches to single or multi-core. This does need edge versions of pygeos and geopandas (currently released ones do not let you pickle GeometryArrays), but since the default is n_jobs=1, it should not break anything.
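As a rough sketch of that dispatch idea (not the actual #112 code), the pair-wise intersection step can be chunked and handed to joblib workers, with n_jobs=1 falling back to the plain single-core path; the geometries below are hypothetical toys:

```python
import numpy as np
from joblib import Parallel, delayed
from shapely import area, box, intersection

# Toy (source, target) candidate pairs, as produced by a bulk query.
left = np.array([box(0, 0, i + 1, 1) for i in range(8)])
right = np.array([box(0, 0, 1, 1)] * 8)

def chunk_areas(lo, hi):
    # Each worker handles one independent slice of the pairs.
    return area(intersection(left[lo:hi], right[lo:hi]))

def pair_areas(n_jobs=1):
    if n_jobs == 1:  # default stays single-core, so nothing breaks
        return chunk_areas(0, len(left))
    bounds = np.linspace(0, len(left), n_jobs + 1, dtype=int)
    chunks = Parallel(n_jobs=n_jobs)(
        delayed(chunk_areas)(lo, hi) for lo, hi in zip(bounds[:-1], bounds[1:])
    )
    return np.concatenate(chunks)
```

The single- and multi-core paths return identical results; only the wall time differs.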

My suggestion would be to add it to the new release with a warning that it's experimental and a note on the version requirements. This will allow folks who want to take it for a spin to easily play with it and hopefully surface any issues. Once pygeos and geopandas releases catch up, we can potentially make multi-core the default.

What do you think @knaaptime / @sjsrey ?

knaaptime commented 3 years ago

should we close this @darribas ?

darribas commented 3 years ago

Yep! It was an interesting exploration while it lasted :laughing:

martinfleis commented 3 years ago

Hey, just for the record, we played a bit more with this and have some new findings. All below is based on the LSOA geometry of England and Wales (34k detailed polygons as the target) and CORINE Land Cover polygons (54k in total; some very large, some small, some not overlapping any LSOA).

Profiling of a single-core implementation:

%time ids_src, ids_tgt = lsoa.sindex.query_bulk(clc2000.geometry, predicate="intersects")
%time ints = clc2000.geometry.values[ids_src].intersection(lsoa.geometry.values[ids_tgt])
%time areas = ints.area

CPU times: user 9.38 s, sys: 131 ms, total: 9.51 s
Wall time: 9.51 s
CPU times: user 1min 33s, sys: 9.51 ms, total: 1min 33s
Wall time: 1min 33s
CPU times: user 161 ms, sys: 0 ns, total: 161 ms
Wall time: 160 ms

You can see that the major bottleneck is intersection.

Profiling of the same steps using joblib parallel implementation:

spatial indexing:
Wall time: 23.5 s

intersection + area:
Wall time: 1min

Surprisingly, it did not make much difference whether I used 2 or 16 workers.

Finally, I did an experiment with dask-geopandas, which is by far the fastest and cleanest option at the moment.

%%time
# gpd is geopandas, dgpd is dask_geopandas
ids_src, ids_tgt = lsoa.sindex.query_bulk(clc2000.geometry, predicate="intersects")
df = gpd.GeoDataFrame({'clc': clc2000.geometry.values[ids_src], 'lsoa': lsoa.geometry.values[ids_tgt]})
ddf = dgpd.from_geopandas(df, npartitions=1024)
areas = ddf.clc.intersection(ddf.lsoa).area.compute()

CPU times: user 2min 56s, sys: 805 ms, total: 2min 56s
Wall time: 19.4 s

This is not ready to be merged into tobler since dask-geopandas is still experimental and not released. However, there are plans to add distributed spatial indexing there, so in theory it can squeeze the time even below this.

I generally believe that dask and dask-geopandas are the best way forward for parallelization at the moment, and we should put effort there instead of creating ad-hoc solutions for each application.