pysal / tobler

Spatial interpolation, Dasymetric Mapping, & Change of Support
https://pysal.org/tobler
BSD 3-Clause "New" or "Revised" License

issues with dask interpolate #185

Open knaaptime opened 10 months ago

knaaptime commented 10 months ago

I've finally had some time to take a closer look at the dask backend, so I'm starting a thread here to triage some thoughts. In this gist I'm looking at timing for some pretty big datasets (source = 2.7m, target = 137k). tl;dr:

So this is really nice when you've got a large enough dataset, and it will be great to get it fleshed out for the other variable types. Now that I've interacted with the code a little more:

knaaptime commented 10 months ago

A couple of updates after looking at this a bit.

On the same dataset in the gist, _area_tables_binning takes 4 minutes, which confirms that building the area-of-intersection sparse matrix is what consumes all the time when n gets large (so yeah, we need to think about dask as a replacement for joblib in _area_tables_binning_parallel; the operations on the resulting sparse matrix should all be fast).
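For context, the expensive step amounts to something like the following (a simplified sketch, not tobler's actual implementation; it assumes geopandas' array-based sindex.query and shapely 2.x):

```python
import shapely
from scipy.sparse import coo_matrix


def aoi_table_sketch(source_df, target_df):
    """Simplified sketch: build the (n_source x n_target) sparse matrix whose
    (i, j) entry is the intersection area of source i and target j."""
    # candidate pairs whose bounding boxes overlap, via the spatial index
    src_idx, tgt_idx = target_df.sindex.query(
        source_df.geometry, predicate="intersects"
    )
    # the expensive part: one polygon intersection per candidate pair
    areas = shapely.area(
        shapely.intersection(
            source_df.geometry.values.take(src_idx),
            target_df.geometry.values.take(tgt_idx),
        )
    )
    return coo_matrix(
        (areas, (src_idx, tgt_idx)), shape=(len(source_df), len(target_df))
    )
```

Everything downstream is just (cheap) linear algebra on that matrix.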

It's still not clear to me why extensive and intensive aren't working, because we should be able to aggregate the results from area_tables_interpolate from the distributed dask dataframes the same way as we do for categorical. The only difference between categorical and intensive/extensive in the single-core implementation is that categorical just does a row-sum across the table (masked by the different categorical values), whereas extensive/intensive first do a row-transform, then a dot product with the extensive/intensive variables. Since the intersection table is correct (otherwise categorical couldn't be correct), it's something about the math that's off, and my best guess is the row-transform (maybe we're missing important neighbors in other partitions?). I dunno.
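For reference, the single-core math being compared is roughly this (a schematic sketch that glosses over the different denominator options in the package; `table` is the source-by-target intersection-area matrix from above, and the variable arrays live on the source):

```python
import numpy as np


def categorical_share(table, cats, category):
    """Categorical: a masked sum over the table -- the area of the given
    category landing in each target."""
    mask = (cats == category).astype(float)
    return table.T @ mask


def extensive_estimate(table, ext):
    """Extensive: row-standardise by each source's total intersected area,
    then take a dot product with the extensive variable.
    (Assumes every source intersects at least one target, so no zero rows.)"""
    row_sums = np.asarray(table.sum(axis=1)).ravel()
    weights = table.multiply(1.0 / row_sums[:, None])  # each row sums to 1
    return weights.T @ ext                              # reallocate to targets
```

The masked sum only needs the rows a worker already has, but the row-standardisation needs the full row sums, which is exactly where chunking could go wrong.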

Either way, if we use dask to distribute building the values of the AoI table (i.e., replace _area_tables_binning_parallel, moving the parallelization down a level from where it is now), I think it would solve that issue anyway. Currently, we get a speedup with dask because we're breaking the geodataframes into spatial partitions, building several AoI tables, operating on them, and aggregating the results. But I think we need to (use dask to) break the pieces apart to build a single 'table' (sparse AoI matrix), then resume business as usual. Like with the new libpysal.Graph, I think the way to handle this may be to build the AoI table as an adjacency list in the distributed dask dataframe, then bring the chunks back, pivot to sparse, and send the table along.
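Something along these lines is what I have in mind (a rough, hypothetical sketch: the function names are placeholders, target_df is naively broadcast to every worker, and the ids are assumed to be integer positions):

```python
import pandas as pd
import shapely
from scipy.sparse import coo_matrix


def aoi_adjlist_for_partition(src_part, target_df):
    """Per-partition worker: return an adjacency list
    (source_id, target_id, area) for one chunk of source geometries."""
    src_idx, tgt_idx = target_df.sindex.query(
        src_part.geometry, predicate="intersects"
    )
    areas = shapely.area(
        shapely.intersection(
            src_part.geometry.values.take(src_idx),
            target_df.geometry.values.take(tgt_idx),
        )
    )
    return pd.DataFrame(
        {
            # assumes both frames carry a plain positional integer index
            "source_id": src_part.index.values.take(src_idx),
            "target_id": target_df.index.values.take(tgt_idx),
            "area": areas,
        }
    )


def build_aoi_sparse(source_ddf, target_df, n_source):
    """Gather the distributed adjacency lists and pivot to one sparse table,
    then hand that back to the existing single-machine code path."""
    adj = source_ddf.map_partitions(
        aoi_adjlist_for_partition,
        target_df,
        meta={"source_id": "i8", "target_id": "i8", "area": "f8"},
    ).compute()
    return coo_matrix(
        (adj["area"], (adj["source_id"], adj["target_id"])),
        shape=(n_source, len(target_df)),
    )
```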

What's the purpose of category_vars? The categorical variables should already come back formatted like that, so this block is basically just doing the same thing but forcing a categorical dtype instead of using unique.

Ok, I see: dask needs the metadata ahead of time, so we do need to create the new columns up front. But I'm still not sure we need to cast as categorical instead of looking at the unique values and allowing string type?
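For anyone following along, this is the dask constraint in question: lazy operations need their output schema (meta) declared before anything is computed, but that can be built from the unique string values without casting to categorical. A minimal, made-up sketch (the column names and the expand step are illustrative, not tobler's code):

```python
import pandas as pd
import dask.dataframe as dd

# toy source data with a string-typed categorical column
pdf = pd.DataFrame(
    {"land_use": ["res", "com", "res", "ind"], "area": [1.0, 2.0, 3.0, 4.0]}
)
ddf = dd.from_pandas(pdf, npartitions=2)

# the per-category columns we intend to create, known ahead of time from unique()
category_values = sorted(pdf["land_use"].unique())


def expand_categories(df):
    out = df.copy()
    for cat in category_values:
        # one indicator column per category, weighted by area
        out[f"land_use_{cat}"] = (df["land_use"] == cat) * df["area"]
    return out


# dask's `meta` just needs the output columns and dtypes; an empty frame run
# through the same function does the job, no categorical dtype required
meta = expand_categories(pdf.head(0))
result = ddf.map_partitions(expand_categories, meta=meta).compute()
```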

knaaptime commented 10 months ago

cc @martinfleis @ljwolf @sjsrey in case anyone else has thoughts

darribas commented 10 months ago

Thanks very much for chipping in @knaaptime !!!

A few thoughts and follow-ups:

knaaptime commented 10 months ago

> I'm also still not 100% sure why (at least extensive) doesn't work, but after looking at the single-core code, my conclusion was that the weighting done when row standardising (L. 316-321) requires knowing all the values, so if we weight on a chunk-by-chunk basis within each worker, those weights will be off. My conclusion then was that, if that is the case, we're a bit screwed for making it run fast on Dask, because we then have to re-write the algo in Dask rather than use it as a scheduler of chunks.

Exactly. So we reimplement area_tables_binning using dask as the backend instead of joblib, but that's basically just a matter of moving the logic over (as you've already done in #187).
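For reference, a toy illustration of the row-standardisation problem being described (made-up numbers):

```python
import numpy as np

# intersection areas between one source polygon and three target polygons,
# where the targets happen to fall in two different spatial partitions
areas_part_1 = np.array([2.0, 3.0])  # targets handled by worker 1
areas_part_2 = np.array([5.0])       # target handled by worker 2

# weights computed chunk-by-chunk: each worker normalises by what it can see
w1 = areas_part_1 / areas_part_1.sum()  # [0.4, 0.6]
w2 = areas_part_2 / areas_part_2.sum()  # [1.0] -> the source gets "spent" twice

# weights computed with the full row, as in the single-core code
all_areas = np.concatenate([areas_part_1, areas_part_2])
w_global = all_areas / all_areas.sum()  # [0.2, 0.3, 0.5] -> sums to 1.0
```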

> Having said that, I may be wrong, or may not be seeing a bit that allows us to still speed up the single-machine implementation, or I may just be wrong in the diagnosis and there's still a way of interpolating extensive/intensive vars fully on a chunk-by-chunk basis.

I don't think we want to interpolate the chunks. The tests make clear that all the expensive computation is building the adjacency table, while multiplying through it is basically costless. Conversely, it is very expensive to try and send the multiplication through the distributed chunks. So in #187, let's just return the adjacency list, convert it to a (pandas, not dask) multiindex series with the sparse dtype, then dump that out to sparse. Then we just pick up from right here, having already handled the expensive part.
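A minimal sketch of that conversion, assuming the gathered adjacency list has source/target id columns and an area column (names are illustrative):

```python
import pandas as pd

# adjacency list gathered back from the dask partitions (toy values)
adj = pd.DataFrame(
    {
        "source_id": [0, 0, 1, 2],
        "target_id": [0, 1, 1, 2],
        "area": [2.0, 3.0, 5.0, 4.0],
    }
)

# pandas (not dask) series with a MultiIndex and the sparse dtype
s = (
    adj.set_index(["source_id", "target_id"])["area"]
    .astype(pd.SparseDtype(float, fill_value=0.0))
)

# dump out to a scipy COO matrix and resume the usual single-machine code path
# (in practice the shape would be aligned with the full source/target frames)
table, rows, cols = s.sparse.to_coo(
    row_levels=["source_id"], column_levels=["target_id"], sort_labels=True
)
```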

knaaptime commented 6 months ago

@darribas can you take a look at what's going wrong with the dask interpolator? I want to cut a new release this month but don't want to do so until everything is passing. If you don't have time, I'll make the dask stuff private until it's ready.