pysal / tobler

Spatial interpolation, Dasymetric Mapping, & Change of Support
https://pysal.org/tobler
BSD 3-Clause "New" or "Revised" License

[WIP] Area-weighted interpolation in Dask #180

Closed · darribas closed this 10 months ago

darribas commented 11 months ago

This PR adds functionality to run area-weighted interpolation (AWI) through Dask. The code heavily leverages dask-geopandas (learning a lot from their distributed spatial join) and makes it possible to run AWI out-of-core (i.e., larger-than-memory datasets), distributed (i.e., across different cores or machines), and in parallel (with a significantly more efficient algorithm than the current multi-core implementation).
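To give a flavour of the intended usage, the entry point looks roughly like this (file names, column names, and the exact signature are illustrative; the notebook is the authoritative reference):

```python
import geopandas
import dask_geopandas
from tobler.area_weighted import area_interpolate_dask

# Source and target geometries (placeholder file names)
sources = geopandas.read_file("land_use.gpkg")
targets = geopandas.read_file("census_tracts.gpkg")

# Partition both tables and shuffle them spatially so geometries that are
# close in space end up in the same chunk
dsources = dask_geopandas.from_geopandas(sources, npartitions=16).spatial_shuffle()
dtargets = dask_geopandas.from_geopandas(targets, npartitions=16).spatial_shuffle()

# Transfer the categorical column from sources to targets; depending on the
# final API, the result may be lazy and need a .compute() to materialise
out = area_interpolate_dask(
    dsources, dtargets, "tract_id", categorical_variables=["land_use"]
)
```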

A notebook is provided (for now in the repository root; if this gets merged, it should be moved to notebooks/) that demonstrates how the method works, its correctness, and its performance.

I'd also like to use this PR conversation to discuss whether you think this is the right place for this contribution, and whether you'd change anything about the API or the location of the code.

NOTE: currently, only interpolation of categorical variables is implemented. Unless someone can provide an implementation for extensive/intensive variables quickly, my preference would be that this does not block merging (and possibly cutting a small release). Categoricals are important, for example, to interpolate rasters (e.g., land use), and having the functionality out in the wild would help it get tested. I still need to think through (and would welcome thoughts on!) how intensive/extensive variables should be handled in this framework. I think it needs a second step of reweighting/interpolation when collecting chunks from the workers: for categoricals, this step is simply adding up all the values for a given target geometry, but this is not the case for intensive/extensive.
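As a toy illustration of why categoricals are the easy case, the collect step boils down to a plain groupby-sum (this is illustrative, not the PR's code):

```python
import pandas as pd

# Each worker returns, for its chunk, the area of every target geometry
# covered by each category; the same target can appear in several chunks
chunks = pd.DataFrame(
    {
        "target_id": ["A", "A", "B", "B"],
        "residential": [0.2, 0.3, 0.6, 0.1],
        "industrial": [0.5, 0.0, 0.2, 0.1],
    }
)

# Collecting categoricals: just add up the partial areas per target
collected = chunks.groupby("target_id").sum()

# For extensive/intensive variables this sum is not enough: the partial
# results would need reweighting by the intersection areas computed on
# each worker before they can be combined
```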

Before merging, the following (at least) needs to be ticked off:

I'm happy to push through as the discussion evolves to add the remaining bits. I'd love to see this merged.

knaaptime commented 11 months ago

Sweet, I'll take a look. A couple of quick reactions:

Categoricals are important, for example, to interpolate rasters (e.g., land use), and having the functionality out in the wild would help it get tested.

it would be useful to see whether this can provide a boost to the existing functionality we have for vectorizing rasters

knaaptime commented 11 months ago

(Need to add dask/dask-geopandas to the CI environment files to get the tests to execute.)

darribas commented 11 months ago

it would be good to do some profiling (including the sunk costs) against the in-core functions to give users a sense for when this actually provides a benefit

The notebook already has a bit on this. With the baseline data (in the small hundreds of observations) it's clearly overkill but, as soon as you move to a few thousand, it seems worthwhile. The main catch with all this is that Dask is just a way of making better use of the hardware you have. The example there runs on a beefy machine with 16 cores; if you have a machine with two cores, you won't see much benefit other than out-of-core support (i.e., when your data don't fit in memory).
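If it helps, here's a rough way to time both paths on your own data (a sketch reusing the names from the snippet in the PR description; absolute numbers will depend heavily on core count and data size):

```python
import time

from tobler.area_weighted import area_interpolate, area_interpolate_dask

# Single-core, in-core baseline
t0 = time.perf_counter()
incore = area_interpolate(sources, targets, categorical_variables=["land_use"])
print(f"in-core: {time.perf_counter() - t0:.1f}s")

# Dask version; if the result is lazy, add a .compute() so the timing is meaningful
t0 = time.perf_counter()
dasked = area_interpolate_dask(
    dsources, dtargets, "tract_id", categorical_variables=["land_use"]
)
print(f"dask:    {time.perf_counter() - t0:.1f}s")
```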

how heavy are dask/daskgpd? I assume we probably want those as optional dependencies

I'm not sure what you mean by "heavy"? I think they're mostly Python, but I'm not sure. Installation is straightforward with conda/mamba. I've added them to the environment.yml on the PR, and the included notebook runs on an environment created from that file. Dask is also widely embedded in the "stack" (e.g., xarray supports multi-core/out-of-core computation with it by default, scikit-learn can use it for grid search, etc.). For tobler, I'd say an optional dependency would be the way to go for now.

apart from the walkthrough, can you include a notebook with an actual use case (i.e. that shows what this is actually useful for)?

I can certainly add another illustration of how this can be used, but I'm not sure it'd add much. This PR doesn't really add new functionality (tobler already had categorical support); it provides a Dask-based way of running it. I can do another notebook with an illustration, but it'd be the same use case as with the single core. The benefit comes when you have tens/hundreds of thousands, or millions, of geometries to transfer, which I'm not sure I'd include as an example?

Categoricals are important, for example, to interpolate rasters (e.g., land use), and having the functionality out in the wild would help it get tested.

it would be useful to see whether this can provide a boost to the existing functionality we have for vectorizing rasters

It's slightly different. We could think of a way of vectorizing pixels and doing a spatial dissolve with Dask. I don't know if that'd be faster (it'd at least be parallel/out-of-core), but it's definitely different code (though a similar philosophy), so I'd be tempted to leave that for a different PR, and perhaps open an issue to remember this option in case we have the bandwidth (or need) to explore it in the future.

darribas commented 11 months ago

UPDATE: I think I've fixed the bug where the computation would sometimes bomb on the final groupby.

knaaptime commented 11 months ago

What I'd like to do is give users a sense of when they should reach for this versus the usual function. For that, it would be useful to have a more concrete example, maybe with an actual use case. I think it would be a real boon to folks if they understood what scale of dataset requires this function versus the normal one, and what actual questions you'd be asking at that scale. The NYC taxi points example Rocklin cooked up for the original dask-geopandas demo was a nice instance of this.

I have no doubt this is a useful addition to the codebase, but I just want to make sure we fully communicate that benefit to users.

it would be good to do some profiling (including the sunk costs) against the in-core functions to give users a sense for when this actually provides a benefit

The notebook already has a bit on this. With the baseline data (in the small hundreds of observations) it's clearly overkill but, as soon as you move to a few thousand, it seems worthwhile. The main catch with all this is that Dask is just a way of making better use of the hardware you have. The example there runs on a beefy machine with 16 cores; if you have a machine with two cores, you won't see much benefit other than out-of-core support (i.e., when your data don't fit in memory).

The notebook mentions:

NOTE: Timings below do not include the computation time required for spatial shuffling and partitioning (which can be substantial with large datasets), or for converting from geopandas. These are "sunk costs" that only make this approach preferable with large datasets, although they can be paid once and the result stored on disk efficiently (e.g., as Parquet files). Having said that, "large enough" is not very large in modern terms: from a few thousand observations, the gains will be substantial if several cores/workers are available.

So it would be nice to see some examples that do include the time required for shuffling, and a sense of a real dataset that requires this kind of processing.
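For instance, the one-off shuffle could be timed and amortized along these lines (a sketch; paths and partition counts are placeholders, and whether the spatial-partitioning metadata survives the Parquet round-trip depends on the dask-geopandas version):

```python
import time

import dask_geopandas
import geopandas

sources = geopandas.read_file("land_use.gpkg")  # placeholder path

# Pay the shuffle cost once and persist the partitioned table to disk
t0 = time.perf_counter()
dsources = dask_geopandas.from_geopandas(sources, npartitions=16).spatial_shuffle()
dsources.to_parquet("sources_shuffled/")
print(f"shuffle + write: {time.perf_counter() - t0:.1f}s")

# Later runs read the pre-shuffled table straight from disk
dsources = dask_geopandas.read_parquet("sources_shuffled/")
```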

how heavy are dask/daskgpd? I assume we probably want those as optional dependencies

I'm not sure what you mean by "heavy"? I think they're mostly Python, but I'm not sure. Installation is straightforward with conda/mamba. I've added them to the environment.yml on the PR, and the included notebook runs on an environment created from that file. Dask is also widely embedded in the "stack" (e.g., xarray supports multi-core/out-of-core computation with it by default, scikit-learn can use it for grid search, etc.). For tobler, I'd say an optional dependency would be the way to go for now.

Yep, that was the question. Let's move these to function-level imports, like we treat numba in libpysal and bokeh in splot (or h3 here in tobler). We can leave them in the conda-forge recipe so folks have access by default, but move them out of module-level imports so they're not required.
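i.e., roughly this pattern (a sketch with a simplified signature):

```python
def area_interpolate_dask(source_dgdf, target_dgdf, id_col, categorical_variables=None):
    # Import the optional dependencies only when the function is called, so
    # tobler itself stays importable without dask installed
    try:
        import dask.dataframe  # noqa: F401
        import dask_geopandas  # noqa: F401
    except ImportError as err:
        raise ImportError(
            "area_interpolate_dask requires the optional dependencies "
            "`dask` and `dask-geopandas`."
        ) from err
    ...
```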

apart from the walkthrough, can you include a notebook with an actual use case (i.e. that shows what this is actually useful for)?

I can certainly add another illustration of how this can be used, but I'm not sure it'd add much. This PR doesn't really add new functionality (tobler already had categorical support); it provides a Dask-based way of running it. I can do another notebook with an illustration, but it'd be the same use case as with the single core. The benefit comes when you have tens/hundreds of thousands, or millions, of geometries to transfer, which I'm not sure I'd include as an example?

What I mean is: one of the things the library struggles with is user documentation, and demonstrating to folks how to actually apply our tools. The current notebook is really focused on showing that the Dask implementation matches the existing one, and that when you 40x a single dataset, the Dask scaling kicks in as it should. That's a really nice resource for us developers, but it's still kinda opaque for users trying to figure out which tool to lift off the shelf. I can imagine lots of folks wondering whether their data is "big enough" for this to matter, so in an ideal world it would be nice to give them some orientation. Obviously we can't test datasets at a ton of different scales, but often a motivating use case is more insightful than an abstract one (and it sounded like maybe you had a land-use one on hand?). So that's what I was curious about.

I definitely wouldn't say this is a blocker, and I'm happy to merge even if we don't build out more examples, but I wanted to raise the issue while there's some energy behind this.

Categoricals are important, for example, to interpolate rasters (e.g., land use), and having the functionality out in the wild would help it get tested. it would be useful to see whether this can provide a boost to the existing functionality we have for vectorizing rasters

It's slightly different. We could think of a way of vectorizing pixels and doing a spatial dissolve with Dask. I don't know if that'd be faster (it'd at least be parallel/out-of-core), but it's definitely different code (though a similar philosophy), so I'd be tempted to leave that for a different PR, and perhaps open an issue to remember this option in case we have the bandwidth (or need) to explore it in the future.

Cool, that makes sense. I was just curious whether this might be useful as a drop-in scaler for some other parts of the package.

darribas commented 11 months ago

Too many threads here, all worth keeping track of:

knaaptime commented 10 months ago

Update:

@darribas and @knaaptime had a call on 8/15 to discuss the expansion and integration of Dask-based interpolation. Summarizing: we decided this PR likely has the logic/bones necessary to implement intensive/extensive interpolation (in addition to categorical), but the initial tests don't pass. We considered refactoring slightly, so that the distributed task only builds the correspondence table (not the interpolation/multiplication through the table), but in that case the final interpolation/aggregation after coming back from the workers looks very expensive. We've decided to move forward with the categorical option on Dask, with the idea that intensive/extensive should stay on the menu and may be a good task for someone to pick up in the near future.
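To make the idea concrete, the two steps look roughly like this in-core toy example (made-up data and geometries; in the refactoring, step 1 is what would run on the workers):

```python
import geopandas
from shapely.geometry import box

# Two unit-square sources with a population count, and one target square
# straddling both
sources = geopandas.GeoDataFrame(
    {"pop": [100, 200]}, geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)]
)
targets = geopandas.GeoDataFrame({"tid": [0]}, geometry=[box(0.5, 0, 1.5, 1)])

# Step 1 (distributed in the refactoring): build the correspondence table of
# source-target intersections and the share of each source they cover
inter = geopandas.overlay(
    sources.reset_index().rename(columns={"index": "sid"}),
    targets,
    how="intersection",
)
inter["weight"] = inter.area / inter["sid"].map(sources.area)

# Step 2 (back on the client): interpolate the extensive variable through the
# table; this final aggregation is the step that looked expensive at scale
estimates = (inter["pop"] * inter["weight"]).groupby(inter["tid"]).sum()
print(estimates)  # tid 0 -> 150.0
```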

codecov[bot] commented 10 months ago

Codecov Report

Merging #180 (17841d3) into main (df0cbc6) will increase coverage by 3.12%. The diff coverage is 98.18%.

@@            Coverage Diff             @@
##             main     #180      +/-   ##
==========================================
+ Coverage   51.39%   54.52%   +3.12%     
==========================================
  Files          14       15       +1     
  Lines         786      840      +54     
==========================================
+ Hits          404      458      +54     
  Misses        382      382              
Files Changed                                    Coverage Δ
tobler/area_weighted/area_interpolate_dask.py    98.07% <98.07%> (ø)
tobler/area_weighted/__init__.py                 100.00% <100.00%> (ø)
tobler/area_weighted/area_interpolate.py         94.52% <100.00%> (+0.72%) ↑

darribas commented 10 months ago

Yahoo! tests passing!