pysal / tobler

Spatial interpolation, Dasymetric Mapping, & Change of Support
https://pysal.org/tobler
BSD 3-Clause "New" or "Revised" License

Aggregating on something other than area #137

Open adamConnerSax opened 3 years ago

adamConnerSax commented 3 years ago

Tobler is great! Thanks so much for providing it.

I'm wondering if it would be possible to weight some intensive variables using some column from the data rather than the area itself? I'm thinking, for example, of population density when aggregating census data. In some sense, the density of a larger area is just the area-weighted density of the components, sure.

But one might also want the population weighted aggregation, rather than the area-weighted one. And if the source_df had the population in a column, I imagine it might be easy enough to use that as a weight instead of the area. You'd still need to use area overlap to get the fraction of the weight in each target geometry.

Maybe a 3rd category of variables, specified by a dictionary of columns keyed by the column to use as a weight in place of area? Or is there some simple way of doing this that I am missing?

Thanks!

sjsrey commented 3 years ago

Great idea. This is similar in spirit to Dasymetric interpolation. A PR to flesh out the specifics of what you have in mind would be welcomed.

adamConnerSax commented 3 years ago

I'm not a python person. I've managed to get a script working, using tobler to get some data in the form I need. But I don't think I can write good enough python, or figure out how Tobler works well enough, to usefully do any of this myself. And I know that's annoying! If I could see how to do it, I would've tried it before opening the issue...

Is there a different way I could be helpful?

sjsrey commented 3 years ago

Coming up with new use cases is helpful :->

One thought on what you are suggesting is that if population is a column in the source dataframe, then I'm not sure that is going to allow for what you have in mind, since by definition, the population density of the source area is pop/area so for that polygon you would have "constant density" when you split it apart. In other words, the area and population weights would be identical. Unless I'm not grasping what you are after?

adamConnerSax commented 3 years ago

I guess what I am imagining, e.g., is some larger target geometry made up of two smaller source geometries, one high-density, for concreteness, say 50 Sq km and 10 million people, and a second which is low density, say 950 Sq km and 0 people.

So the areal_interpolation density would be 10 million/1000 Sq km = 10000 people/Sq km.

But all the people live at much higher density, so you weight by (pop*overlap area) and get 10 million/50 Sq km = 200000 people/Sq km.

Does that make sense? I think you are replacing the areal weight, in the numerator and denominator, with col*areal_weight.
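The arithmetic in this example can be checked with a small sketch. It assumes both source polygons lie entirely inside the target (overlap fraction 1.0) and reads "col*areal_weight" as population times the fraction of each source that falls inside the target. The function and variable names are illustrative and are not part of tobler.

```python
def weighted_density(densities, weights):
    """Weighted mean of source densities."""
    total = sum(weights)
    return sum(d * w for d, w in zip(densities, weights)) / total

# Two source polygons, both assumed to sit fully inside one target.
areas = [50.0, 950.0]          # sq km
pops = [10_000_000.0, 0.0]     # people
fractions = [1.0, 1.0]         # share of each source inside the target

densities = [p / a for p, a in zip(pops, areas)]

# Area weighting: weight = overlap area.
area_weights = [a * f for a, f in zip(areas, fractions)]
area_weighted = weighted_density(densities, area_weights)

# People weighting: weight = population * overlap fraction.
pop_weights = [p * f for p, f in zip(pops, fractions)]
pop_weighted = weighted_density(densities, pop_weights)

print(area_weighted)  # 10000.0
print(pop_weighted)   # 200000.0
```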

knaaptime commented 3 years ago

i think what you're proposing is target density weighting a la https://pubmed.ncbi.nlm.nih.gov/28260826/ ?

sjsrey commented 3 years ago

I see what you are getting at. But the estimated density for the target geometry (in the approach you suggest) would imply a total population of 200 million = 1000 km2 * 200000 p/km2 for that polygon.

If you did extensive interpolation for the numerator (population) and denominator (area) and then formed the density estimate for the target polygon it would be 10000/km2
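A quick check of the two figures in this comment, as a sketch that again assumes both sources fall wholly inside the target, so extensive interpolation reduces to a plain sum. The variable names are made up for illustration.

```python
# Assumed setup: both sources are fully contained in the target.
pops = [10_000_000, 0]   # people
areas = [50, 950]        # sq km

# Extensive interpolation of numerator and denominator.
target_pop = sum(pops)
target_area = sum(areas)

# Density formed from the two interpolated totals.
ratio_density = target_pop / target_area

# Total population implied by applying the people-weighted
# density (200000 per sq km) to the whole target area.
people_weighted_density = 200_000
implied_pop = people_weighted_density * target_area

print(ratio_density)  # 10000.0
print(implied_pop)    # 200000000
```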

adamConnerSax commented 3 years ago

@knaaptime I see the connection but I don't think I am proposing anything that complex. As I understand it, TDW uses the target characteristics in order to allocate a source which overlaps multiple targets. What I am thinking about would still apply even if my sources were entirely inside one target, as in my example.

@sjsrey Yes, so the target variable isn't "population density" anymore but "Experienced Population Density" or "people-weighted density" or something. But for purposes of understanding the behavioral consequences of density (voting, in my particular case) I care more about the density the average person experiences than I care about the actual density. So if one congressional district is a city surrounded by empty farmland and the other uniform suburbs, they may have the same density but very different people-weighted densities.

Perhaps another example is useful? Suppose I have a column with the average income in each source geometry. That's clearly not extensive. But it's also not intensive. Suppose I have 2 equal size source geometries, one with a million people, earning, on average, $35,000, and the other with one person earning $1,000,000. Extensive interpolation would give me average income of $1,035,000 and intensive would give me $517,500, right?
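For reference, a minimal sketch of those two numbers. It assumes both sources are fully inside a single target and treats the "equal size" geometries as having unit area. The names are illustrative only.

```python
# Two equal-area sources, both assumed to lie inside one target.
areas = [1.0, 1.0]
avg_income = [35_000.0, 1_000_000.0]

# Treating average income as extensive: the values are summed.
extensive_result = sum(avg_income)

# Treating it as intensive: area-weighted mean of the values.
intensive_result = (
    sum(v * a for v, a in zip(avg_income, areas)) / sum(areas)
)

print(extensive_result)  # 1035000.0
print(intensive_result)  # 517500.0
```

Neither figure is close to what a typical resident earns, which is the point of the example.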

But I think I've answered my own question. I can multiply the quantity of interest by the weight (people), do extensive interpolation on the product, and then divide by the (extensively interpolated) total population to get the people-weighted result. Which is simple enough to perhaps not be worth a feature in Tobler. Though it would be convenient...
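A rough sketch of that recipe, applied to the income example. It does not call tobler: the overlap fractions are written out by hand, and `extensive_interpolate` and `people_weighted` are hypothetical helpers. With tobler, the two extensive steps would presumably be done by passing the product column and the population column as `extensive_variables` to `area_interpolate`.

```python
def extensive_interpolate(values, overlap):
    """Allocate source totals to targets by overlap fraction.

    overlap[i][j] is the share of source i's area inside target j.
    """
    n_targets = len(overlap[0])
    return [
        sum(v * row[j] for v, row in zip(values, overlap))
        for j in range(n_targets)
    ]

def people_weighted(values, pops, overlap):
    """Population-weighted mean of `values` for each target."""
    products = [v * p for v, p in zip(values, pops)]
    numer = extensive_interpolate(products, overlap)
    denom = extensive_interpolate(pops, overlap)
    # Targets that receive no population get NaN.
    return [n / d if d else float("nan") for n, d in zip(numer, denom)]

avg_income = [35_000.0, 1_000_000.0]
pops = [1_000_000.0, 1.0]
overlap = [[1.0], [1.0]]   # both sources fully inside one target

result = people_weighted(avg_income, pops, overlap)
print(result[0])  # roughly 35001, close to what almost everyone earns
```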

knaaptime commented 2 years ago

it took me awhile, but i realized a similar example that makes obvious sense to me is volume. If you were trying to estimate population at the building-level from like census block data (which would be a little bold...), the relevant variable on the target gdf is building volume, not area. Not exactly sure the best way to implement yet, but definitely agree it would be a useful enhancement
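One way to picture the volume idea, as a sketch rather than a proposed implementation: split a block's population across buildings in proportion to footprint area times height. The building figures below are invented, and `allocate_by_weight` is a hypothetical helper, not a tobler function.

```python
def allocate_by_weight(total, weights):
    """Split `total` across items in proportion to `weights`."""
    weight_sum = sum(weights)
    return [total * w / weight_sum for w in weights]

block_pop = 1200.0                      # people in one census block
footprints = [200.0, 400.0, 400.0]      # building footprints, sq m
heights = [30.0, 10.0, 5.0]             # building heights, m

volumes = [a * h for a, h in zip(footprints, heights)]

by_area = allocate_by_weight(block_pop, footprints)
by_volume = allocate_by_weight(block_pop, volumes)

print(by_area)    # [240.0, 480.0, 480.0]
print(by_volume)  # [600.0, 400.0, 200.0]
```

The tall building gets the smallest share under area weighting and the largest under volume weighting, while both allocations still sum to the block total.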