pysal / tobler

Spatial interpolation, Dasymetric Mapping, & Change of Support
https://pysal.org/tobler
BSD 3-Clause "New" or "Revised" License

revisit regression approach #41

Closed · knaaptime closed this 4 years ago

knaaptime commented 4 years ago

Currently the linear_model function does not work as I would expect. Right now, it calls rasterstats to get pixel counts for the source_df, then estimates a model that relates pixel counts to polygon population. It then takes that relationship as given (i.e. pixels become non-stochastic weights that relate each pixel value to a certain number of people), ignores the model's error term, and sums up the pixels in the target_df to estimate population. We badly overshoot with this approach.

Instead we should:

1) call rasterstats to get pixel counts in source_df
2) estimate a model based on pixel counts and population
3) call rasterstats to get pixel counts in target_df
4) use the model to simulate values in the target_df

That should give us proper model-based estimation and get us away from the need to create a correspondence table, mask out non-overlapping pixels, etc. (and hopefully speed things up). A sketch of these steps follows below.
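A minimal sketch of those four steps, assuming a categorical land-cover raster and a statsmodels fit; the names nlcd_path, source_df, target_df, and the "population" column are placeholders for illustration, not tobler's actual API:

```python
import pandas as pd
import statsmodels.api as sm
from rasterstats import zonal_stats


def pixel_counts(gdf, raster_path):
    # count of each raster class (e.g. NLCD land-cover code) per polygon
    stats = zonal_stats(gdf, raster_path, categorical=True)
    return pd.DataFrame(stats, index=gdf.index).fillna(0)


# 1) pixel counts in the source polygons
X_source = pixel_counts(source_df, nlcd_path)

# 2) model relating pixel counts to source-polygon population
model = sm.GLM(
    source_df["population"],
    sm.add_constant(X_source),
    family=sm.families.Poisson(),
).fit()

# 3) pixel counts in the target polygons, aligned to the source columns
X_target = pixel_counts(target_df, nlcd_path).reindex(
    columns=X_source.columns, fill_value=0
)

# 4) model-based estimates for the target polygons
estimates = model.predict(sm.add_constant(X_target))
```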

would you agree @renanxcortes ?

knaaptime commented 4 years ago

Ok, more thoughts. Clearly the above is a bit scatterbrained... obviously we're ignoring the error terms (what, are we going to add random noise?). I was tripping up over the language that we're "calculating population for each pixel," which is not what we're doing in the regression.

Here's where I think the stickiness is: we're trying to use a shared set of functions for the area-weighted and model-based approaches, and that can cause confusion.

We originally built the raster/polygon functions for use with the area-weighted approaches, where pixel values represent population weights. In that case, we have a direct relationship between pixels and people (one that is strictly positive) and we are actually modeling population at the pixel level. With the model-based approaches this is no longer the case; we're modeling the relationship between some mix of pixel types and the polygon-level population. Critically, pixel values are now coefficients instead of weights, because (1) it's perfectly reasonable for some of them to be negative and (2) there are probably non-trivial interaction effects, so we can't talk about a direct relationship between pixels and population when there are complex relationships between groups of pixels.

So while I didn't describe it well above, I still think we should move toward this approach:

- call rasterstats to get pixel counts in source_df
- estimate a model based on pixel counts and population
- call rasterstats to get pixel counts in target_df
- use the model to simulate values in the target_df

knaaptime commented 4 years ago

Doing so would mean we can get around the costly "estimating population per pixel" and "Estimating target polygon values" loops and instead just use rasterstats to grab pixel counts and multiply the coefficients through in one shot.
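Continuing the sketch above (still an assumption about the eventual implementation, not existing tobler code), the target estimates then reduce to a single matrix-vector product of per-polygon pixel counts and the fitted coefficients, with no per-pixel loop:

```python
import numpy as np

beta = model.params                 # one coefficient per pixel class, plus the intercept
design = sm.add_constant(X_target)  # pixel counts per target polygon, from the sketch above
estimates = np.exp(design @ beta)   # exp() inverts the Poisson GLM's log link
```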

renanxcortes commented 4 years ago

It's hard to write here right now; I can describe it in more detail later (do you have WhatsApp? I can send you an audio message)... But pixels of the same type have the same beta (which is the weight), but not necessarily the same Correction Term. Therefore, we need to do this multiplication at the pixel level and then sum up after that... This multiplication is the "pixel population"... So I'm not really sure how to work around this easily.
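A toy illustration of the distinction being described here (the arrays and values are made up for illustration, not taken from tobler): every pixel of a given class shares a coefficient, but each pixel can carry its own correction factor, so the product has to be formed per pixel before summing.

```python
import numpy as np

pixel_class = np.array([21, 21, 22, 23, 22])      # land-cover code of each pixel in one polygon
beta = {21: 0.5, 22: 1.2, 23: 2.0}                # one coefficient ("weight") per pixel class
correction = np.array([0.9, 1.1, 1.0, 0.8, 1.3])  # local correction term, varying pixel by pixel

# "pixel population": coefficient times correction, formed per pixel, then summed
pixel_pop = np.array([beta[c] for c in pixel_class]) * correction
local_estimate = pixel_pop.sum()

# the shortcut discussed above drops the per-pixel correction term
counts = {c: int((pixel_class == c).sum()) for c in beta}
shortcut_estimate = sum(beta[c] * counts[c] for c in beta)
```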

knaaptime commented 4 years ago

If we have pixel counts for each polygon (in both source and target), then why do we need a correction term? Can't we just multiply the target_df pixel counts by the coefficients?

renanxcortes commented 4 years ago

The Correction Term is a local population rescaling factor. By only multiplying the target pixel counts by the weights, you lose this constraint... although you could apply a global rescaling factor afterward to ensure that the original population matches the interpolated population. For the global case, the pixel population would not be needed and things would indeed be way faster. But for local consistency, this performance cost seems to be the price we need to pay.
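For the global case, a minimal sketch of that rescaling (reusing estimates and source_df from the earlier sketch; the "population" column name is still a placeholder): scale the raw model-based estimates so their total matches the source population total.

```python
# global rescaling: force the interpolated total to match the source total
scale = source_df["population"].sum() / estimates.sum()
target_estimates = estimates * scale
```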

knaaptime commented 4 years ago

Ah, I see.

But maybe this rescaling is where we need to focus our attention, because the global scaling is indeed way off.

knaaptime commented 4 years ago

...which is confusing, because shouldn't our global total be pretty close if our individual totals are correct?

knaaptime commented 4 years ago

A reasonable amount of this is just the default specification. I think we're leaving a lot of information on the table by only including the 4 developed pixel types by default.

[figure: error distributions under the default and expanded model specifications]

The bottom panel shows the error distribution when I include an intercept and logged versions of all the available pixel values. It's a pretty wide distribution ultimately, but also tightly centered on 0, and I think a lot of that range may come from the rescaling we're doing.

Ultimately, maybe the best thing to do is just allow the rescaling to be parameterized so users can opt for local, global, or none.
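A sketch of the expanded specification described above, reusing X_source from the earlier sketch; using log1p (rather than log) to handle polygons with zero pixels of a class is my assumption, as is the idea of a rescale flag:

```python
import numpy as np
import statsmodels.api as sm

# intercept plus logged counts of every available pixel class
X_logged = np.log1p(X_source)
wide_model = sm.GLM(
    source_df["population"],
    sm.add_constant(X_logged),
    family=sm.families.Poisson(),
).fit()

# a hypothetical rescale option ("local", "global", or None) could then decide
# whether and how the resulting predictions are adjusted afterward
```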

knaaptime commented 4 years ago

In #40 I added a function, glm, that implements what I outline above. It seems to outperform the pixel-adjusted approach by a bit.