pysal / mgwr

Multiscale Geographically Weighted Regression (MGWR)

https://mgwr.readthedocs.io/

BSD 3-Clause "New" or "Revised" License


enh: support patsy model formulas #77

Open knaaptime opened 4 years ago

knaaptime commented 4 years ago

Similar to what I've just raised over at spreg, it would be a really nice addition to allow model specifications via patsy formulas. In this case, it would kill two birds with one stone, since I notice the predict method hasn't been implemented yet, and including a patsy API would go a long way towards addressing #47.

I can get started working on this if folks agree, but, as with spreg, I'd be interested in (1) whether folks want to include this addition and (2) what a good API strategy would look like.

ljwolf commented 4 years ago

It'd be great.

For history, I had a proof of concept for spreg back in 2016, but that got bogged down with (1) concerns about changing APIs and (2) ambiguity about how to denote instrumental variable formulas (see nlm in R for an example).

The proposed solution was to implement it in a separate formula module, like statsmodels does. We pushed for it in a GSoC and then didn't accept the applicant.



darribas commented 4 years ago

Just a crazy and probably dumb idea but will throw it out there just in case there's some value. Would it make sense to have an "umbrella" module for all models in pysal that implements this formula approach but allows us to do it somehow "on top" of all the packages we have that implement models?

I'm thinking something where the user could pass a formula, a GeoDataFrame, and either the class or a str for the model they want to run, and the module/method would do the magic of dispatching everything. If well-designed, it'd be much easier to use from the user's perspective, and it'd also allow us to benefit from having pysal as a "package of packages"/federation, unifying APIs where possible across modules.
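A minimal sketch of what such a dispatcher could look like, using only the standard library. Everything here is hypothetical: `MODEL_REGISTRY`, `register`, `fit`, and `ToyModel` are illustrative names, not part of pysal, and `ToyModel` only records its inputs where a real estimator would build a design matrix and fit.

```python
from typing import Any, Callable, Dict, Union

# Hypothetical registry mapping short names to model callables.
MODEL_REGISTRY: Dict[str, Callable[..., Any]] = {}


def register(name: str):
    """Class decorator that makes a model available under a string name."""
    def decorator(model):
        MODEL_REGISTRY[name] = model
        return model
    return decorator


def fit(formula: str, data: Any, model: Union[str, Callable[..., Any]], **kwargs):
    """Resolve `model` (a class or a registered name) and pass it the formula and data."""
    if isinstance(model, str):
        try:
            model = MODEL_REGISTRY[model]
        except KeyError:
            known = sorted(MODEL_REGISTRY)
            raise ValueError(f"unknown model {model!r}; known models: {known}") from None
    return model(formula, data, **kwargs)


@register("toy")
class ToyModel:
    """Stand-in for a real estimator; it only records what it was given."""

    def __init__(self, formula, data, **kwargs):
        self.formula = formula
        self.data = data
        self.options = kwargs


# The model can be given either as a registered string or as the class itself.
by_name = fit("hoval ~ crime + income", {"hoval": [1.0]}, "toy", bw=50)
by_class = fit("hoval ~ crime + income", {"hoval": [1.0]}, ToyModel)
```

Keeping the registry separate from the model classes would let each package register its own estimators without the umbrella module importing them eagerly, though that is a design assumption rather than anything decided in this thread.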

What do you think?

ljwolf commented 4 years ago

I really like that! We'd need to spec out 4 things with this, I think. Only the first was on my radar before... let's consider hoval = crime + income.

1. What about autoregression? Before, something I had suggested was defining an operator to specify that something was simultaneous autoregressive. In that proposal, something like r(hoval) = crime + income was SAR-lag, and hoval = crime + income + r() was SAR-Error. HAC estimators would still need to be specified in a keyword, I think.
2. What about instruments?
3. Now that we have mgwr, what about locality? Same as above, we could define an l() function to mean "local", so that hoval = l(crime) + l(income) is an MGWR for crime & income, but hoval = l(crime + income) is a GWR, and hoval = l(crime) + income is a semiparametric GWR with only a local term for crime. @TaylorOshan, perspective?
4. With spvcm, what about multilevels? We'd need to figure out an lme4-style syntax, in addition to a spreg-style autoregressive indicator, since patsy doesn't understand the pipe-plus-grouping syntax, (effect | group).
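To make the l() proposal concrete, here is a rough sketch of how a parser might tell the three cases apart. It is standard-library Python only; `split_top_level` and `classify` are hypothetical helpers, the formula syntax follows the `=` notation used in this comment rather than patsy's `~`, and nothing here is an existing mgwr or patsy feature.

```python
import re


def split_top_level(expr: str, sep: str = "+"):
    """Split `expr` on `sep`, ignoring separators nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in expr:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return [p for p in parts if p]


# A term counts as "local" when it is wrapped entirely in l(...).
LOCAL_TERM = re.compile(r"^l\((.*)\)$")


def classify(formula: str) -> str:
    """Map a formula in the proposed notation to the model family it implies."""
    _, rhs = formula.split("=", 1)
    local_groups, global_terms = [], []
    for term in split_top_level(rhs):
        match = LOCAL_TERM.match(term)
        if match:
            # Variables inside one l(...) would share a single bandwidth.
            local_groups.append(split_top_level(match.group(1)))
        else:
            global_terms.append(term)
    if not local_groups:
        return "global regression"
    if global_terms:
        return "semiparametric GWR"
    return "GWR" if len(local_groups) == 1 else "MGWR"


for f in ("hoval = l(crime) + l(income)",
          "hoval = l(crime + income)",
          "hoval = l(crime) + income"):
    print(f, "->", classify(f))
```

A real implementation would more likely hook into the formula parser itself than use regular expressions, and it would have to decide what several l() groups plus global terms should mean; this sketch simply labels that case semiparametric.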

TaylorOshan commented 4 years ago

I recall chatting about this a few years back. Looking at the four issues above that need to be addressed in order to produce a module-wide formula API, I sense two different situations. One is some kind of functionality that creates a design matrix to be passed to a method; the other, which could satisfy all four of the above points, is a dispatcher that allows one or more methods to be called by specifying only a formula. In terms of mgwr, I think it would be really neat to have a formula-based API that would allow you to deploy all the different variations of gwr/mgwr/semiparametric, though I wonder if this would be too specific to this type of method. For example, if we have a single API that accommodates all four points above, are we opening users up to the possibility of easily specifying nonsensical models? Perhaps a simple, module-wide API for building design matrices would be a good place to start, and then we could build module-specific tweaks and dispatchers on top of it?

knaaptime commented 4 years ago

I was thinking along the same lines as Taylor. Ideally we could have a dispatcher that lives in libpysal and provides a robust way of expressing lots of different models using only a formula. If we're going to put some real effort into this, this is probably the "right" way because it opens the door to a wider variety of model specs.

As a first cut, though, we could use patsy just to prepare input data for the existing models (i.e. where models live in their own classes), if for no other reason than to make it easier to use geodataframes. Responding also to @lanselin's comment from the other thread:

> not only is there a potential issue with spatial lags, there are also regime variables. how would those fit into the patsy syntax? same with spatially lagged explanatory variables (SLX, spatial Durbin), ideally computed on the fly (but not in the current implementation). and where would the weights be specified?

I think we could use something like the groups and re_formula arguments for spreg regimes and spvcm grouping variables, like statsmodels does for multilevel models (in R, more nlme than lme4, where random is specified separately). I think a stateful transform might work for lagged explanatory variables, but the shortest path would probably be to have grouping/regime/W/additional lags in separate arguments, similar to the way it's handled now.

I was looking into some of these ideas here. It seems to work pretty well for mgwr. It fails for spreg though... I don't think it's related to patsy per se, but I'm also stumped for other ideas.

knaaptime commented 4 years ago

An additional small thing is the way intercepts are handled. Our packages expect matrices without the constant, so right now patsy strings need to exclude the intercept.
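For illustration, this is roughly what excluding the intercept looks like with patsy's `dmatrices`. The data values are made up, and the column names just follow the hoval/crime/income example used earlier in the thread.

```python
import pandas as pd
from patsy import dmatrices

# Toy data; the values are invented for the example.
df = pd.DataFrame({
    "hoval": [80.5, 44.6, 26.4, 33.2],
    "crime": [15.7, 18.8, 30.6, 32.4],
    "income": [19.5, 21.2, 16.0, 4.5],
})

# By default patsy adds an "Intercept" column to the design matrix.
_, X_with = dmatrices("hoval ~ crime + income", df, return_type="dataframe")

# Appending "- 1" drops it, which matches models that add the constant themselves.
y, X = dmatrices("hoval ~ crime + income - 1", df, return_type="dataframe")

print(list(X_with.columns))
print(list(X.columns))
```

One caveat: with categorical terms, dropping the intercept makes patsy switch the first categorical factor to full-rank coding, so the resulting columns are not simply the default ones minus the constant.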

darribas commented 4 years ago

I have similar thoughts. In an ideal world, the formula would describe the entire model, but I'm not sure there's a formula grammar for spatial models at the moment. To get started, I also think a good first step might be to use patsy and traditional formulas for the non-spatial component and then specify the class it is to be sent to as a separate argument.

The more I think about this, the more excited I get. This is also not a trivial task. Would it make sense to add it as a potential GSoC project? I think we should still be in time?

