pysal / spreg

Spatial econometric regression in Python
https://pysal.org/spreg/

GSoC 2022 Interfaces for Consistent API Design #95

Open tdhoffman opened 2 years ago

tdhoffman commented 2 years ago

For GSoC 2022, I'm working on designing more consistent interfaces to PySAL's exploratory and inferential statistics classes. My mentors and I are exploring what might need to be done to

  1. render confederated packages compatible with the scikit-learn paradigm, and
  2. develop R-style Wilkinson formulas for modeling classes.

To these ends, we're interested in getting feedback on the desirability and feasibility of these changes from package leads and devs.

Excited to hear your input!
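To make the two candidate interfaces concrete, here is a minimal, purely illustrative sketch (toy names, not spreg or scikit-learn API): a scikit-style estimator with a separate `fit(X, y)` step, and a Wilkinson-formula front end that parses `"y ~ x"` against column data and dispatches to that same `fit` underneath.

```python
# Hypothetical sketch of the two interface styles under discussion.
# ToyOLS and from_formula are illustrative names, not actual spreg API.

class ToyOLS:
    """scikit-learn style: construct the estimator, then call fit(X, y)."""

    def fit(self, X, y):
        # Closed-form simple regression on a single column, for illustration.
        n = len(y)
        xbar = sum(row[0] for row in X) / n
        ybar = sum(y) / n
        sxy = sum((row[0] - xbar) * (yi - ybar) for row, yi in zip(X, y))
        sxx = sum((row[0] - xbar) ** 2 for row in X)
        self.coef_ = sxy / sxx
        self.intercept_ = ybar - self.coef_ * xbar
        return self


def from_formula(formula, data):
    """Wilkinson-formula style: 'y ~ x' against a dict of columns,
    dispatching to the fit(X, y) signature underneath."""
    lhs, rhs = (side.strip() for side in formula.split("~"))
    X = [[v] for v in data[rhs]]
    return ToyOLS().fit(X, data[lhs])


data = {"x": [0.0, 1.0, 2.0, 3.0], "y": [1.0, 3.0, 5.0, 7.0]}
model = from_formula("y ~ x", data)
print(model.coef_, model.intercept_)  # slope 2.0, intercept 1.0
```

The point of the sketch is that the two styles are not mutually exclusive: a formula layer can be a thin front end that builds `X` and `y` and hands them to an existing `fit` signature, which is roughly how statsmodels layers its formula API over its array API.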

knaaptime commented 2 years ago

Here's a very opinionated take, more to foster discussion than to stake out a position I plan to defend... I'm probably not even the best person to weigh in over here, but here are some musings anyway :P

Toward the sklearn question, which piece are you referring to? There are some pieces of their design that we've actively tried to mimic from the start with some packages (e.g., we designed spopt from its inception to use scikit-style mixins, and we use a similar base-class convention in segregation). Other pieces I'm not terribly sure about. I'm sure I'm missing something obvious, but tbh I've never really understood why model fitting is a separate method in sklearn. I'm sure there's a logic to it, but design patterns like that feel tedious to me. What are the use cases where you have a 'model' that you don't fit? Why doesn't the fit happen at init?

I suspect there will be some differing opinions here. Personally, I'd like to see the Wilkinson formulas implemented, especially here in spreg. But this is another question about which pieces of sklearn you want to bring over. We already have a signature similar to their Model.fit(X, y) here, but I'd much rather use formulas (even if those just dispatch to .fit(X, y) the way statsmodels does, so it should be relatively simple to support both, as long as we can do formulas). My bias is that I'm a social scientist, and formulas are a much more natural way to describe a model to me conceptually (i.e., one variable on a dataframe expressed as a function of others, as opposed to a collection of matrices, even if that's what we're ultimately operating on).

My take is that pysal's DNA originates in spatial econometrics, where the emphasis is on identification and consistency/efficiency. The primary audience for [many of] the package[s], then (especially this one), is folks who think according to that mental model--i.e., people who focus on the conceptual structure of the model and its constituent components, not just its predictive capacity. We want to lower the barrier to entry for using our tools, and if our primary audience is social scientists, that means mapping our inputs to their mental model. Maybe this gets nearer to the root of your question: how do we accommodate multiple audiences, and if those audiences have competing expectations, which gets priority, given our limited dev time? I think for things like spreg, it makes more sense to cater to social scientists than to ML folks and 'generic data scientists,' if that distinction makes any sense. There are things that just aren't canon over here (like the ubiquitous train/test split in ML... we're not doing checks on predictive accuracy, we're focused on isolating marginal relationships), so the workflows are often different and the API design should reflect that.

To me, that means you want some flexibility to change a model specification without modifying your whole dataset. I want to include a categorical variable in a formula, not manually add six new columns of one-hot encoded variables to a copy of a dataframe I might throw away if the model doesn't work out--even if that's what's happening behind the scenes. If we go the scikit route and just consume a design matrix, I'm forced to do the latter. There are also implications for the presentation of results. In scikit, you can pass a numpy array, and it will fit the model and give you back the results. If you want to inspect the coefficients, you don't have a dataframe, so no column names are attached to your estimates (that's fine for scikit because in ML they're often just generic 'features' anyway, not the primary focus of the model; here it's the opposite). Sure, you can map them back yourself, but that's an irritating extra step instead of giving me the info I need right away.
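The categorical-expansion chore described above is exactly what a formula layer could absorb. A minimal sketch, assuming hypothetical helper names (this is not existing spreg or formula-library code): expand one categorical column into dummy columns while generating readable names, so the resulting coefficients can be reported by name rather than by position.

```python
# Hypothetical sketch of what a formula layer would do behind the scenes
# when a categorical term appears: expand it to one-hot (dummy) columns,
# but keep the generated names attached to the design matrix so estimates
# can be reported by name. `expand_categorical` is an illustrative name.

def expand_categorical(name, values):
    """One-hot encode `values`, dropping the first level as the baseline."""
    levels = sorted(set(values))
    baseline, rest = levels[0], levels[1:]
    names = [f"{name}[{lvl}]" for lvl in rest]
    columns = [[1.0 if v == lvl else 0.0 for v in values] for lvl in rest]
    return names, columns


names, cols = expand_categorical("region", ["west", "east", "east", "south"])
print(names)  # ['region[south]', 'region[west]'] -- 'east' is the baseline
print(cols)   # [[0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0]]
```

The throwaway dummy columns still exist, but only inside the formula machinery; the user's dataframe is untouched, and the generated names travel with the design matrix into the results summary.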

tl;dr: my vote would be to have modeling functions/classes adopt a signature (and summary) more similar to their analogues in R. I'm not sure whether adopting scikit conventions gets us closer to that goal, but Wilkinson formulas definitely do. I think a good design inspiration for this package is much closer to linearmodels than to scikit-learn.

lanselin commented 2 years ago

Interesting discussion. Personally, I don't like the scikit-learn interface much for spatial regression/spatial econometrics, given its focus on prediction and cross-validation. Given the cross-sectional structure of spreg's methods, prediction doesn't actually make much sense. Instead, inference and interpretation of the direct and indirect effects matter, which scikit-learn and ML more generally don't really care about. I do think incorporating a formula interface would be useful--actually something we have talked about for a number of years, without a satisfactory solution so far. As I recall, the bottleneck for using existing frameworks is incorporating an efficient way to deal with spatially lagged variables. Maybe that is something to work on. I agree that having methods like summary, etc., would be useful, but to some extent that is already the case.
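To illustrate the spatially-lagged-variable bottleneck mentioned above: a formula term such as `lag(x)` would have to be resolved against a spatial weights object before the design matrix can be built, which standard formula frameworks have no notion of. A hedged, stdlib-only sketch with a dense toy weights matrix (the `spatial_lag` helper and the `lag(...)` term are illustrative assumptions, not an existing formula syntax):

```python
# Hypothetical sketch of resolving a `lag(x)` formula term: row-standardize
# a dense binary-contiguity matrix and compute W @ x, i.e. each
# observation's average of its neighbors' values. Illustrative only --
# real weights objects (e.g., in libpysal) are sparse, not dense lists.

def spatial_lag(w, x):
    """Row-standardize dense neighbor matrix `w` and return W @ x."""
    lagged = []
    for row in w:
        total = sum(row)  # assumes every observation has >= 1 neighbor
        lagged.append(sum(wij * xj for wij, xj in zip(row, x)) / total)
    return lagged


# Three observations on a line: 0-1 and 1-2 are neighbors.
W = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
x = [2.0, 4.0, 6.0]
print(spatial_lag(W, x))  # [4.0, 4.0, 4.0]
```

A formula framework for spreg would need a hook like this so that lag terms (and, by extension, spatial Durbin-style specifications) can be written in the formula itself rather than precomputed as extra columns.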

For me, developing a flexible formula framework to deal with the scope of spatial models would be a high priority.

knaaptime commented 2 years ago

Ever the master, I think @lanselin just compressed my six-page essay into six sentences, making largely the same argument.

His is the opinion that matters; mine's fluff. 😎