statsmodels / statsmodels

Statsmodels: statistical modeling and econometrics in Python
http://www.statsmodels.org/devel/
BSD 3-Clause "New" or "Revised" License

ENH: Computational Techniques for Variable Selection in Linear Regression Analysis #7835

Open kmantych opened 2 years ago

kmantych commented 2 years ago

I would like a way to perform different methods for variable selection, including all possible (best subset) regressions, forward selection, backward elimination, and stepwise regression.

In particular, I have been looking through the documentation to find a way to generate all possible regressions quickly (up to 30 regressors) and to generate a summary of the results. I was unable to find this functionality, but if it does already exist please correct me.

Describe the solution you'd like

I would like a simple method to generate all possible models, and print a summary of results that would look something like this:

[image: example summary table of all possible regressions, from Montgomery et al., Chapter 10]

Additional context

I have not contributed to any open source projects yet, however I can implement this if it is recommended by other people in the community. I am very open to advice and suggestions.

The image I used and a description of these different methods for variable selection are from:

Montgomery, Douglas C., et al. “Chapter 10: Variable Selection and Model Building.” Introduction to Linear Regression Analysis, John Wiley & Sons, Inc., Hoboken, 2021.
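
For concreteness, here is a rough brute-force sketch of the kind of enumeration and summary I have in mind (the helper name and the summary columns are just illustrative, not a proposed API):

```python
import itertools

import pandas as pd
import statsmodels.api as sm

def all_possible_regressions(y, X):
    """Fit every non-empty subset of the columns of X (brute force, so only
    feasible for a small number of regressors) and tabulate summary stats."""
    rows = []
    for k in range(1, X.shape[1] + 1):
        for combo in itertools.combinations(X.columns, k):
            res = sm.OLS(y, sm.add_constant(X[list(combo)])).fit()
            rows.append({
                "p": k,
                "regressors": ", ".join(combo),
                "rsquared": res.rsquared,
                "rsquared_adj": res.rsquared_adj,
                "mse_resid": res.mse_resid,
                "aic": res.aic,
            })
    return pd.DataFrame(rows).sort_values(["p", "rsquared"], ascending=[True, False])
```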

bashtage commented 2 years ago

Functions that are general (at least for forward/backward), in the sense that they would work for OLS or GLM and any model that is linear enough, would be useful.

josef-pkt commented 2 years ago

All-subset regression for 30 variables is a lot: 2**30 is around 1 billion models.

For a larger number of variables, penalized estimators like lasso, elastic net, or SCAD will be more efficient.
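
For example (alpha picked ad hoc here, just to illustrate that lasso zeroes out coefficients and so performs the selection implicitly):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.standard_normal(200)

# lasso (L1_wt=1.0); coefficients shrunk to exactly zero are dropped
res = sm.OLS(y, sm.add_constant(X)).fit_regularized(
    method="elastic_net", alpha=0.1, L1_wt=1.0)
selected = np.flatnonzero(res.params[1:])  # indices of the kept regressors
```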

I have implemented a sweep algorithm for all-subset regression in #6268. It might not have nice, user-friendly infrastructure around it yet. AFAIR, I went up to 15 variables in my trial runs. The PR failed to be merge-ready because it has a second part in it that I didn't have time to go over again.
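
To illustrate the basic idea only (this is a toy version of the sweep operator, not the code in that PR; I am assuming the Goodnight 1979 convention):

```python
import numpy as np

def sweep(A, k):
    """Sweep the cross-product matrix A on pivot index k.

    Sweeping [[X'X, X'y], [y'X, y'y]] on the indices of the included
    regressors leaves the coefficients in the last column and the residual
    sum of squares in the bottom-right corner; subsets that differ by one
    variable differ by a single sweep, which is what makes all-subset
    enumeration cheap."""
    B = A.astype(float).copy()
    d = B[k, k]
    B[k, :] /= d
    for i in range(B.shape[0]):
        if i != k:
            b = B[i, k]
            B[i, :] -= b * B[k, :]
            B[i, k] = -b / d
    B[k, k] = 1.0 / d
    return B

# toy check against lstsq (no intercept, for brevity)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(100)
Z = np.column_stack([X, y])
A = Z.T @ Z
for k in range(3):                 # "add" regressors 0, 1, 2 to the model
    A = sweep(A, k)
beta, rss = A[:3, 3], A[3, 3]
assert np.allclose(beta, np.linalg.lstsq(X, y, rcond=None)[0])
```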

Related: I experimented with model averaging based on AIC/BIC, but didn't write any clean code, AFAIR.

There is also best subset selection, which truncates the search tree and avoids computing all subset regressions. I looked into that once, but there is also no clean code for it.

I don't remember whether I looked at restricted subset regression with a maximum subset size, or what an efficient algorithm for that would be.

stepwise regression doesn't have a very good reputation, so I never worked much on it.

#1809

One case where using more carefully constructed loops would be helpful is when we have multi-column terms in the design matrix, for example categorical variables where we either want to include all or exclude all levels at the same time.
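
For illustration, with a formula/patsy design the term-to-column mapping is already available, so a selection loop could add or drop all columns of a term at once (the data here is made up):

```python
import numpy as np
import pandas as pd
from patsy import dmatrices

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "y": rng.standard_normal(20),
    "x1": rng.standard_normal(20),
    "g": list("abcd") * 5,          # categorical with 4 levels
})
y, X = dmatrices("y ~ x1 + C(g)", df, return_type="dataframe")

# term_name_slices maps each term to the block of design-matrix columns it
# generated, so C(g) can be included or excluded as one unit
for term, cols in X.design_info.term_name_slices.items():
    print(term, list(X.columns[cols]))
```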

#6248

Given the popularity of these methods, it would be good to have a selection of methods in statsmodels. I'm working every once in a while on it, but not very systematically.

bashtage commented 2 years ago

Best subset is only really viable for OLS.

The hard part of including these in statsmodels is that they are closer to ML. scikit-learn has actively decided not to include these from what I can tell.

kmantych commented 2 years ago

Thank you all for responding. @josef-pkt thanks for the links, they will be very helpful as I start to plan out my implementation and look at what others have done. I understand the concern about generating 2**K models. The book I referenced earlier mentions algorithms that can be used for optimization and have already been implemented (not 100% sure, but I believe in R (leaps()?), SAS, and Minitab):

There are several algorithms potentially useful for generating all possible regressions. For example, see Furnival [1971], Furnival and Wilson [1974], Gartside [1965, 1971], Morgan and Tatar [1972], and Schatzoff, Tsao, and Fienberg [1968]. The basic idea underlying all these algorithms is to perform the calculations for the 2**K possible subset models in such a way that sequential subset models differ by only one variable. This allows very efficient numerical methods to be used in performing the calculations. These methods are usually based on either Gauss–Jordan reduction or the sweep operator (see Beaton [1964] or Seber [1977]). Some of these algorithms are available commercially. For example, the Furnival and Wilson [1974] algorithm is an option in many computer programs.

@bashtage I will look further into implementations of these algorithms and see why scikit-learn has actively decided not to include these.

Based on the general opinion here and in the other threads, I will start with forward/backward selection for OLS, given its popularity and easier implementation, and go from there.

josef-pkt commented 2 years ago

Note that we are not allowed to copy/translate code from R unless it's compatible with our BSD-3 license. Most of R is GPL, which is not compatible.

kmantych commented 2 years ago

Good to know. I'll stick with the textbooks then.

kmantych commented 2 years ago

I included a link to a basic forward selection implementation. I think there is floating-point noise, so the more regressors that are considered, the less accurate the t-statistic becomes.

I didn't consider multi-column regressors (e.g. categorical).

https://github.com/kmantych/variable_selection/blob/main/Playing%20with%20Forward%20Selection.ipynb

Results from the test I performed matched the Minitab output displayed in my textbook, except for small differences in t-values.

bashtage commented 2 years ago

While I think t could be an option, it would be better to default to SSE (or log-likelihood in other models) to select the next step. This avoids needing to invert the regressor matrix, which could be problematic in some problems if the number of regressors is large or the p-value threshold is high (e.g. 1).
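
Something along these lines, as a sketch only (greedy, pandas DataFrame input assumed, no stopping rule beyond a maximum size):

```python
import statsmodels.api as sm

def forward_select_ssr(y, X, k_max):
    """Greedy forward selection: at each step add the candidate regressor
    that gives the largest drop in the residual sum of squares (ssr)."""
    selected, remaining = [], list(X.columns)
    for _ in range(min(k_max, len(remaining))):
        ssr = {}
        for cand in remaining:
            design = sm.add_constant(X[selected + [cand]])
            ssr[cand] = sm.OLS(y, design).fit().ssr
        best = min(ssr, key=ssr.get)
        selected.append(best)
        remaining.remove(best)
    return selected
```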

kmantych commented 2 years ago

Here is an implementation where ssr is used to select the next regressor. It produced the same results as the first implementation of forward selection for the three sets of data that I tested it on. https://github.com/kmantych/variable_selection/blob/main/forward_selection_v2.ipynb

One issue is that I am fitting models repeatedly to get ssr for each model that is considered. Is there an optimized way to do this? I timed both versions with the small datasets: v1 (correlation matrix) outperforms v2 (ssr) in terms of time for these smaller datasets (fewer than 10 regressors).

kmantych commented 2 years ago

Nevermind, I have an idea for optimization. Will post again after testing on larger data sets

bashtage commented 2 years ago

In a nutshell, you can do everything using univariate regressions by repeatedly orthogonalizing the data to the variables already in the model.
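
A rough sketch of that idea (assuming 1-d float arrays; the intercept is handled by centering):

```python
import numpy as np

def forward_select_orth(y, X, k_max):
    """Forward selection using only univariate fits: after a variable is
    added, project it out of y and out of the remaining candidates, so the
    next step again needs only simple (single-regressor) regressions."""
    y = y - y.mean()
    X = X - X.mean(axis=0)
    remaining = list(range(X.shape[1]))
    selected = []
    for _ in range(min(k_max, X.shape[1])):
        # drop in SSR from adding column j is (x_j'y)^2 / (x_j'x_j)
        scores = [(X[:, j] @ y) ** 2 / (X[:, j] @ X[:, j]) for j in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
        xb = X[:, best]
        # orthogonalize y and the remaining candidates against the chosen column
        y = y - xb * (xb @ y) / (xb @ xb)
        for j in remaining:
            X[:, j] = X[:, j] - xb * (xb @ X[:, j]) / (xb @ xb)
    return selected
```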

bashtage commented 2 years ago

@kmantych any progress on this?

kmantych commented 2 years ago

Will begin working on more efficient solutions now. Here are the simple solutions that I used for my class:

- generating all possible regressions (really slow solution, creates all 2^n possible models with no optimization): https://github.com/kmantych/variable_selection/blob/main/All_Models_Variable_Selection_V1(Brute_Force).ipynb
- backward elimination (the max number of models it will create is the number of regressors; you can choose to keep the const or allow the algorithm to remove it): https://github.com/kmantych/variable_selection/blob/main/backward_elimination_v1(Brute_Force).ipynb
- stepwise regression: https://github.com/kmantych/variable_selection/blob/main/stepwise_selection_V1(Brute_Force).ipynb

mbaudin47 commented 1 year ago

You might be interested in this implementation: https://openturns.github.io/openturns/latest/user_manual/response_surface/_generated/openturns.LinearModelStepwiseAlgorithm.html It provides forward, backward, and both-directions selection and is available from Python. It does not include the best subset algorithm, neither by exhaustive enumeration nor with leaps and bounds.

josef-pkt commented 11 months ago

I just found a reference discussing different approaches to feature selection / sparse (in parameter space) regression.

An issue of Statistical Science with 2 main articles and many discussion articles: https://projecteuclid.org/journals/statistical-science/volume-35/issue-4 (I did not look at them, except at a few abstracts.)