twang15 / PlatoAcademy


Data Standardization before modeling #15

Open twang15 opened 3 years ago

twang15 commented 3 years ago

Two seemingly conflicting goals: interpretability and feature importance

  1. A lot of software for performing multiple linear regression will provide standardised coefficients, which are equivalent to the coefficients you would get by manually standardising the predictors and the response variable before fitting (of course, it sounds like you are talking about only standardising the predictors).
  2. In cases where the metric does have meaning to the person interpreting the regression equation, unstandardised coefficients are often more informative.
  3. You can always convert standardised coefficients to unstandardised coefficients if you know the mean and standard deviation of the predictor variable in the original sample (see the sketch below).
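
A minimal sketch of that conversion (simple regression, numpy only, made-up data; assuming both the predictor and the response were z-scored before fitting):

```python
# Sketch: converting a standardized slope back to the original units,
# assuming both the predictor x and the response y were z-scored.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=200)          # predictor in original units
y = 3.0 * x + rng.normal(0, 5, size=200)  # response in original units

# Fit on z-scored data
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
b_std = np.polyfit(zx, zy, 1)[0]

# Convert back: b_unstd = b_std * s_y / s_x
b_unstd = b_std * y.std(ddof=1) / x.std(ddof=1)
print(b_std, b_unstd)  # b_unstd should be close to the true slope of 3.0
```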
twang15 commented 3 years ago

Use correlations and semi-partial correlations between the outcome variable and the predictors as feature importance: http://jeromyanglim.blogspot.com/2009/09/variable-importance-and-multiple.html
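
A rough sketch of that approach (made-up data, statsmodels assumed; the squared semi-partial correlation is computed here as the drop in R^2 when a predictor is removed from the full model):

```python
# Sketch: zero-order correlations and squared semi-partial correlations
# (increment to R^2 when each predictor is added last) as importance metrics.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["X1", "X2", "X3"])  # illustrative names
df["y"] = 2 * df.X1 + 0.5 * df.X2 + rng.normal(size=200)

predictors = ["X1", "X2", "X3"]
full = sm.OLS(df["y"], sm.add_constant(df[predictors])).fit()

for p in predictors:
    zero_order = df["y"].corr(df[p])
    reduced = sm.OLS(df["y"], sm.add_constant(df[[q for q in predictors if q != p]])).fit()
    sr2 = full.rsquared - reduced.rsquared   # squared semi-partial correlation
    print(f"{p}: r = {zero_order:.3f}, sr^2 = {sr2:.3f}")
```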

twang15 commented 3 years ago

At the very least you might need to standardize if you use regularization. Maybe not necessarily to zero mean / equal variance, but at least to something meaningful. Otherwise, regularization will effectively ignore (shrink away) the variables that happen to be measured in larger units.
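
A toy illustration of that point (scikit-learn assumed; the variable names and penalty strength are made up): the same predictor, expressed in larger units, needs a much larger coefficient and gets shrunk to zero.

```python
# Sketch: the same predictor in larger units (tonnes) vs. smaller units (kg)
# fares very differently under an un-standardized L1 penalty.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
weight_kg = rng.normal(70, 15, size=500)
y = 0.1 * weight_kg + rng.normal(0, 1, size=500)

for units, x in [("kg", weight_kg), ("tonnes", weight_kg / 1000)]:
    coef = Lasso(alpha=1.0).fit(x.reshape(-1, 1), y).coef_[0]
    print(units, coef)
# In tonnes the variable needs a coefficient of ~100, which the penalty
# shrinks to 0; in kg the same information survives. Standardizing first
# removes this arbitrary dependence on units.
```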

twang15 commented 3 years ago

For this reason, scaling by standard deviation (or standardization/normalization) is generally not recommended, especially when interactions are involved.

twang15 commented 3 years ago

For most purposes simple (unstandardized) effect size is more robust and versatile than standardized effect size. Guidelines for deciding what effect size metric to use and how to report it are outlined. Foremost among these are: (i) a preference for simple effect size over standardized effect size, and (ii) the use of confidence intervals to indicate a plausible range of values the effect might take.

twang15 commented 3 years ago

A conflicting viewpoint:

For comparing coefficients for different predictors within a model, standardizing gets the nod. (Although I don't standardize binary inputs. I code them as 0/1, and then I standardize all other numeric inputs by dividing by two standard deviations, thus putting them on approximately the same scale as 0/1 variables.)
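
A sketch of that scaling rule (pandas assumed; column names are made up; note that the full recipe also centers the non-binary inputs before dividing by two standard deviations):

```python
# Sketch: leave 0/1 binary inputs as-is; center other numeric inputs and
# divide by two standard deviations, putting them on roughly the same scale.
import numpy as np
import pandas as pd

def scale_two_sd(X: pd.DataFrame) -> pd.DataFrame:
    """Center and divide non-binary numeric columns by 2 SD; leave 0/1 columns alone."""
    out = X.copy()
    for col in out.columns:
        vals = out[col]
        if set(vals.unique()) <= {0, 1}:          # binary input: keep 0/1 coding
            continue
        out[col] = (vals - vals.mean()) / (2 * vals.std(ddof=1))
    return out

# Example usage with made-up data
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(45, 12, 100),
    "income": rng.lognormal(10, 0.5, 100),
    "smoker": rng.integers(0, 2, 100),
})
print(scale_two_sd(df).describe().round(2))
```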

twang15 commented 3 years ago

In regression, it is often recommended to center the variables so that the predictors have mean 0. This makes it easier to interpret the intercept term as the expected value of Y_i when the predictor values are set to their means. Otherwise, the intercept is interpreted as the expected value of Y_i when the predictors are set to 0, which may not be a realistic or interpretable situation (e.g. what if the predictors were height and weight?). However, centering/scaling does not affect your statistical inference in regression models - the estimates are adjusted appropriately and the p-values will be the same.
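
A small check of that claim (statsmodels formula API, made-up height/weight data): slopes and p-values are identical before and after centering; only the intercept changes.

```python
# Sketch: centering the predictors changes only the intercept's meaning;
# slopes, standard errors, and p-values are unchanged.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "height": rng.normal(170, 10, 200),
    "weight": rng.normal(70, 12, 200),
})
df["y"] = 2 + 0.5 * df.height + 0.3 * df.weight + rng.normal(0, 5, 200)

raw = smf.ols("y ~ height + weight", data=df).fit()

centered = df.assign(height=df.height - df.height.mean(),
                     weight=df.weight - df.weight.mean())
ctr = smf.ols("y ~ height + weight", data=centered).fit()

print(raw.params, raw.pvalues, sep="\n")
print(ctr.params, ctr.pvalues, sep="\n")
# Slopes and p-values match; only the intercept differs. After centering, the
# intercept is the expected y at average height and weight, not at height = weight = 0.
```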

Other situations where centering and/or scaling may be useful:

Note that scaling is not necessary in the last two bullet points I mentioned and centering may not be necessary in the first bullet I mentioned, so the two do not need to go hand in hand at all times.

twang15 commented 3 years ago

https://stats.stackexchange.com/questions/342140/standardization-of-continuous-variables-in-binary-logistic-regression?rq=1

  1. You don't need to standardize for ordinary (unregularized) logistic regression, as long as you keep units in mind when interpreting the coefficients.
  2. Standardizing can help with interpreting feature importance because then the coefficients are compared apples to apples (i.e. if two standardized continuous variables have coefficients of 0.01 and 0.7, then you know the second one is much more important; see the sketch below).
  3. For regularized logistic regression, continuous variables should be standardized for best results.
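
A sketch of points 1 and 2 (statsmodels, made-up data and effect sizes): the raw coefficients differ mostly because of units, while the z-scored coefficients are directly comparable per standard deviation.

```python
# Sketch: comparing logistic-regression coefficients before and after
# z-scoring the continuous predictors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
age = rng.normal(50, 10, n)                   # ~0.3 logit change per SD
income = rng.normal(60_000, 15_000, n)        # ~1.0 logit change per SD
logit = 0.03 * (age - 50) + (1.0 / 15_000) * (income - 60_000)
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([age, income])
Xz = (X - X.mean(axis=0)) / X.std(axis=0)

raw = sm.Logit(y, sm.add_constant(X)).fit(disp=0).params
std = sm.Logit(y, sm.add_constant(Xz)).fit(disp=0).params
print("raw units:", raw[1:])   # income's coefficient looks tiny only because of its scale
print("z-scored :", std[1:])   # per SD, income clearly has the larger effect
```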
twang15 commented 3 years ago

https://stats.stackexchange.com/questions/86434/is-standardisation-before-lasso-really-necessary

- Standardization is needed when using regularization.

twang15 commented 3 years ago

TO READ

  1. https://stats.stackexchange.com/questions/25690/multiple-linear-regression-for-hypothesis-testing#25707
  2. Significance of coefficients in multiple regression: significant t-test vs. non-significant F-statistic
  3. How can a regression be significant yet all predictors be non-significant
  4. F and t statistics in a regression
  5. Is R^2 useful or dangerous?
  6. Should I normalize/standardize/rescale
twang15 commented 3 years ago

Summary:

  1. Feature importance: Z-score transformation is needed for comparing feature importance (even though the common practice of doing this is questionable, it is how it is usually done).
  2. Feature selection: fitting the model iteratively and eliminating the features with relatively small coefficients (less important features) also requires Z-score transformation.
  3. Standardized and unstandardized coefficients can be converted to each other using the mean and std of each feature in the training dataset (if Z-score transformation was used to standardize the training data).
  4. Z-score transformation / centering / min-max normalization does not affect the statistical inference of multiple linear regression (ordinary least squares) or logistic regression; the estimates are rescaled accordingly and the p-values stay the same.
  5. Z-score transformation is required for LASSO / ridge regression.
  6. Centering can eliminate the multi-collinearity between X and its quadratic term X^2 (see the sketch after this list).
  7. Centering does not necessarily require subtracting the sample mean from each observation; the subtracted quantity can be any other meaningful value, such as the mean of a contrast group.
  8. Z-score transformation / normalization of dummy variables is unnecessary for linear / logistic regression.
  9. In medical/financial modeling, Z-score transformation hurts model interpretation; standardized coefficients can be converted back to unstandardized coefficients as stated in 3.
  10. Normalization: min-max normalization, transforms inputs into the [0, 1] range.
  11. Standardization: Z-score transformation, (x - u) / s.
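
A quick check of point 6 (numpy only, made-up data):

```python
# Sketch: centering X before squaring removes most of the collinearity
# between the linear and quadratic terms.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 5, size=1000)   # strictly positive predictor
xc = x - x.mean()                  # centered copy

print("corr(x,  x**2 ) =", round(np.corrcoef(x, x**2)[0, 1], 3))    # close to 1
print("corr(xc, xc**2) =", round(np.corrcoef(xc, xc**2)[0, 1], 3))  # close to 0
```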
twang15 commented 3 years ago

Importance of Feature Scaling