Open twang15 opened 3 years ago
Linear Mixed Effect Model
Notes on Statistical Modeling (causal modeling) vs. Machine Learning (predictive modeling)
There's no statistical reason to prefer one to the other, besides conceptual clarity. Although the reported deviance values are different, these differences are completely due to the saturated model. So any comparison using relative deviance between models is unaffected, since the saturated model log-likelihood cancels.
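The note above does not say which two formulations are being compared. The sketch below assumes it is the grouped (binomial counts) versus ungrouped (one 0/1 row per trial) form of the same logistic-type model. The data and the names `groups`, `null_fit` and `x_fit` are made up for illustration. It shows the reported deviances differing while the difference in deviance between two nested models is the same in both forms.

```python
import math

# Hypothetical grouped data: (x, trials, successes).
groups = [(0, 10, 3), (0, 10, 5), (1, 10, 6), (1, 10, 8)]

def bernoulli_loglik(ps):
    """Ungrouped form: every trial is its own 0/1 observation."""
    return sum(y * math.log(p) + (n - y) * math.log(1 - p)
               for (_, n, y), p in zip(groups, ps))

def binomial_loglik(ps):
    """Grouped form: each row is one binomial count, adding a log C(n, y) term."""
    return bernoulli_loglik(ps) + sum(math.log(math.comb(n, y)) for _, n, y in groups)

def deviance_ungrouped(ps):
    # The saturated model fits each 0/1 outcome exactly, so its log-likelihood is 0.
    return 2 * (0.0 - bernoulli_loglik(ps))

def deviance_grouped(ps):
    # The saturated model uses the observed proportion y/n in each group.
    saturated = [y / n for _, n, y in groups]
    return 2 * (binomial_loglik(saturated) - binomial_loglik(ps))

def pooled_rate(level):
    """Closed-form MLE of the success probability for one level of x."""
    trials = sum(n for x, n, _ in groups if x == level)
    successes = sum(y for x, _, y in groups if x == level)
    return successes / trials

# Two nested models, fitted in closed form (one fitted probability per group row).
total_trials = sum(n for _, n, _ in groups)
total_successes = sum(y for _, _, y in groups)
null_fit = [total_successes / total_trials] * len(groups)  # one common probability
x_fit = [pooled_rate(x) for x, _, _ in groups]             # one probability per level of x

diff_ungrouped = deviance_ungrouped(null_fit) - deviance_ungrouped(x_fit)
diff_grouped = deviance_grouped(null_fit) - deviance_grouped(x_fit)

print(deviance_ungrouped(null_fit), deviance_grouped(null_fit))  # reported deviances differ
print(diff_ungrouped, diff_grouped)                              # model comparison agrees
```

The binomial coefficient and the saturated log-likelihood appear in both models' deviances, so they drop out of the difference.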
Many models expand on the basic linear regression model
Evaluation / Diagnostics for Logistic Regression, 2018-UMass-Logistic Regression.pdf
a. Assessment of Linearity (Assumption, Box-Tidwell Test)
b. Hosmer-Lemeshow Goodness of Fit Test (overall model quality)
c. The Linktest (model specification)
d. The Classification Table
e. The ROC Curve, Confusion Matrix, etc.
f. Pregibon Delta Beta Statistic (Cook's distance)
Considering multicollinearity is important in regression analysis because, in the extreme, it directly bears on whether or not your coefficients are uniquely identified in the data. In less severe cases, it can still mess with your coefficient estimates; small changes in the data used for estimation may cause wild swings in the estimated coefficients. This can be problematic from an inferential standpoint: if two variables are highly correlated, increases in one may be offset by decreases in the other, so the combined effect is to negate each other. With more than two variables, the effect can be even more subtle, but if the predictions are stable, that is often enough for machine learning applications.
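A minimal sketch of the "wild swings" point, on made-up data: two nearly identical predictors, a no-intercept least-squares fit via the 2x2 normal equations, and a second fit after nudging a single response value by 0.1. The function name `ols_two_predictors` and all numbers are illustrative assumptions, not taken from any source.

```python
def ols_two_predictors(x1, x2, y):
    """Least-squares fit of y = b1*x1 + b2*x2 (no intercept) via the 2x2 normal equations."""
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    s1y = sum(a * t for a, t in zip(x1, y))
    s2y = sum(b * t for b, t in zip(x2, y))
    det = s11 * s22 - s12 * s12  # close to zero when x1 and x2 are nearly collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2

def predict(fit, a, b):
    return fit[0] * a + fit[1] * b

# x2 is x1 plus a tiny wiggle, so the two predictors are almost collinear.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.01, 1.99, 3.01, 3.99, 5.00]
y_a = [2.0, 4.1, 5.9, 8.0, 10.1]
y_b = [2.0, 4.1, 5.9, 8.1, 10.1]  # same data with one response nudged by 0.1

fit_a = ols_two_predictors(x1, x2, y_a)
fit_b = ols_two_predictors(x1, x2, y_b)
pred_a = predict(fit_a, 3.0, 3.0)
pred_b = predict(fit_b, 3.0, 3.0)

print("coefficients:", fit_a, fit_b)
print("predictions at (3, 3):", pred_a, pred_b)
```

On this toy data the individual coefficients shift substantially between the two fits, while their sum, and therefore the prediction at a point where the predictors agree, changes only slightly.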
Consider why we regularize in a regression context: we need to constrain the model from being too flexible. Applying the correct amount of regularization will slightly increase the bias in exchange for a larger reduction in variance. The classic example of this is adding polynomial terms and interaction effects to a regression: in the degenerate case, the prediction equation will interpolate the data points, but will probably be terrible at predicting the values of unseen data points. Shrinking the coefficients will likely reduce some of them or eliminate them entirely, and improve generalization.
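The degenerate case can be sketched on made-up data: a degree-4 polynomial through five points interpolates them exactly, and adding a ridge penalty shrinks the coefficients. The helper names, the data and the penalty value 0.1 are illustrative assumptions. For brevity the intercept is penalized as well, which a careful implementation would usually avoid.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ridge_poly_fit(xs, ys, degree, lam):
    """Minimize ||y - X b||^2 + lam * ||b||^2 for polynomial features 1, x, ..., x^degree."""
    p = degree + 1
    X = [[x ** k for k in range(p)] for x in xs]
    XtX = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(p)]
    return solve(XtX, Xty)

def predict(coefs, x):
    return sum(c * x ** k for k, c in enumerate(coefs))

# Made-up, roughly linear data with noise.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [-1.1, -0.3, 0.2, 0.4, 1.2]

unpenalized = ridge_poly_fit(xs, ys, degree=4, lam=0.0)  # 5 coefficients, 5 points
shrunk = ridge_poly_fit(xs, ys, degree=4, lam=0.1)

print("max training error, lam=0:",
      max(abs(predict(unpenalized, x) - y) for x, y in zip(xs, ys)))
print("squared coefficient norm:",
      sum(c * c for c in unpenalized), "->", sum(c * c for c in shrunk))
```

Whether the shrunken fit actually predicts unseen points better depends on the data and on the penalty, which would normally be chosen by cross-validation. Note that a ridge penalty shrinks coefficients toward zero but does not set them exactly to zero; eliminating terms entirely is a property of the LASSO penalty.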
A random forest, however, could be seen to have a regularization parameter through the number of variables sampled at each split: you get better splits the larger the mtry (more features to choose from; some of them are better than others), but that also makes each tree more highly correlated with every other tree, somewhat mitigating the diversifying effect of estimating multiple trees in the first place. This dilemma compels one to find the right balance, usually achieved using cross-validation. Importantly, and in contrast to a regression analysis, the predictions of the random forest model are not harmed by highly collinear variables: even if two of the variables provide the same child node purity, you can just pick one.
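The cost of correlated trees can be made concrete with a standard identity: the average of B identically distributed predictions, each with variance sigma^2 and pairwise correlation rho, has variance rho*sigma^2 + (1 - rho)*sigma^2/B. The sketch below only evaluates that formula. The link from a larger mtry to a larger rho is the qualitative claim from the paragraph above, not something the code estimates, and the function name is a made-up one.

```python
def ensemble_variance(rho, sigma2, n_trees):
    """Variance of the mean of n_trees equally correlated predictions.

    rho     : pairwise correlation between trees (assumed equal for all pairs)
    sigma2  : variance of a single tree's prediction
    n_trees : number of trees averaged
    """
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees

for rho in (0.0, 0.3, 0.9):
    print(rho, ensemble_variance(rho, sigma2=1.0, n_trees=500))
```

With uncorrelated trees the variance falls like 1/B, but once rho is positive the first term puts a floor under it that adding more trees cannot remove. That is why stronger individual splits from a larger mtry can still give a worse ensemble.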
Likewise, for something like an SVM, you can include more predictors than observations because the kernel trick lets you operate solely on the inner products of the feature vectors. Having more features than observations would be a problem in regressions, but the kernel trick means we only estimate a coefficient for each exemplar, while the regularization parameter 𝐶 reduces the flexibility of the solution. That is decidedly a good thing, since estimating 𝑁 parameters for 𝑁 observations in an unrestricted way will always produce a perfect model on the training data. So we come full circle, back to the ridge/LASSO/elastic net regression scenario where model flexibility is constrained as a check against an overly optimistic model. A review of the KKT conditions of the SVM problem reveals that the SVM solution is unique, so we don't have to worry about the identification problems which arose in the regression case.
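The "one coefficient per exemplar" idea can be sketched with kernel ridge regression, swapped in here for the SVM because it has a closed-form solution; it is not the SVM optimization itself. The penalty `lam` plays a role loosely comparable to 1/𝐶. The RBF kernel, the data and all function names are illustrative assumptions.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def rbf(a, b, gamma=1.0):
    """Gaussian kernel: an inner product in an implicit feature space."""
    return math.exp(-gamma * (a - b) ** 2)

def kernel_ridge_fit(xs, ys, lam):
    """Return one coefficient per training point: alpha = (K + lam*I)^-1 y."""
    n = len(xs)
    K = [[rbf(xs[i], xs[j]) + (lam if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve(K, ys)

def kernel_predict(xs_train, alpha, x):
    return sum(a * rbf(xt, x) for a, xt in zip(alpha, xs_train))

def max_train_error(xs, ys, lam):
    alpha = kernel_ridge_fit(xs, ys, lam)
    return max(abs(kernel_predict(xs, alpha, x) - y) for x, y in zip(xs, ys))

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 0.0, 1.0, 0.0]  # deliberately jagged targets

alpha = kernel_ridge_fit(xs, ys, lam=1.0)
print("coefficients:", len(alpha), "for", len(xs), "observations")
print("training error, almost no penalty:", max_train_error(xs, ys, 1e-8))
print("training error, lam = 1:", max_train_error(xs, ys, 1.0))
```

With almost no penalty, the 𝑁 coefficients reproduce the 𝑁 training targets essentially exactly, which says nothing about unseen data. A larger penalty gives up that perfect training fit.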
Finally, consider the actual impact of multicollinearity. It doesn't change the predictive power of the model (at least, on the training data) but it does screw with our coefficient estimates. In most ML applications, we don't care about coefficients themselves, just the loss of our model predictions, so in that sense, checking VIF doesn't actually answer a consequential question. (But if a slight change in the data causes a huge fluctuation in coefficients [a classic symptom of multicollinearity], it may also change predictions, in which case we do care -- but all of this [we hope!] is characterized when we perform cross-validation, which is a part of the modeling process anyway.) A regression is more easily interpreted, but interpretation might not be the most important goal for some tasks.
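For reference, a minimal sketch of what a VIF check computes in the simplest case. With exactly two predictors, the VIF of either one is 1 / (1 - r^2), where r is their Pearson correlation. The function name and data are made up, and the general case (regressing each predictor on all the others) is not covered here.

```python
def vif_two_predictors(x1, x2):
    """Variance inflation factor when the model has exactly two predictors."""
    m1 = sum(x1) / len(x1)
    m2 = sum(x2) / len(x2)
    sxx = sum((a - m1) ** 2 for a in x1)
    syy = sum((b - m2) ** 2 for b in x2)
    sxy = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    r_squared = sxy * sxy / (sxx * syy)
    return 1.0 / (1.0 - r_squared)

nearly_collinear = vif_two_predictors([1.0, 2.0, 3.0, 4.0, 5.0],
                                      [1.01, 1.99, 3.01, 3.99, 5.00])
uncorrelated = vif_two_predictors([-1.0, 0.0, 1.0], [1.0, -2.0, 1.0])

print("nearly collinear:", nearly_collinear)
print("uncorrelated:", uncorrelated)
```

A large VIF flags unstable coefficient estimates, but it says nothing directly about held-out loss, which is the quantity cross-validation measures.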
Linear regression in R Formula syntax
10 Assumptions for Linear regression Logistic regression