twang15 / PlatoAcademy


Linear models #19

Open twang15 opened 3 years ago

twang15 commented 3 years ago

Linear regression in R: formula syntax

  1. The : operator specifies an interaction between two terms, while * gives main effects plus the interaction (A*B = A + B + A:B). The / operator is for nesting: it adds the main effect of the numerator plus its interaction with every term in the denominator (e.g. A/(B+C) = A + A:B + A:C). The | operator means roughly "grouped by"; in lme4 syntax, (1|station) is a random intercept grouped by station. That's how you would do nesting (see the sketch below).
  2. (1|station/tow) would expand to (1|station) + (1|station:tow) (a random intercept for station plus one for tow nested within station).
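
A minimal sketch of these expansions (the names y, x, A, B, C, station, tow, and dat are placeholders; the random-effects lines assume the lme4 package):

```r
# Fixed-effects operators: inspect how R expands a formula
attr(terms(y ~ A * B),       "term.labels")  # A, B, and A:B
attr(terms(y ~ A / (B + C)), "term.labels")  # A plus the nested terms A:B and A:C

# Random-effects syntax (lme4): random intercept grouped by station
# lme4::lmer(y ~ x + (1 | station), data = dat)

# Nested random effects -- these two calls specify the same model:
# lme4::lmer(y ~ x + (1 | station/tow),                 data = dat)
# lme4::lmer(y ~ x + (1 | station) + (1 | station:tow), data = dat)
```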

10 Assumptions for Linear Regression and Logistic Regression

twang15 commented 3 years ago

Linear Mixed Effect Model

  1. https://www.linkedin.com/pulse/implementing-mixed-effects-models-r-python-régis-nisengwe/
  2. Linear Mixed Effect Model by R
  3. http://www.mat.ufrgs.br/~giacomo/Softwares/R/Crawley/Crawley%20-%20The%20Book%20R/ch19.pdf
  4. https://lib.dr.iastate.edu/cgi/viewcontent.cgi?article=1011&context=language_conf
  5. http://www.rensenieuwenhuis.nl/r-sessions-16-multilevel-model-specification-lme4/
  6. https://m-clark.github.io/mixed-models-with-R/random_intercepts.html#fn8
  7. INTRODUCTION TO LINEAR MIXED MODELS
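
As a companion to these references, a minimal random-intercept sketch (assuming the lme4 package is installed; it uses the sleepstudy data that ships with lme4):

```r
library(lme4)

# sleepstudy: reaction times over days of sleep deprivation, 18 subjects
data(sleepstudy)

# Random intercept per subject: each subject gets their own baseline reaction time
m1 <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

# Random intercept and random slope for Days per subject
m2 <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

summary(m1)
anova(m1, m2)  # likelihood-ratio comparison of the two random-effects structures
```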
twang15 commented 3 years ago
  1. Interaction term, interaction plot and their interpretation
  2. Multinomial logistic regression vs. Ordinal logistic regression
  3. Survival analysis, David Caughlin
  4. R tutorial
  5. As mathematical representations, statistical models and machine learning algorithms are often indistinguishable. In practice, they tend to be used differently. Machine learning focuses on data-driven prediction, whereas statistical modeling focuses on theory-driven knowledge discovery.
    | Statistical modeling | Machine learning |
    | --- | --- |
    | Theory driven | Data driven |
    | Explanation | Prediction |
    | Researcher-curated data | Machine-generated data |
    | Evaluation via goodness of fit | Evaluation via prediction accuracy |
  6. Statistical Modeling vs. Machine learning
  7. Logistic regression modeling in R
twang15 commented 3 years ago

Notes on Statistical Modeling (causal modeling) vs. Machine learning (predictive modeling)

  1. VIF to check multicollinearity for both predictive models and causal models (see the VIF sketch after this list).
  2. Causal models focus on coefficients' signs and magnitudes, while predictive models focus on prediction accuracy.
  3. It’s certainly true that with large samples, even small effect sizes can have low p-values.
  4. I definitely think that issues regarding overfitting and cross-validation should be more widely addressed in causal modeling. Why aren't they? Here are a couple of possible reasons: (1) causal modelers typically work with smaller sample sizes and are therefore reluctant to split up their data sets; (2) causal modelers don't actually have to address the issue of how well their models can perform in a new setting.
  5. Longitudinal vs. cross-sectional data:
    • Longitudinal data are desirable for making causal inferences but they are no panacea.
    • There are situations in which cross-sectional data can be adequate. If you know from theory or just common sense that Y cannot affect X, then cross-sectional data may be adequate.
  6. Linktest: goodness of link
    • An R^2 of 0.2 with only 2% of the cases having events is pretty good. But the linktest suggests that you might do a little bit better with a different link function, or with some transformation of the predictors.
  7. Exogeneity
    • Strictly exogenous means the error term is unrelated to any instance of the variable X: past, present, and future. X is completely unaffected by Y.
    • Sequentially exogenous means the error term is unrelated to past instances of the variable X. A sequentially exogenous variable is also known as a predetermined variable. X is not affected by past instances of Y, but future instances of X may be affected by current or future instances of Y.
  8. In principle, models that capture the correct causal relationship should be the most generalizable to new settings. I am not aware of any work on this, but that doesn't mean there isn't something out there. It would be difficult to research this in any general way, however, because every substantive application will be different.
  9. Regularization, e.g. ridge regression, is needed for both, but for different reasons. In inference we need regularization to temper the volatility of estimates when the data are multicollinear, and in prediction we need it to temper overfitting. The computation of the hyperparameter(s) is also different: in inference, for example, sometimes the L-curve or the trace of the coefficients is used, while in prediction it is chosen by cross-validation.
  10. Confounders: In logistic regression, there’s no operational distinction between causal variables and confounders. They’re all just predictor variables in the equation.
  11. moderation vs. mediation: if you introduce the interaction of X with gender, you see strong evidence for the separate effects
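
The VIF check from item 1, as a minimal sketch (it assumes the car package is installed; mtcars and the chosen predictors are just for illustration):

```r
library(car)  # for vif()

# Fit a linear model with predictors that are known to be correlated in mtcars
fit <- lm(mpg ~ wt + disp + hp, data = mtcars)

# Variance inflation factors: values well above ~5-10 are a common
# rule-of-thumb signal of problematic multicollinearity
vif(fit)

# The same call works on a fitted glm (e.g. a logistic regression),
# so the check applies to both causal and predictive specifications.
```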
twang15 commented 3 years ago

In R, there are three methods to format the input data for a logistic regression using the glm function (illustrated in the sketch below):

  1. Ungrouped data: a 0/1 vector (or two-level factor) as the response, one row per observation.
  2. Grouped data: a two-column matrix of successes and failures as the response, built with cbind().
  3. Grouped data: the proportion of successes as the response, with the number of trials supplied via the weights argument.

There's no statistical reason to prefer one over the others, besides conceptual clarity. Although the reported deviance values are different, these differences are completely due to the saturated model. So any comparison using relative deviance between models is unaffected, since the saturated-model log-likelihood cancels.
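
A minimal sketch of the three formats on made-up data (the toy data and column names treat, success, failure, trials are placeholders):

```r
set.seed(1)
d <- data.frame(
  treat = rep(c("A", "B"), each = 50),
  y     = rbinom(100, size = 1, prob = rep(c(0.3, 0.6), each = 50))
)

# 1. Ungrouped: one 0/1 outcome per row
fit1 <- glm(y ~ treat, family = binomial, data = d)

# Collapse to grouped data (one row per treatment level)
agg <- data.frame(
  treat   = c("A", "B"),
  success = as.vector(tapply(d$y, d$treat, sum)),
  trials  = as.vector(tapply(d$y, d$treat, length))
)
agg$failure <- agg$trials - agg$success

# 2. Two-column response: cbind(successes, failures)
fit2 <- glm(cbind(success, failure) ~ treat, family = binomial, data = agg)

# 3. Proportion of successes, with weights = number of trials
fit3 <- glm(success / trials ~ treat, weights = trials, family = binomial, data = agg)

# Identical coefficients; the deviances of fit2/fit3 differ from fit1 only
# through the saturated-model term, so deviance differences between nested
# models are unaffected.
cbind(ungrouped = coef(fit1), grouped = coef(fit2), proportion = coef(fit3))
```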

twang15 commented 3 years ago

Many models expand on the basic linear regression model

twang15 commented 3 years ago

Evaluation / Diagnostics for Logistic Regression, 2018-UMass-Logistic Regression.pdf

  a. Assessment of Linearity (assumption; Box-Tidwell test)
  b. Hosmer-Lemeshow Goodness-of-Fit Test (overall model quality)
  c. The Linktest (model specification)
  d. The Classification Table
  e. The ROC Curve, Confusion Matrix, etc.
  f. Pregibon Delta-Beta Statistic (analogous to Cook's distance)
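
A minimal sketch of a few of these diagnostics in R, assuming the ResourceSelection and pROC packages are installed (the mtcars model is just a stand-in):

```r
library(ResourceSelection)  # hoslem.test()
library(pROC)               # roc(), auc()

fit <- glm(am ~ wt, family = binomial, data = mtcars)

# b. Hosmer-Lemeshow goodness-of-fit test
hoslem.test(fit$y, fitted(fit), g = 10)

# c. A linktest-style specification check: refit on the linear predictor
#    and its square; a significant squared term suggests misspecification
lp <- predict(fit, type = "link")
summary(glm(fit$y ~ lp + I(lp^2), family = binomial))

# d. Classification table at a 0.5 cutoff
table(observed = fit$y, predicted = as.integer(fitted(fit) > 0.5))

# e. ROC curve and AUC
roc_obj <- roc(fit$y, fitted(fit))
auc(roc_obj)

# f. Influence diagnostics in the spirit of Pregibon's delta-beta
head(cooks.distance(fit))
```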

twang15 commented 3 years ago

Logistic Regression: 10 Worst Pitfalls and Mistakes

twang15 commented 3 years ago

Multicollinearity, regularization, Lasso/Ridge, CV, causal inference, predictive machine learning

Considering multicollinearity is important in regression analysis because, in the extreme case, it directly bears on whether or not your coefficients are uniquely identified in the data. In less severe cases, it can still mess with your coefficient estimates: small changes in the data used for estimation may cause wild swings in estimated coefficients. These swings can be problematic from an inferential standpoint: if two variables are highly correlated, increases in one may be offset by decreases in the other, so the combined effect is to negate each other. With more than two variables, the effect can be even more subtle, but if the predictions are stable, that is often enough for machine learning applications.
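A toy simulation of that instability (all names and numbers are made up for illustration):

```r
# x1 and x2 are nearly collinear, so small changes in the sample cause large
# swings in their individual coefficients, while the fitted predictions
# (roughly, the sum of the two slopes) stay essentially the same.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)   # almost a copy of x1
y  <- x1 + x2 + rnorm(n)

coefs <- replicate(5, {
  idx <- sample(n, n, replace = TRUE)  # resample the data slightly
  coef(lm(y[idx] ~ x1[idx] + x2[idx]))
})
round(coefs, 2)  # individual slopes swing wildly; their sum stays near 2
```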

Consider why we regularize in a regression context: we need to keep the model from being too flexible. Applying the correct amount of regularization slightly increases the bias in exchange for a larger reduction in variance. The classic example of this is adding polynomial terms and interaction effects to a regression: in the degenerate case, the prediction equation will interpolate the data points, but it will probably be terrible when attempting to predict the values of unseen data points. Shrinking those coefficients will likely minimize or entirely eliminate some of them and improve generalization.
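A minimal sketch of ridge and lasso with cross-validated regularization, assuming the glmnet package is installed (the simulated data are just for illustration):

```r
library(glmnet)

set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
# Only the first three predictors matter; the rest are noise
y <- x[, 1] - 2 * x[, 2] + 0.5 * x[, 3] + rnorm(n)

# Cross-validation chooses the regularization strength lambda
cv_ridge <- cv.glmnet(x, y, alpha = 0)  # ridge: shrinks all coefficients
cv_lasso <- cv.glmnet(x, y, alpha = 1)  # lasso: can set some exactly to zero

cv_ridge$lambda.min
coef(cv_lasso, s = "lambda.min")  # many noise coefficients are exactly 0
```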

A random forest, however, could be seen to have a regularization parameter through the number of variables sampled at each split: you get better splits the larger the mtry (more features to choose from; some of them are better than others), but that also makes each tree more highly correlated with each other tree, somewhat mitigating the diversifying effect of estimating multiple trees in the first place. This dilemma compels one to find the right balance, usually achieved using cross-validation. Importantly, and in contrast to a regression analysis, the predictions of the random forest model are not harmed by highly collinear variables: even if two of the variables provide the same child node purity, you can just pick one.
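A rough illustration of the mtry trade-off, assuming the randomForest package is installed (the simulated data and the grid of mtry values are arbitrary; out-of-bag error stands in for full cross-validation here):

```r
library(randomForest)

set.seed(1)
n <- 300; p <- 10
x <- data.frame(matrix(rnorm(n * p), n, p))
y <- x[, 1] + x[, 2] + rnorm(n)

# Compare out-of-bag error across values of mtry (features tried per split)
oob <- sapply(c(1, 3, 5, 10), function(m) {
  rf <- randomForest(x, y, mtry = m, ntree = 500)
  tail(rf$mse, 1)  # OOB mean squared error after all trees
})
oob  # in practice, mtry is usually tuned with OOB error or cross-validation
```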

Likewise, for something like an SVM, you can include more predictors than observations because the kernel trick lets you operate solely on the inner products of those feature vectors. Having more features than observations would be a problem in regression, but the kernel trick means we only estimate a coefficient for each exemplar, while the regularization parameter C reduces the flexibility of the solution -- which is decidedly a good thing, since estimating N parameters for N observations in an unrestricted way will always produce a perfect fit on the training data -- and we come full circle, back to the ridge/LASSO/elastic net regression scenario where we have the model flexibility constrained as a check against an overly optimistic model. A review of the KKT conditions of the SVM problem reveals that the SVM solution is unique, so we don't have to worry about the identification problems that arose in the regression case.
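A minimal sketch with more predictors than observations, assuming the e1071 package is installed (the data and parameter values are arbitrary):

```r
library(e1071)

set.seed(1)
n <- 40; p <- 200  # more predictors than observations
x <- matrix(rnorm(n * p), n, p)
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n, sd = 0.5) > 0, "pos", "neg"))

# The kernel trick works on inner products of the rows, so p >> n is fine;
# `cost` is the C parameter that limits the flexibility of the solution
fit <- svm(x, y, kernel = "radial", cost = 1)

# In-sample accuracy; a held-out set or cross-validation would be needed to
# gauge generalization (cost is typically tuned that way)
mean(predict(fit, x) == y)
```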

Finally, consider the actual impact of multicollinearity. It doesn't change the predictive power of the model (at least, on the training data) but it does screw with our coefficient estimates. In most ML applications, we don't care about coefficients themselves, just the loss of our model predictions, so in that sense, checking VIF doesn't actually answer a consequential question. (But if a slight change in the data causes a huge fluctuation in coefficients [a classic symptom of multicollinearity], it may also change predictions, in which case we do care -- but all of this [we hope!] is characterized when we perform cross-validation, which is a part of the modeling process anyway.) A regression is more easily interpreted, but interpretation might not be the most important goal for some tasks.

twang15 commented 3 years ago

How to integrate normalization, feature selection, hyper-parameter tuning with pipelined Grid Search?
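
One possible way to approach this in R is caret, which chains preprocessing, hyper-parameter tuning, and resampling in a single train() call; this is only a sketch, and the model choice (glmnet, whose penalty also performs embedded feature selection), the tuning grid, and the simulated data are placeholders. It assumes the caret and glmnet packages are installed.

```r
library(caret)

set.seed(1)
n <- 200; p <- 15
x <- data.frame(matrix(rnorm(n * p), n, p))
y <- factor(ifelse(x[, 1] - x[, 2] + rnorm(n) > 0, "yes", "no"))

# Preprocessing (centering/scaling) is estimated inside each resampling fold,
# and the tuning grid is searched with cross-validation in the same pipeline.
ctrl <- trainControl(method = "cv", number = 5)
grid <- expand.grid(alpha = c(0, 0.5, 1), lambda = 10^seq(-3, 0, length = 10))

fit <- train(
  x, y,
  method     = "glmnet",
  preProcess = c("center", "scale"),
  trControl  = ctrl,
  tuneGrid   = grid,
  metric     = "Accuracy"
)

fit$bestTune  # grid-search winner, chosen by cross-validated accuracy
```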