Linear models - Githubissues

twang15 commented 3 years ago

Linear regression in R Formula syntax

The : operator specifies an interaction between two terms, while * expands to main effects plus the interaction (A*B = A + B + A:B). The / operator is for nesting: it keeps the numerator and interacts it with every term in the denominator (e.g. A/(B+C) = A + A:B + A:C). The | means something like "grouped by", so 1|station is an intercept grouped by station, and in parentheses it becomes a random effect: (1|station). That's how you would do nesting.
(1|station/tow) expands to (1|station)+(1|station:tow) (a random intercept for station plus one for each tow within station)
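
As a rough illustration of these expansion rules (in R, a*b expands to a + b + a:b and a/b expands to a + a:b), here is a minimal Python sketch. The helper names `expand_star`, `expand_slash` and `expand_random_nested` are hypothetical, and this is only a toy that mimics the expansions for simple term names, not how R actually parses formulas.

```python
def expand_star(a, b):
    """a*b -> main effects plus their interaction: a + b + a:b."""
    return [a, b, f"{a}:{b}"]


def expand_slash(a, inner):
    """a/(b + c) -> a + a:b + a:c (every inner term nested within a)."""
    return [a] + [f"{a}:{term}" for term in inner]


def expand_random_nested(outer, inner):
    """(1|outer/inner) -> (1|outer) + (1|outer:inner), lme4-style nesting."""
    return [f"(1|{term})" for term in expand_slash(outer, [inner])]


print(" + ".join(expand_star("A", "B")))                    # A + B + A:B
print(" + ".join(expand_slash("A", ["B", "C"])))            # A + A:B + A:C
print(" + ".join(expand_random_nested("station", "tow")))   # (1|station) + (1|station:tow)
```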

10 Assumptions for Linear regression Logistic regression

twang15 commented 3 years ago

Linear Mixed Effect Model

twang15 commented 3 years ago

Interaction term, interaction plot and their interpretation
Multinomial logistic regression vs. Ordinal logistic regression
Survival analysis, David Caughlin
R tutorial
- Practical R tutorial, David Caughlin
- PhD+: Data Science Essential in R
As mathematical representations, statistical models and machine learning algorithms are often indistinguishable. In practice, they tend to be used differently. Machine learning focuses on data-driven prediction, whereas statistical modeling focuses on theory-driven knowledge discovery.
- Statistical modeling | Machine learning
- Theory driven | Data driven
- Explanation | Prediction
- Researcher-curated data | Machine-generated data
- Evaluation via goodness of fit | Evaluation via prediction accuracy
Statistical Modeling vs. Machine learning
- Prediction vs. Causation in Regression Analysis _ Statistical Horizons.pdf
Logistic regression modeling in R
- Model selection with step-wise method
- add1(), drop1(), step()
- https://uvastatlab.github.io/phdplus/linearmodel.html

twang15 commented 3 years ago

Notes on Statistical Modeling (causal modeling) vs. Machine learning (predictive modeling)

VIF (variance inflation factor) to check multicollinearity for both predictive models and causal models
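
As a small sketch of what VIF measures: for predictor j, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. With only two predictors, R_j^2 is just their squared correlation, which is the simplifying assumption made below. The data and function names are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with two predictors, R^2 is the squared correlation."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2_collinear = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]    # almost a copy of x1
x2_unrelated = [2.0, -1.0, 0.5, 0.5, -1.0, 2.0]  # uncorrelated with x1

print(vif_two_predictors(x1, x2_collinear))  # very large: strong collinearity
print(vif_two_predictors(x1, x2_unrelated))  # 1.0: no variance inflation
```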
Causal models focus on the signs and magnitudes of coefficients, while predictive models focus on prediction accuracy.
It’s certainly true that with large samples, even small effect sizes can have low p-values.
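
To make the sample-size point concrete, here is a minimal sketch using a one-sample z-test with a known unit standard deviation, so that z = d * sqrt(n) for a standardized effect size d. The effect size of 0.02 and the sample sizes are arbitrary choices for illustration.

```python
import math


def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def z_test_p(effect_size, n):
    """One-sample z-test with known unit sd: z = effect_size * sqrt(n)."""
    return two_sided_p_from_z(effect_size * math.sqrt(n))


# The same tiny effect goes from clearly non-significant to overwhelmingly
# significant purely because n grows.
for n in (100, 10_000, 1_000_000):
    print(n, z_test_p(0.02, n))
```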
I definitely think that issues regarding overfitting and cross-validation should be more widely addressed in causal modeling. Why aren’t they? Here are a couple of possible reasons: 1. Causal modelers typically work with smaller sample sizes and are, therefore, reluctant to split up their data sets. 2. Causal modelers don’t actually have to address the issue of how well their models can perform in a new setting.
Longitudinal vs. cross-sectional data:
- Longitudinal data are desirable for making causal inferences but they are no panacea.
- There are situations in which cross-sectional data can be adequate. If you know from theory or just common sense that Y cannot affect X, then cross-sectional data may be adequate.
Link-test : goodness-of-link
- An R^2 of 0.2 with only 2% of the cases having events is pretty good. But the linktest suggests that you might do a little bit better with a different link function, or with some transformation of the predictors.
Exogeneity
- Strictly exogenous means the error term is unrelated to any instance of the variable X: past, present, and future. X is completely unaffected by Y.
- Sequentially exogenous means the error term is unrelated to past instances of the variable X. A sequentially exogenous variable is also known as a predetermined variable. X is not affected by past instances of Y, but future instances of X may be affected by current or future instances of Y.
In principle, models that capture the correct causal relationship should be the most generalizable to new settings. I am not aware of any work on this, but that doesn’t mean there isn’t something out there. It would be difficult to research this in any general way, however, because every substantive application will be different.
Regularization (e.g., ridge regression) is needed for both, but for different reasons. In inference we need regularization to temper the volatility of estimates when the data are multicollinear, and in prediction we need it to temper overfitting. The way the hyperparameter(s) are chosen is also different. In inference, for example, sometimes the L-curve or the trace of the coefficients is used, whereas for prediction it is cross-validation.
Confounders: In logistic regression, there’s no operational distinction between causal variables and confounders. They’re all just predictor variables in the equation.
Moderation vs. mediation: if you introduce the interaction of X with gender, you see strong evidence for the separate effects

twang15 commented 3 years ago

In R, there are three ways to format the input data for a logistic regression using the glm function:

1. Data can be in a "binary" format, one row per observation (e.g., y = 0 or 1 for each observation);
2. Data can be in the "Wilkinson-Rogers" format (e.g., y = cbind(success, failure)), with each row representing one treatment; or
3. Data can be in a weighted format, one row per treatment (e.g., y = 0.3, weights = 10), where y is the observed proportion of successes and weights is the number of trials.

There's no statistical reason to prefer one to the other, besides conceptual clarity. Although the reported deviance values are different, these differences are completely due to the saturated model. So any comparison using relative deviance between models is unaffected, since the saturated model log-likelihood cancels.
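
The cancellation can be checked by hand. The sketch below uses made-up counts for three treatments and compares two nested models whose maximum-likelihood fits are known in closed form (a single common probability versus treatment 1 against treatments 2 and 3 pooled). It computes deviances directly from the log-likelihoods rather than calling glm, so it only illustrates the arithmetic.

```python
import math

# Hypothetical data: (successes, failures) for three treatments.
groups = [(3, 7), (6, 4), (8, 2)]


def loglik_binary(probs):
    """Bernoulli log-likelihood: one 0/1 row per observation."""
    return sum(s * math.log(p) + f * math.log(1 - p)
               for (s, f), p in zip(groups, probs))


def loglik_grouped(probs):
    """Binomial log-likelihood: one (success, failure) row per treatment."""
    constant = sum(math.log(math.comb(s + f, s)) for s, f in groups)
    return constant + loglik_binary(probs)


def deviance_binary(probs):
    # The saturated model fits every 0/1 outcome exactly: log-likelihood 0.
    return 2 * (0.0 - loglik_binary(probs))


def deviance_grouped(probs):
    # The saturated model fits each treatment's observed proportion.
    saturated = [s / (s + f) for s, f in groups]
    return 2 * (loglik_grouped(saturated) - loglik_grouped(probs))


total_s = sum(s for s, _ in groups)
total_n = sum(s + f for s, f in groups)
null_model = [total_s / total_n] * 3  # one common probability
two_level = [0.3, 0.7, 0.7]           # treatment 1 vs. treatments 2 and 3 pooled

# The reported deviances differ between formats ...
print(deviance_binary(null_model), deviance_grouped(null_model))
# ... but the difference between the two models is the same in both.
print(deviance_binary(null_model) - deviance_binary(two_level))
print(deviance_grouped(null_model) - deviance_grouped(two_level))
```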

twang15 commented 3 years ago

[Logistic Regression: Hosmer-Lemeshow goodness of fit test for logistic regression](https://thestatsgeek.com/2014/02/16/the-hosmer-lemeshow-goodness-of-fit-test-for-logistic-regression/)
Why I Don’t Trust the Hosmer-Lemeshow Test for Logistic Regression
1. Model specification: in both linear and logistic regression, it’s possible to have a low R^2 and still have a model that is correctly specified in every respect. And vice versa, you can have a very high R^2 and yet have a model that is grossly inconsistent with the data.
2. Model specification covers:
- linear vs. non-linear effects
- link function (link test)
- interactions
- features
3. One would hope that adding a statistically significant interaction or non-linearity to a model would improve its fit, as judged by the HL test. But often that doesn’t happen. The reverse can also happen. Quite frequently, adding a non-significant interaction or non-linearity to a model will substantially improve the HL fit.
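
For reference, the HL statistic itself is simple to compute: sort cases by predicted probability, split them into groups, and sum (O - E)^2 / (E * (1 - E/n)) over the groups, where O is the observed number of events, E the sum of predicted probabilities, and n the group size. The sketch below is a bare-bones version with made-up data; it returns only the statistic (compared against a chi-square with groups - 2 degrees of freedom) and is not a substitute for a tested implementation.

```python
def hosmer_lemeshow_statistic(y, p, n_groups=10):
    """Hosmer-Lemeshow chi-square statistic from outcomes y and fitted probabilities p."""
    pairs = sorted(zip(p, y))  # order cases by predicted probability
    n = len(pairs)
    stat = 0.0
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not chunk:
            continue
        size = len(chunk)
        observed = sum(yi for _, yi in chunk)  # observed events in the group
        expected = sum(pi for pi, _ in chunk)  # expected events in the group
        stat += (observed - expected) ** 2 / (expected * (1 - expected / size))
    return stat


# Made-up outcomes and fitted probabilities, split into two groups of four.
y = [0, 0, 1, 0, 1, 1, 0, 1]
p = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
print(hosmer_lemeshow_statistic(y, p, n_groups=2))
```

Because the result depends on how many groups are used and where the cut points fall, rerunning it with a different n_groups on real data is a quick way to see how unstable the test can be.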

twang15 commented 3 years ago

Many models expand on the basic linear regression model

- Generalized linear models (e.g., logit, Poisson, multinomial, etc.)
- Mixed effects models (random coefficients, hierarchical models)
- Penalized regression (shrinkage or regularization, e.g., Ridge, Lasso, ElasticNet)
- and more!

twang15 commented 3 years ago

Evaluation / Diagnostics for Logistic Regression, 2018-UMass-Logistic Regression.pdf
a. Assessment of Linearity (Assumption, Box-Tidwell Test)
b. Hosmer-Lemeshow Goodness of Fit Test (overall model quality)
c. The Linktest (model specification)
d. The Classification Table
e. The ROC Curve, Confusion Matrix, etc.
f. Pregibon Delta Beta Statistic (Cook's distance)
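
Two of these diagnostics, the classification table and the area under the ROC curve, are easy to sketch without any library. The function names and the toy scores below are assumptions for illustration; the AUC is computed as the fraction of (event, non-event) pairs in which the event gets the higher score, with ties counted as one half.

```python
def confusion_matrix(y_true, scores, threshold=0.5):
    """Classification table at a given probability threshold."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for yt, s in zip(y_true, scores):
        predicted = 1 if s > threshold else 0
        if predicted == 1:
            counts["TP" if yt == 1 else "FP"] += 1
        else:
            counts["TN" if yt == 0 else "FN"] += 1
    return counts


def roc_auc(y_true, scores):
    """AUC as the probability that a random event outranks a random non-event."""
    pos = [s for yt, s in zip(y_true, scores) if yt == 1]
    neg = [s for yt, s in zip(y_true, scores) if yt == 0]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(confusion_matrix(y_true, scores))  # {'TP': 1, 'FP': 0, 'TN': 2, 'FN': 1}
print(roc_auc(y_true, scores))           # 0.75
```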

twang15 commented 3 years ago

Logistic Regression: 10 Worst Pitfalls and Mistakes

twang15 commented 3 years ago

Logistic Regression: Scikit Learn vs Statsmodels
- Since then (2016), scikit-learn has indeed added a way to switch regularization off, by setting penalty='none' (spelled penalty=None in newer releases), https://stackoverflow.com/questions/62005911/coefficients-for-logistic-regression-scikit-learn-vs-statsmodels
Model interpretability: Marginal Effects
Model interpretability: Shapley value
Ensemble learning
Comparison between Sklearn LogisticRegression and Statsmodels Logit
Logistic Regression Scikit-learn vs Statsmodels _ Finxter.pdf
https://stats.stackexchange.com/questions/203740/logistic-regression-scikit-learn-vs-statsmodels
- Add an intercept in statsmodels with sm.Logit(y, sm.add_constant(X)), OR disable the sklearn intercept with LogisticRegression(C=1e9, fit_intercept=False)
- sklearn returns a probability for each class, so model_sklearn.predict_proba(X)[:,1] matches model_statsmodel.predict(X) (up to numerical tolerance)
- Use of the predict function: model_sklearn.predict(X) == (model_statsmodel.predict(X)>0.5).astype(int)
- max_iter and solver also need to be set explicitly so that the two fits match
Statsmodels Logit: https://www.statsmodels.org/stable/generated/statsmodels.discrete.discrete_model.Logit.fit.html#statsmodels.discrete.discrete_model.Logit.fit
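
The main reason the two libraries disagree out of the box is that statsmodels' Logit fits plain maximum likelihood, while scikit-learn's LogisticRegression applies an L2 penalty by default. The sketch below uses neither library: it is a hand-rolled gradient-descent fit on made-up data, with a hypothetical `l2` argument that penalizes the slope only, just to show how a penalty shrinks the coefficient and how class labels follow from thresholding the probability at 0.5.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, l2=0.0, lr=0.5, n_iter=20000):
    """Fit intercept + one slope by gradient descent on the mean log-loss.

    l2 is a ridge penalty on the slope only (0.0 means no regularization).
    """
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        errors = [sigmoid(b0 + b1 * xi) - yi for xi, yi in zip(x, y)]
        g0 = sum(errors) / n
        g1 = sum(e * xi for e, xi in zip(errors, x)) / n + l2 * b1
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1


# Made-up, non-separable data so the unpenalized fit is finite.
x = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
y = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

b0_mle, b1_mle = fit_logistic(x, y, l2=0.0)  # analogous to an unpenalized fit
b0_pen, b1_pen = fit_logistic(x, y, l2=0.1)  # analogous to a penalized fit

print("unpenalized slope:", b1_mle)
print("penalized slope:  ", b1_pen)  # shrunk toward zero

# Class labels are the fitted probabilities thresholded at 0.5.
probs = [sigmoid(b0_mle + b1_mle * xi) for xi in x]
labels = [int(pr > 0.5) for pr in probs]
print(labels)
```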

twang15 commented 3 years ago

Multi-collinearity, regularization, Lasso/Ridge, CV, causal inference, predictive machine learning

Considering multicollinearity is important in regression analysis because, in the extreme, it directly bears on whether or not your coefficients are uniquely identified in the data. In less severe cases, it can still mess with your coefficient estimates; small changes in the data used for estimation may cause wild swings in estimated coefficients. These can be problematic from an inferential standpoint: If two variables are highly correlated, increases in one may be offset by decreases in another so the combined effect is to negate each other. With more than two variables, the effect can be even more subtle, but if the predictions are stable, that is often enough for machine learning applications.

Consider why we regularize in a regression context: We need to constrict the model from being too flexible. Applying the correct amount of regularization will slightly increase the bias for a larger reduction in variance. The classic example of this is adding polynomial terms and interaction effects to a regression: In the degenerate case, the prediction equation will interpolate data points, but probably be terrible when attempting to predict the values of unseen data points. Shrinking those coefficients will likely minimize or entirely eliminate some of those coefficients and improve generalization.
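
A tiny numerical sketch of both points, using made-up data with two nearly identical predictors and no intercept: ordinary least squares solves (X'X) b = X'y, ridge solves (X'X + lambda*I) b = X'y. The helper names and the choice lambda = 1 are assumptions for illustration only.

```python
def solve_2x2(a, b, c, d, r1, r2):
    """Solve [[a, b], [c, d]] @ [u, v] = [r1, r2] by Cramer's rule."""
    det = a * d - b * c
    return ((d * r1 - b * r2) / det, (a * r2 - c * r1) / det)


def ridge_two_predictors(x1, x2, y, lam=0.0):
    """Coefficients of (X'X + lam*I) beta = X'y for two predictors, no intercept."""
    s11 = sum(a * a for a in x1) + lam
    s22 = sum(b * b for b in x2) + lam
    s12 = sum(a * b for a, b in zip(x1, x2))
    t1 = sum(a * c for a, c in zip(x1, y))
    t2 = sum(b * c for b, c in zip(x2, y))
    return solve_2x2(s11, s12, s12, s22, t1, t2)


x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [-2.01, -0.99, 0.0, 1.01, 1.99]  # nearly a copy of x1
y = [-4.1, -1.9, 0.2, 2.1, 3.8]       # roughly x1 + x2 plus noise

ols = ridge_two_predictors(x1, x2, y, lam=0.0)
ridge = ridge_two_predictors(x1, x2, y, lam=1.0)

print("OLS:  ", ols)    # large coefficients with opposite signs
print("ridge:", ridge)  # two similar, moderate coefficients
```

The OLS pair fits the training data slightly better, but its large offsetting coefficients are driven by noise in the tiny difference between x1 and x2, which is the volatility that shrinkage trades a little bias to remove.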

A random forest, however, could be seen to have a regularization parameter through the number of variables sampled at each split: you get better splits the larger the mtry (more features to choose from; some of them are better than others), but that also makes each tree more highly correlated with each other tree, somewhat mitigating the diversifying effect of estimating multiple trees in the first place. This dilemma compels one to find the right balance, usually achieved using cross-validation. Importantly, and in contrast to a regression analysis, the predictions of the random forest model are not harmed by highly collinear variables: even if two of the variables provide the same child node purity, you can just pick one.

Likewise, for something like an SVM, you can include more predictors than observations because the kernel trick lets you operate solely on the inner product of those feature vectors. Having more features than observations would be a problem in regressions, but the kernel trick means we only estimate a coefficient for each exemplar, while the regularization parameter 𝐶 reduces the flexibility of the solution -- which is decidedly a good thing, since estimating 𝑁 parameters for 𝑁 observations in an unrestricted way will always produce a perfect model on the training data -- and we come full circle, back to the ridge/LASSO/elastic net regression scenario where we have the model flexibility constrained as a check against an overly optimistic model. A review of the KKT conditions of the SVM problem reveals that the SVM solution is unique, so we don't have to worry about the identification problems which arose in the regression case.

Finally, consider the actual impact of multicollinearity. It doesn't change the predictive power of the model (at least, on the training data) but it does screw with our coefficient estimates. In most ML applications, we don't care about coefficients themselves, just the loss of our model predictions, so in that sense, checking VIF doesn't actually answer a consequential question. (But if a slight change in the data causes a huge fluctuation in coefficients [a classic symptom of multicollinearity], it may also change predictions, in which case we do care -- but all of this [we hope!] is characterized when we perform cross-validation, which is a part of the modeling process anyway.) A regression is more easily interpreted, but interpretation might not be the most important goal for some tasks.

twang15 commented 3 years ago

twang15 / PlatoAcademy

Linear models #19

Multi-collinearity, regularization, Lasso/Ridge, CV, causal inference, predictive machine learning

How to integrate normalization, feature selection, hyper-parameter tuning with pipelined Grid Search?