### Todos

twang15 commented 3 years ago

Feature selection via exhaustive search.
Estimate search time
Try several other linear (SVM) and non-linear (random forest, extra trees, gradient boosting, XGBoost) models
Model interpretation for the best linear model (via statistics, hypothesis testing, LIME, and Shapley values)
Metrics: AUC, accuracy, sensitivity, specificity, PPV
Model selection via nested CV
Model comparison in terms of AUC (p-value), accuracy, speed
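
For the search-time estimate, a rough sketch is to count the candidate subsets and multiply by the measured cost of one cross-validated fit. The feature count, subset-size cap, fold count, and per-fit time below are placeholder assumptions, not values from this project.

```python
from math import comb


def n_subsets(n_features: int, max_size: int) -> int:
    """Number of non-empty feature subsets with at most max_size features."""
    return sum(comb(n_features, k) for k in range(1, max_size + 1))


def estimate_search_seconds(n_features, max_size, cv_folds, seconds_per_fit):
    """Total wall time if every subset is scored with cv_folds model fits."""
    return n_subsets(n_features, max_size) * cv_folds * seconds_per_fit


# Hypothetical numbers: 30 candidate features, subsets of up to 7 features,
# 5-fold CV, 0.02 s per fit (time one fit on the real data first).
total = estimate_search_seconds(30, 7, cv_folds=5, seconds_per_fit=0.02)
print(f"{n_subsets(30, 7):,} subsets, about {total / 3600:.1f} hours")
```

Because the subset count grows combinatorially, capping the subset size or pre-filtering features changes the estimate far more than speeding up a single fit.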

twang15 commented 3 years ago

Impact of normalization on XGBoost
If XGBoost or another non-linear model is no better, what to do?
- report statistics for several non-linear models (more is better than fewer)
- explain the best non-linear model for more insight than explaining logistic regression alone
SVM, Random Forest and model selection harness
Ensemble/stacking for XGBoost
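
A model selection harness along these lines could be built on nested cross-validation: an inner loop tunes hyperparameters, an outer loop estimates AUC on folds the tuning never saw. This is a minimal sketch assuming scikit-learn; the synthetic data, the candidate grids, and the fold counts are illustrative assumptions, not the project's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data set (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # tuning
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # evaluation

# Scaling lives inside each pipeline so it is refit on every training fold.
candidates = {
    "logit": (
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {"logisticregression__C": [0.1, 1.0, 10.0]},
    ),
    "svm": (
        make_pipeline(StandardScaler(), SVC()),
        {"svc__C": [0.1, 1.0, 10.0]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, scoring="roc_auc", cv=inner)
    scores = cross_val_score(search, X, y, scoring="roc_auc", cv=outer)
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: AUC {scores.mean():.3f} +/- {scores.std():.3f}")
```

Tree-based models such as random forest or XGBoost can be added to `candidates` in the same way; they are largely insensitive to monotone feature scaling, so the scaler step can be dropped for them.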

twang15 commented 3 years ago

Experiments show that stacking brings little benefit.
- Decide to not use stacking/voting

| Logit | XGBoost | SVM | # features | Features |
|---|---|---|---|---|
| 0.87421 | 0.89258 | 0.8849 | 16 | ['age', 'rRR', 'rLen', 'rPTLA', 'lPSA', 'lRR', 'rThick', 'lSPA', 'rPSPA', 'DLK', 'weight', 'rKUPE', 'rPTSA', 'height', 'lPT', 'lThick'] |
| 0.87547 | 0.86283 | 0.87604 | 7 | ['lRP', 'rRP', 'age', 'lTSPA', 'rKUPE', 'weight', 'DLK'] |
| 0.87054 | 0.84427 | 0.87127 | 5 | ['lRP', 'rRP', 'age', 'lTSPA', 'DLK'] |
| 0.86539 | 0.83439 | 0.86752 | 4 | ['lRP', 'rRP', 'age', 'lTSPA'] |
| 0.8576 | 0.80461 | 0.85891 | 3 | ['lTSPA', 'rRP', 'lRP'] |
| 0.84638 | 0.79191 | 0.84633 | 2 | ['lRP', 'rRP'] |
| 0.82701 | 0.76416 | 0.82701 | 1 | ['rRP'] |
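
One way to read the table above is to ask, per model, for the smallest feature set whose score stays within a tolerance of that model's best. The sketch below uses the scores from the table (assumed to be AUC); the 0.01 tolerance is an arbitrary illustrative choice, not a rule used in these experiments.

```python
# (number of features, Logit, XGBoost, SVM), copied from the table.
rows = [
    (16, 0.87421, 0.89258, 0.88490),
    (7, 0.87547, 0.86283, 0.87604),
    (5, 0.87054, 0.84427, 0.87127),
    (4, 0.86539, 0.83439, 0.86752),
    (3, 0.85760, 0.80461, 0.85891),
    (2, 0.84638, 0.79191, 0.84633),
    (1, 0.82701, 0.76416, 0.82701),
]
models = ["Logit", "XGBoost", "SVM"]


def smallest_within(scores_by_size, tol):
    """Smallest feature count whose score is within tol of the best score."""
    best = max(scores_by_size.values())
    return min(n for n, s in scores_by_size.items() if s >= best - tol)


choice = {}
for col, name in enumerate(models, start=1):
    scores_by_size = {row[0]: row[col] for row in rows}
    choice[name] = smallest_within(scores_by_size, tol=0.01)

print(choice)
```

With this tolerance the linear models tolerate much smaller feature sets than XGBoost, whose score falls off quickly as features are removed.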

Performance:
- XGBoost, AUC=89.3 %
- Learning curve: overfitting?
Explanations
- Model-level vs. instance-level
- Feature importance (Logit): statistical significance, coefficients
- Decision process (Decision Tree)
- Shapley values
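
To make the Shapley-value item concrete, here is a minimal sketch that computes exact Shapley values by enumerating feature coalitions, with features outside a coalition replaced by a baseline value. The linear score, its weights, and the inputs are made up for illustration; in practice a library such as `shap` would approximate this for real models.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline input.

    Features outside a coalition take their baseline value. The cost grows
    as 2**n, so this is only practical for a handful of features.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical linear score with three features (weights are made up).
WEIGHTS, BIAS = [0.8, -0.5, 0.3], 0.1


def linear(v):
    return BIAS + sum(w * vi for w, vi in zip(WEIGHTS, v))


x, baseline = [1.0, 2.0, 3.0], [0.0, 1.0, 1.0]
phi = shapley_values(linear, x, baseline)
print(phi)
```

For a linear score each value reduces to the coefficient times the feature's deviation from the baseline, and the values sum to the difference between the prediction at `x` and at the baseline, which is why coefficients and Shapley values tell a consistent story for Logit.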

twang15 commented 3 years ago

['rRP', 'lRP', 'lAR', 'lPLA', 'age', 'DLK', 'lThick', 'LE', 'rShort', 'rRR', 'rPTLPA', 'lSPA']

twang15 commented 2 years ago

Experiments:
- A sensitivity analysis of prediction variance with respect to training set size is recommended to find the point of diminishing returns.
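
Such an analysis could be sketched with scikit-learn's `learning_curve`, using the spread of validation AUC across folds as a proxy for prediction variance at each training set size. The synthetic data, the estimator, and the size grid are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, learning_curve

# Synthetic stand-in for the real data set (assumption).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4, random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 6),  # fractions of each training fold
    cv=cv,
    scoring="roc_auc",
)

# Rows are training sizes, columns are CV folds.
test_mean = test_scores.mean(axis=1)
test_std = test_scores.std(axis=1)
for n, m, s in zip(sizes, test_mean, test_std):
    print(f"n={int(n):4d}  AUC={m:.3f}  std={s:.3f}")
```

The point of diminishing returns is roughly where the mean stops rising and the standard deviation stops shrinking as `n` grows; a persistent gap between `train_scores` and `test_scores` would instead point to overfitting.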

twang15 / PlatoAcademy