Open uladkasach opened 7 years ago
Enables seeing which traits increase variance, and for what classifiers
How much will this help, however? Random forest classifiers will always be close to zero error for training - but test can vary wildly.
Classify the training data and 'analyze' it, if the classification file exists (not all classifiers will have one). Then create a statistic in summaries: training error - test error. The summary will have to check whether it can create that statistic.
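A minimal sketch of that statistic, assuming each analysis yields a single error value and that a missing training classification file shows up as `None`. The function name, dictionary layout, and numbers here are hypothetical, not taken from the project:

```python
from typing import Optional

def error_gap(train_error: Optional[float], test_error: Optional[float]) -> Optional[float]:
    """Return training error minus test error, or None if either value is missing."""
    if train_error is None or test_error is None:
        return None
    return train_error - test_error

# Made-up analysis results for illustration only.
# None stands for "no classification file exists for the training data".
analyses = {
    "random_forest": {"train_error": 0.01, "test_error": 0.25},
    "classifier_without_train_file": {"train_error": None, "test_error": 0.30},
}

# The summary only reports the statistic where it can be computed.
summary = {
    name: error_gap(a["train_error"], a["test_error"])
    for name, a in analyses.items()
}
```

Returning `None` instead of raising lets the summary skip classifiers that have no training classification without special-casing them.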
Typically learning curves are recommended, e.g., plotting both errors -vs- dataset size or -vs- other hyperparameters. http://datascience.stackexchange.com/questions/5268/how-to-detect-overfitting-of-a-stock-screener
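A rough sketch of what a learning curve over dataset size could look like, using only the standard library. The 1-nearest-neighbour classifier and the synthetic data are stand-ins chosen for illustration (1-NN memorizes its training set, so its training error stays at zero, much like the random forest case mentioned above). None of these names come from the project:

```python
import random

def make_data(n, rng, noise=0.1):
    """Synthetic 1-D samples: label is 1 when x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label
        data.append((x, label))
    return data

def predict_1nn(train, x):
    """Return the label of the training sample closest to x."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def error_rate(train, samples):
    """Fraction of `samples` misclassified by 1-NN fitted on `train`."""
    wrong = sum(1 for x, y in samples if predict_1nn(train, x) != y)
    return wrong / len(samples)

def learning_curve(train, test, sizes):
    """For each size n, fit on the first n training samples and record both errors."""
    rows = []
    for n in sizes:
        subset = train[:n]
        rows.append((n, error_rate(subset, subset), error_rate(subset, test)))
    return rows

rng = random.Random(0)  # fixed seed so the run is repeatable
train_set = make_data(200, rng)
test_set = make_data(200, rng)
curve = learning_curve(train_set, test_set, sizes=[10, 50, 100, 200])
for n, train_err, test_err in curve:
    print(f"n={n:4d}  train_error={train_err:.3f}  test_error={test_err:.3f}")
```

Plotting the two error columns against `n` gives the curve; a persistent gap between them points to variance, while two high, close curves point to bias.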
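Variance -vs- bias = overfitting the training data -vs- underfitting it. High variance shows up as a large gap between training and test error; high bias shows up as high error on both.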
At first, we can simply measure GENERAL_SUCCESS_RATING on both test and training data for every classification and put it in the analysis. Just create a second analysis script for the training data (append _train). In summaries, specify whether to produce a Bias-vs-Variance summary or a ROC summary.
The Bias-vs-Variance summary should find the configuration with the minimum distance between train_error and test_error, on the expectation that it works best, and check whether it is also optimal (best test error).
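One way such a summary could be sketched, assuming each configuration has a `(train_error, test_error)` pair. The function name, result layout, and error values are hypothetical and made up for illustration:

```python
def bias_variance_summary(results):
    """results maps a configuration name to (train_error, test_error)."""
    # Distance between training and test error for each configuration.
    gaps = {name: abs(train - test) for name, (train, test) in results.items()}
    min_gap_name = min(gaps, key=gaps.get)
    # Configuration with the lowest test error, for comparison.
    best_test_name = min(results, key=lambda name: results[name][1])
    return {
        "min_gap": min_gap_name,
        "best_test": best_test_name,
        "agree": min_gap_name == best_test_name,
    }

# Made-up numbers: config_a overfits, config_c underfits, config_b sits in between.
results = {
    "config_a": (0.01, 0.30),
    "config_b": (0.12, 0.18),
    "config_c": (0.35, 0.36),
}
summary = bias_variance_summary(results)
print(summary)
```

With these illustrative numbers the smallest gap belongs to the underfitting configuration, not the one with the best test error, so reporting both and flagging disagreement seems safer than assuming they coincide.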