uladkasach / Word-Subject-Classification

Classification of word vectors by subject

Add variance -vs- bias analysis to summary and analysis #8

Open uladkasach opened 7 years ago

uladkasach commented 7 years ago

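Variance -vs- bias = overfitting the training data -vs- underfitting it. High variance shows up as low training error but much higher test error; high bias shows up as high error on both.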

As a first step, simply measure GENERAL_SUCCESS_RATING on both the test and the training data for every classification and include both in the analysis. Just create a second analysis script for the training data (append _train).

In summaries, specify whether it is a Bias-vs-Variance summary or a ROC summary.

The Bias-vs-Variance summary should find that the minimum distance between train_error and test_error works best, i.e. that it also gives the optimal (best) test error.
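A minimal sketch of that check, assuming the per-configuration errors are already available as plain numbers. The names `ErrorPair` and `bias_variance_summary` and all the numbers are hypothetical, not taken from the repo:

```python
from typing import Dict, NamedTuple


class ErrorPair(NamedTuple):
    train_error: float
    test_error: float


def bias_variance_summary(results: Dict[str, ErrorPair]) -> dict:
    """Compare the smallest train/test gap against the best test error."""
    if not results:
        raise ValueError("results must not be empty")
    # Gap between test and training error for every configuration.
    gaps = {name: abs(r.test_error - r.train_error) for name, r in results.items()}
    min_gap_name = min(gaps, key=gaps.get)
    best_test_name = min(results, key=lambda name: results[name].test_error)
    return {
        "gaps": gaps,
        "min_gap": min_gap_name,
        "best_test": best_test_name,
        "agree": min_gap_name == best_test_name,
    }


# Made-up numbers, for illustration only.
example = {
    "config_a": ErrorPair(train_error=0.01, test_error=0.30),
    "config_b": ErrorPair(train_error=0.12, test_error=0.15),
    "config_c": ErrorPair(train_error=0.40, test_error=0.41),
}
summary = bias_variance_summary(example)
print(summary["min_gap"], summary["best_test"], summary["agree"])
```

In these made-up numbers the smallest gap belongs to a configuration with high error on both sets, so the sketch reports both winners and whether they agree instead of assuming they coincide.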

uladkasach commented 7 years ago

This makes it possible to see which traits increase variance, and for which classifiers.

uladkasach commented 7 years ago

How much will this help, however? Random forest classifiers will typically be close to zero error on training data, while their test error can vary wildly.

uladkasach commented 7 years ago

Classify the training data and 'analyze' it, if the classification file exists (not all classifiers will have one). Then create a statistic in summaries: training error - test error. The summary will have to check whether it can create that statistic.
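A rough sketch of the "check whether it can create that statistic" step. It assumes, purely for illustration, that each analysis is a JSON file with an `"error"` field and that the training analysis sits next to the test one with `_train` appended to the file name; the actual file format and the helper names below are hypothetical:

```python
import json
import os
from typing import Optional


def load_error(path: str) -> Optional[float]:
    """Return the error stored in an analysis file, or None if the file is missing."""
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        # Assumed layout: {"error": <float>}
        return float(json.load(handle)["error"])


def train_path_for(test_path: str) -> str:
    """Derive the training analysis path by appending _train before the extension."""
    root, ext = os.path.splitext(test_path)
    return f"{root}_train{ext}"


def train_test_gap(test_path: str) -> Optional[float]:
    """Return training error - test error, or None if either analysis is missing."""
    test_error = load_error(test_path)
    train_error = load_error(train_path_for(test_path))
    if test_error is None or train_error is None:
        return None
    return train_error - test_error
```

Returning `None` lets the summary skip the statistic for classifiers that have no training classification file, rather than failing.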

Typically, learning curves are recommended, e.g. plotting both errors -vs- dataset size or -vs- other hyperparameters. http://datascience.stackexchange.com/questions/5268/how-to-detect-overfitting-of-a-stock-screener
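To illustrate what a learning curve would collect, here is a self-contained sketch. It uses a toy nearest-centroid classifier on synthetic 2-D data, which is an assumption for the example and not one of the classifiers in this repo. It prints training and test error per training-set size; plotting is left out:

```python
import random
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Sample = Tuple[Point, int]


def make_data(n: int, rng: random.Random) -> List[Sample]:
    """Draw n points from two overlapping Gaussian blobs (synthetic data)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 1.0 if label else -1.0
        data.append(((rng.gauss(center, 1.5), rng.gauss(center, 1.5)), label))
    return data


def fit_centroids(train: List[Sample]) -> Dict[int, Point]:
    """Compute the mean point of each class present in the training data."""
    sums = {0: [0.0, 0.0, 0], 1: [0.0, 0.0, 0]}
    for (x, y), label in train:
        sums[label][0] += x
        sums[label][1] += y
        sums[label][2] += 1
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items() if n > 0}


def predict(centroids: Dict[int, Point], point: Point) -> int:
    """Return the label of the nearest centroid."""
    return min(
        centroids,
        key=lambda label: (point[0] - centroids[label][0]) ** 2
        + (point[1] - centroids[label][1]) ** 2,
    )


def error_rate(centroids: Dict[int, Point], data: List[Sample]) -> float:
    wrong = sum(1 for point, label in data if predict(centroids, point) != label)
    return wrong / len(data)


def learning_curve(sizes: List[int], seed: int = 0) -> List[Tuple[int, float, float]]:
    """Return (train_size, train_error, test_error) for each training-set size."""
    rng = random.Random(seed)
    pool = make_data(max(sizes), rng)
    test = make_data(500, rng)
    rows = []
    for size in sizes:
        train = pool[:size]
        model = fit_centroids(train)
        rows.append((size, error_rate(model, train), error_rate(model, test)))
    return rows


for size, train_err, test_err in learning_curve([10, 50, 200, 1000]):
    print(f"{size:5d}  train={train_err:.3f}  test={test_err:.3f}")
```

The same loop could vary a hyperparameter instead of the training-set size; the rows would then be plotted with train and test error on the same axes.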