Realized we can kill two birds with one stone here: get ensembles and an estimate of generalization error without sacrificing training data. In the production training run we should train predictors on random subsets of the data (either with or without replacement; we may need to experiment to see which works best) and then estimate generalization error by testing on the out-of-bag samples.
I guess to keep information from leaking from model selection into the estimate of generalization error, we'll actually have to do something like this:
For N predictors:

- draw a random subsample of the training data
- perform model selection (architecture / hyperparameter search) and train the selected model using only that subsample
Our production predictor would then take the average over the N predictors selected above.
To estimate generalization error: for each data point in the training data, take the average prediction of the K < N predictors that did not include it in their subsample (i.e. were not trained on it), and compute the accuracy of those out-of-bag predictions.
On one hand this is nice since we'll be averaging over multiple model architectures. On the other hand, this will be quite expensive. Will definitely need the cloud setup working well.
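Here's a rough sketch of that scheme, just to make it concrete. None of this is actual mhcflurry code: the candidate models are stand-in scikit-learn regressors in place of our networks, and all function and parameter names (`fit_bagged_ensemble`, `subsample_fraction`, etc.) are made up for illustration.

```python
# Sketch of the subsample + out-of-bag scheme described above (illustrative only).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor

# Stand-ins for the candidate model architectures.
CANDIDATE_MODELS = [lambda: Ridge(), lambda: KNeighborsRegressor(5)]

def fit_bagged_ensemble(X, y, n_predictors=16, subsample_fraction=0.8, replace=False):
    """Train n_predictors models, each selected and trained on its own random subsample."""
    rng = np.random.RandomState(0)
    models, subsamples = [], []
    for _ in range(n_predictors):
        idx = rng.choice(len(X), size=int(subsample_fraction * len(X)), replace=replace)
        # Model selection uses *only* this subsample (here via an inner holdout),
        # so no information leaks to the out-of-bag points.
        inner_train, inner_val = idx[: len(idx) // 2], idx[len(idx) // 2:]
        fitted = [make().fit(X[inner_train], y[inner_train]) for make in CANDIDATE_MODELS]
        best = min(fitted, key=lambda m: np.mean((m.predict(X[inner_val]) - y[inner_val]) ** 2))
        models.append(best.fit(X[idx], y[idx]))  # refit the winner on the full subsample
        subsamples.append(set(idx))
    return models, subsamples

def out_of_bag_error(X, y, models, subsamples):
    """For each point, average only the K < N predictors that never saw it."""
    errors = []
    for i in range(len(X)):
        oob = [m.predict(X[i:i + 1])[0] for m, s in zip(models, subsamples) if i not in s]
        if oob:  # a point may (rarely) land in every subsample
            errors.append(abs(np.mean(oob) - y[i]))
    return float(np.mean(errors))

# The production predictor simply averages all N models:
# y_hat = np.mean([m.predict(X_new) for m in models], axis=0)
```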
Simpler/cheaper alternative:
For each model architecture, perform k-fold cross-validation to estimate that architecture's performance. The final ensemble is just the k models trained during cross-validation.
Maybe this is the same as your original idea? We'll get inflated estimates of accuracy (since we're choosing the max over all model architectures) but will have to train way fewer models.
Another downside is that our ensemble size and the number of cross-validation folds become tied together.
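A sketch of this alternative, with the same caveats as before (stand-in scikit-learn model, hypothetical names like `cv_score_and_ensemble`):

```python
# Sketch of the simpler alternative: per-architecture k-fold CV, with the k
# fold-models doubling as that architecture's ensemble. Illustrative only.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def cv_score_and_ensemble(X, y, make_model, n_folds=5):
    """Return (mean CV error, list of the k fold-models)."""
    fold_models, fold_errors = [], []
    for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X):
        model = make_model().fit(X[train_idx], y[train_idx])
        fold_errors.append(np.mean(np.abs(model.predict(X[test_idx]) - y[test_idx])))
        fold_models.append(model)
    return float(np.mean(fold_errors)), fold_models

# Run this once per candidate architecture and keep the best-scoring one;
# its k fold-models are the final ensemble. Note the coupling: ensemble size == n_folds.
```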
For reference, here's the cross-validation + ensemble strategy described in the NetMHCpan 3.0 paper:
> Networks were trained in five-fold cross-validation using gradient descent back-propagation with early stopping. Ensembles were generated by training five networks for each data partition and network architecture each starting from a distinct random initial configuration, leading to an ensemble of 10 networks for each data partition, and a total of 50 networks across all partitions. The ensemble trained on all alleles and all peptide lengths will be referred to as the “allmer” method.
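For concreteness, the counts in that description work out as follows; the two-architecture figure isn't stated in the excerpt but is implied by the five-networks-per-architecture and 10-per-partition numbers:

```python
# Ensemble bookkeeping implied by the NetMHCpan 3.0 description above.
n_partitions = 5      # five-fold cross-validation
n_architectures = 2   # implied: 5 seeds/architecture x 2 architectures = 10 per partition
n_seeds = 5           # distinct random initial configurations
per_partition = n_architectures * n_seeds   # 10 networks per data partition
total = n_partitions * per_partition        # 50 networks across all partitions
```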
Closed by #86
In our tests, ensembles of as many as 64 predictors (perhaps more; we haven't tried) showed improved accuracy.