Realized we can kill two birds with one stone here: get ensembles and an estimate of generalization error without sacrificing training data. In the production training run we should train predictors on random subsets of the data (either with or without replacement; we may need to experiment to see which works best) and then estimate generalization error by testing on the out-of-bag samples.
I guess to keep information from leaking from model selection into the estimate of generalization error, we'll actually have to do something like this:
For N predictors:

- draw a random subsample of the training data
- perform model selection (architecture / hyperparameter search) and train the selected model using only that subsample
Our production predictor would then take the average over the N predictors selected above.
To estimate generalization error: for each data point in the training data, take the average prediction of the K < N predictors that did not include it in their subsample (i.e. were not trained on it), and compute the accuracy of those out-of-bag predictions.
On one hand this is nice since we'll be averaging over multiple model architectures. On the other hand, this will be quite expensive. Will definitely need the cloud setup working well.
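Here's a rough sketch of that scheme, just to make it concrete. None of this is actual mhcflurry code: the candidate models are stand-in scikit-learn regressors in place of our networks, and all function and parameter names (`fit_bagged_ensemble`, `subsample_fraction`, etc.) are made up for illustration.

```python
# Sketch of the subsample + out-of-bag scheme described above (illustrative only).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor

# Stand-ins for the candidate model architectures.
CANDIDATE_MODELS = [lambda: Ridge(), lambda: KNeighborsRegressor(5)]

def fit_bagged_ensemble(X, y, n_predictors=16, subsample_fraction=0.8, replace=False):
    """Train n_predictors models, each selected and trained on its own random subsample."""
    rng = np.random.RandomState(0)
    models, subsamples = [], []
    for _ in range(n_predictors):
        idx = rng.choice(len(X), size=int(subsample_fraction * len(X)), replace=replace)
        # Model selection uses *only* this subsample (here via an inner holdout),
        # so no information leaks to the out-of-bag points.
        inner_train, inner_val = idx[: len(idx) // 2], idx[len(idx) // 2:]
        fitted = [make().fit(X[inner_train], y[inner_train]) for make in CANDIDATE_MODELS]
        best = min(fitted, key=lambda m: np.mean((m.predict(X[inner_val]) - y[inner_val]) ** 2))
        models.append(best.fit(X[idx], y[idx]))  # refit the winner on the full subsample
        subsamples.append(set(idx))
    return models, subsamples

def out_of_bag_error(X, y, models, subsamples):
    """For each point, average only the K < N predictors that never saw it."""
    errors = []
    for i in range(len(X)):
        oob = [m.predict(X[i:i + 1])[0] for m, s in zip(models, subsamples) if i not in s]
        if oob:  # a point may (rarely) land in every subsample
            errors.append(abs(np.mean(oob) - y[i]))
    return float(np.mean(errors))

# The production predictor simply averages all N models:
# y_hat = np.mean([m.predict(X_new) for m in models], axis=0)
```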
Simpler/cheaper alternative:
For each model architecture, perform k-fold cross-validation to estimate that architecture's performance. The final ensemble is just the k models trained during cross-validation.
Maybe this is the same as your original idea? We'll get inflated estimates of accuracy (since we're choosing the max over all model architectures) but will have to train way fewer models.
Another downside is that our ensemble size and the number of cross-validation folds become tied together.
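A sketch of this alternative, with the same caveats as before (stand-in scikit-learn model, hypothetical names like `cv_score_and_ensemble`):

```python
# Sketch of the simpler alternative: per-architecture k-fold CV, with the k
# fold-models doubling as that architecture's ensemble. Illustrative only.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def cv_score_and_ensemble(X, y, make_model, n_folds=5):
    """Return (mean CV error, list of the k fold-models)."""
    fold_models, fold_errors = [], []
    for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X):
        model = make_model().fit(X[train_idx], y[train_idx])
        fold_errors.append(np.mean(np.abs(model.predict(X[test_idx]) - y[test_idx])))
        fold_models.append(model)
    return float(np.mean(fold_errors)), fold_models

# Run this once per candidate architecture and keep the best-scoring one;
# its k fold-models are the final ensemble. Note the coupling: ensemble size == n_folds.
```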
For reference, here's the cross-validation + ensemble strategy described in the NetMHCpan 3.0 paper:
> Networks were trained in five-fold cross-validation using gradient descent back-propagation with early stopping. Ensembles were generated by training five networks for each data partition and network architecture each starting from a distinct random initial configuration, leading to an ensemble of 10 networks for each data partition, and a total of 50 networks across all partitions. The ensemble trained on all alleles and all peptide lengths will be referred to as the “allmer” method.
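For concreteness, the counts in that description work out as follows; the two-architecture figure isn't stated in the excerpt but is implied by the five-networks-per-architecture and 10-per-partition numbers:

```python
# Ensemble bookkeeping implied by the NetMHCpan 3.0 description above.
n_partitions = 5      # five-fold cross-validation
n_architectures = 2   # implied: 5 seeds/architecture x 2 architectures = 10 per partition
n_seeds = 5           # distinct random initial configurations
per_partition = n_architectures * n_seeds   # 10 networks per data partition
total = n_partitions * per_partition        # 50 networks across all partitions
```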
Closed by #86
In our tests, ensembles of as many as 64 predictors (perhaps more; we haven't tried) showed improved accuracy.