Add class1 ensemble models, with model selection using imputation

timodonnell commented 7 years ago

This PR adds support for ensembles of single-allele class1 predictors, trained on random halves of each allele's data. A downloadable set of ensembles with 16 models per ensemble is included, supporting 132 alleles (2112 models in total). Each model in an allele's ensemble was selected as the top-performing model (by sum of AUC, F1, and Tau) in model selection over 160 architectures. Imputation was treated as a binary feature of the architecture; overall, about half of the selected models used imputation.
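For readers who want the selection procedure in concrete terms, here is a rough sketch of the idea described above. It is not the code in this PR. The names `combined_score`, `select_ensemble`, and the caller-supplied `evaluate` callback are hypothetical, and it assumes that each ensemble member draws its own random half for training and that candidate architectures are scored on the remaining half.

```python
import random


def combined_score(auc, f1, tau):
    """Selection criterion from the description: the plain sum of the three metrics."""
    return auc + f1 + tau


def select_ensemble(architectures, num_measurements, evaluate,
                    ensemble_size=16, seed=0):
    """Pick one architecture per ensemble member for a single allele.

    `evaluate(arch, train_idx, held_out_idx)` is a hypothetical callback that
    trains `arch` on the training indices and returns (auc, f1, tau) measured
    on the held-out indices. Scoring on the other half is an assumption.
    """
    rng = random.Random(seed)
    ensemble = []
    for _ in range(ensemble_size):
        # Draw a fresh random half of this allele's measurements.
        indices = list(range(num_measurements))
        rng.shuffle(indices)
        half = num_measurements // 2
        train_idx, held_out_idx = indices[:half], indices[half:]
        # Keep the architecture with the highest summed score.
        best = max(
            architectures,
            key=lambda arch: combined_score(*evaluate(arch, train_idx, held_out_idx)),
        )
        ensemble.append((best, sorted(train_idx)))
    return ensemble
```

With 160 candidate architectures (imputation on or off being one of the varied settings) and `ensemble_size=16`, running this once per allele over 132 alleles gives the 2112 models mentioned above.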

To test this out in the current branch you can run:

mhcflurry-downloads fetch models_class1_allele_specific_ensemble
mhcflurry-predict --alleles HLA-A0201 HLA-A0301 --peptides SIINFEKL SIINFEKD SIINFEKQ --predictor class1-allele-specific-ensemble

I'm leaving the existing single-model predictors as the default for now. We can switch the default to ensembles once we have a mass-spec-based assessment of their quality, which should be soon.

There's a lot here, @iskandr, so it may make sense to go over it in person sometime.

coveralls commented 7 years ago

Coverage increased (+1.8%) to 77.991% when pulling 1475833dfd2b0914ef444112713e96b525dbb2a1 on add-class1-ensemble into 34af636f51fdbdd5ad14e6f1884f636a58ce4477 on master.

coveralls commented 7 years ago

Coverage increased (+1.8%) to 77.972% when pulling 275b4b5a0f96e21dfe4f6928428aaa3667e974e7 on add-class1-ensemble into 34af636f51fdbdd5ad14e6f1884f636a58ce4477 on master.

iskandr commented 7 years ago

Long-term concerns:

Will the extensive work-splitting code in the new ensemble class also get used in some form by parallelism for training other model types? Also, will the measurement collection remain redundant with the affinity data set?

For now, just comment the crap out of it!

coveralls commented 7 years ago

Coverage increased (+2.3%) to 78.493% when pulling e0c80756c14f6bd8794b7daff9c49cf3d9353ed9 on add-class1-ensemble into 34af636f51fdbdd5ad14e6f1884f636a58ce4477 on master.

coveralls commented 7 years ago

Coverage increased (+2.3%) to 78.481% when pulling e0c80756c14f6bd8794b7daff9c49cf3d9353ed9 on add-class1-ensemble into 34af636f51fdbdd5ad14e6f1884f636a58ce4477 on master.

timodonnell commented 7 years ago

Thanks for the review, @iskandr. Updated with a lot more documentation. Going to merge momentarily and cut a release.

timodonnell commented 7 years ago

Closing in favor of #84

openvax / mhcflurry

Add class1 ensemble models, with model selection using imputation #83