openvax / mhcflurry

Peptide-MHC I binding affinity prediction
http://openvax.github.io/mhcflurry/
Apache License 2.0

Use all alleles and peptides at once #130

Closed: saskra closed this issue 5 years ago

saskra commented 6 years ago

Is there a way to train on all supported alleles and a "complete" list of peptides at once, or do I have to train 112 single-allele predictors? And is there a pretrained model that gives me binding predictions for a list of peptides across all alleles?

saskra commented 6 years ago

I think I might have found the function I was looking for, but it does not work for me. Why?

import pandas
from mhcflurry import Class1AffinityPredictor
from mhcflurry.downloads import get_path

data_path = get_path("data_curated", "curated_training_data.no_mass_spec.csv.bz2")
df = pandas.read_csv(data_path)
df = df.loc[(df.peptide.str.len() >= 8) & (df.peptide.str.len() <= 15)]
new_predictor = Class1AffinityPredictor()

allele_train_data = df.loc[df.allele.str.contains('HLA')]
new_predictor.fit_class1_pan_allele_models(
    n_models=1,
    architecture_hyperparameters={
        "layer_sizes": [16],
        "max_epochs": 5,
        "random_negative_constant": 5,
    },
    alleles=allele_train_data.allele.values,
    peptides=allele_train_data.peptide.values,
    affinities=allele_train_data.measurement_value.values,
    inequalities=allele_train_data.measurement_inequality.values,
    models_dir_for_save='models/pan1',
    verbose=1,
    progress_preamble="",
    progress_print_interval=5.0)

Traceback (most recent call last):
  File "<input>", line 45, in <module>
  File "/anaconda3/envs/mhcflurry2/lib/python3.6/site-packages/mhcflurry/class1_affinity_predictor.py", line 688, in fit_class1_pan_allele_models
    allele_to_fixed_length_sequence=self.allele_to_fixed_length_sequence)
  File "/anaconda3/envs/mhcflurry2/lib/python3.6/site-packages/mhcflurry/allele_encoding.py", line 35, in __init__
    [allele_to_fixed_length_sequence[a] for a in all_alleles],
  File "/anaconda3/envs/mhcflurry2/lib/python3.6/site-packages/mhcflurry/allele_encoding.py", line 35, in <listcomp>
    [allele_to_fixed_length_sequence[a] for a in all_alleles],
TypeError: 'NoneType' object is not subscriptable

timodonnell commented 6 years ago

Yes, this function trains pan-allele models, as you noticed. Pan-allele prediction is 'alpha' quality in the codebase. The code should run, but the resulting models are generally less accurate than the single-allele models for well-characterized alleles. I don't recommend using them unless you are able to carefully validate the resulting predictors and are willing to tune the hyperparameters.

That said, the way to get around the error you're seeing is to pass an allele_to_fixed_length_sequence argument when you create the Class1AffinityPredictor. This should be a dict mapping allele names to fixed-length amino acid sequences. Without it, the predictor holds None for this mapping, which is why the lookup in allele_encoding.py fails with "'NoneType' object is not subscriptable". One possible choice for the fixed-length sequences is NetMHCpan-style pseudosequences, which are available here.
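
A minimal sketch of how such a dict might be built and checked before training. Several things here are assumptions rather than part of mhcflurry: the helper names (load_pseudosequences, missing_alleles, make_pan_allele_predictor), the two-column CSV layout with "allele" and "sequence" headers, and the placeholder sequences, which are not real pseudosequences. Only the allele_to_fixed_length_sequence keyword comes from the reply above, and it may be named differently in other mhcflurry versions.

```python
import csv
import io


def load_pseudosequences(handle):
    """Read (allele, sequence) rows into a dict and check that all sequences share one length.

    The "allele" and "sequence" column names are an assumed file layout.
    """
    reader = csv.DictReader(handle)
    allele_to_sequence = {row["allele"]: row["sequence"] for row in reader}
    lengths = {len(seq) for seq in allele_to_sequence.values()}
    if len(lengths) != 1:
        raise ValueError("expected one fixed length, got %s" % sorted(lengths))
    return allele_to_sequence


def missing_alleles(training_alleles, allele_to_sequence):
    """Return the training alleles that have no entry in the dict."""
    return sorted(set(training_alleles) - set(allele_to_sequence))


def make_pan_allele_predictor(allele_to_sequence):
    """Create a predictor that knows the allele sequences (requires mhcflurry)."""
    # Imported lazily so the helpers above work without mhcflurry installed.
    from mhcflurry import Class1AffinityPredictor
    # Keyword name as given in the reply above; it may differ in other versions.
    return Class1AffinityPredictor(
        allele_to_fixed_length_sequence=allele_to_sequence)


# Toy placeholder data standing in for a real pseudosequence file.
EXAMPLE_CSV = """allele,sequence
HLA-A*02:01,AAAAAAAAAA
HLA-B*07:02,CCCCCCCCCC
"""
allele_to_sequence = load_pseudosequences(io.StringIO(EXAMPLE_CSV))

# Alleles in the training data without a sequence would fail at encoding time,
# so it can help to list them (and filter them out) before fitting.
print(missing_alleles(["HLA-A*02:01", "HLA-C*07:01"], allele_to_sequence))
```

With a dict like this, new_predictor = make_pan_allele_predictor(allele_to_sequence) would replace the bare Class1AffinityPredictor() call in the script above, and the training rows could be restricted to alleles present in the dict before calling fit_class1_pan_allele_models.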

If you find that pan-allele models work well for you, we'd be interested to hear what you tried. Making pan-allele prediction accurate and well supported has been one of my next tasks, but lately I've been tied up with other things.