openvax / mhcflurry

Peptide-MHC I binding affinity prediction
http://openvax.github.io/mhcflurry/
Apache License 2.0

fail to fetch mass-spec model #123

Closed. weipenegHU closed this issue 5 years ago

weipenegHU commented 6 years ago

I am trying to download the models trained with mass-spec data. When I run the command "mhcflurry-downloads fetch models_class1_trained_with_mass_spec", it outputs the following error: "models_class1_trained_with_mass_spec. Valid downloads are: models_class1, models_class1_selected_no_mass_spec, models_class1_unselected, models_class1_minimal, data_iedb, data_published, data_systemhcatlas, data_curated". The same problem occurs in v0.9.2.

timodonnell commented 6 years ago

This was added in master but I hadn't pushed a new version to PyPI yet. I just pushed version 1.2.1 to PyPI, though - can you try it with this latest version?

weipenegHU commented 6 years ago

OK, I will give it a shot

weipenegHU commented 6 years ago

May I ask whether the mass-spec model is trained on pure mass-spec data or on a mix with affinity data, and how you process the measurement values given that mass-spec data is binary? I want to train MHCflurry with mass-spec data; would there be any problems if I use the same model architecture that was trained on affinity data, without any modifications?

timodonnell commented 6 years ago

It's a mix of mass spec and affinity data. Here is some (currently unpublished) info on what we do:

MHCflurry predicts quantitative binding affinities, but one fourth of the entries (57,828 of 230,735) in the affinity dataset are qualitative, represented as positive, positive-high, positive-intermediate, positive-low, or negative. To use these measurements for training, the MHCflurry models are trained using a modification to the mean square error (MSE) loss function, in which measurements may be associated with an inequality, (>) or (<), and contribute to the loss only when the inequality is violated. For example, we assigned measurements represented as positive-high the value “< 100 nM”, as such peptides are likely to have binding affinities tighter than (i.e. less than) 100 nM. During training, these peptides contribute to the loss only when their predictions are greater than 100 nM. For the MHCflurry (train-MS) predictor, this approach is used to include MS-identified ligands, which are assigned a “< 500 nM” value.
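To make the inequality idea concrete, here is a minimal numpy sketch of the masking logic, written directly on the nM scale to match the description above. Treat it as an illustration rather than the production loss code; the function name and signature are just for this example.

```python
import numpy as np

def mse_with_inequalities(pred_nm, target_nm, inequality):
    """Illustrative MSE where each measurement may carry an inequality (nM scale).

    inequality holds one of "=", "<", ">" per measurement:
      "=": ordinary quantitative measurement, always contributes.
      "<": true affinity is tighter than target_nm (e.g. "< 100 nM");
           contributes only when the prediction is weaker, i.e. > target_nm.
      ">": true affinity is weaker than target_nm (e.g. "> 20,000 nM");
           contributes only when the prediction is tighter, i.e. < target_nm.
    """
    pred_nm = np.asarray(pred_nm, dtype=float)
    target_nm = np.asarray(target_nm, dtype=float)
    inequality = np.asarray(inequality)

    contributes = (
        (inequality == "=")
        | ((inequality == "<") & (pred_nm > target_nm))
        | ((inequality == ">") & (pred_nm < target_nm))
    )
    squared_error = (pred_nm - target_nm) ** 2
    return float(np.mean(np.where(contributes, squared_error, 0.0)))

# A mass spec hit assigned "< 500 nM" is penalized only if the model predicts
# it weaker than 500 nM: here only the second prediction contributes.
print(mse_with_inequalities([50.0, 900.0], [500.0, 500.0], ["<", "<"]))  # -> 80000.0
```

In practice the networks are trained on log-transformed affinities (roughly 1 - log(nM)/log(50,000)) rather than raw nM values, but the inequality masking works the same way.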

These inequalities are only necessary, though, if you're trying to combine affinity measurements with mass spec. If you just want to train on mass spec only, you don't have to worry about that. You can set the mass spec hits to some strong affinity (e.g. 1 nM) and then, to provide negative examples, either include decoys with a weak affinity (e.g. 20,000 nM) or set the random negative peptide rate to be high (see the sketch below). Some more info on the latter:

At each epoch, 25 synthetic negative peptides for each length 8-15 are randomly generated. These random negative peptides are sampled so as to have the same amino acid distribution as the training peptides and are assigned affinities > 20,000 nM. For the MHCflurry (train-MS) variant, the number of random peptides for each length is 0.2n + 25, where n is the number of training peptides.
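Putting the last two paragraphs together, here is a rough sketch of how such a mass-spec-only training set could be assembled: hits pinned to a strong affinity, plus random negative peptides sampled to match the hits' amino acid composition, with the per-length count following 0.2n + 25. The function names, column names, and example peptides below are only for illustration and are not a required input format.

```python
import numpy as np
import pandas as pd

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def random_negatives(training_peptides, lengths=range(8, 16), rate=0.2, constant=25, rng=None):
    """Sample synthetic negative peptides whose amino acid distribution matches
    the training peptides; per length, the count is rate * n + constant."""
    rng = rng if rng is not None else np.random.default_rng(0)
    # Empirical amino acid frequencies of the training (hit) peptides.
    counts = pd.Series(list("".join(training_peptides))).value_counts()
    probs = counts.reindex(AMINO_ACIDS).fillna(0.0)
    probs = (probs / probs.sum()).values
    num_per_length = int(rate * len(training_peptides) + constant)
    negatives = []
    for length in lengths:
        for _ in range(num_per_length):
            negatives.append("".join(rng.choice(AMINO_ACIDS, size=length, p=probs)))
    return negatives

def mass_spec_training_table(hit_peptides, allele):
    """Hits get a strong affinity (1 nM); random negatives get a weak one (20,000 nM)."""
    decoys = random_negatives(hit_peptides)
    rows = (
        [{"allele": allele, "peptide": p, "measurement_value": 1.0} for p in hit_peptides]
        + [{"allele": allele, "peptide": p, "measurement_value": 20000.0} for p in decoys]
    )
    return pd.DataFrame(rows)

# Toy example with two hits; a real mass spec dataset would have thousands.
table = mass_spec_training_table(["SIINFEKLL", "LLFGYPVYV"], allele="HLA-A*02:01")
print(table.measurement_value.value_counts())
```

Note that during actual training the random negatives are regenerated at each epoch rather than fixed up front.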

The hyperparameters we use for the models that include mass spec in the training data are generated here: https://github.com/openvax/mhcflurry/blob/master/downloads-generation/models_class1_unselected_with_mass_spec/generate_hyperparameters.py
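The knobs discussed above correspond to entries in that hyperparameter list. Roughly, and from memory only (the key names and values below are approximate, not copied from the script), the relevant pieces look something like this; check generate_hyperparameters.py itself for the authoritative settings:

```python
# Illustrative excerpt only - key names and values are approximate, not copied
# from generate_hyperparameters.py; see the linked script for the real settings.
hyperparameters = {
    "loss": "custom:mse_with_inequalities",  # the inequality-aware MSE described above
    "random_negative_rate": 0.2,             # the 0.2 * n term for random negatives
    "random_negative_constant": 25,          # the + 25 term, per peptide length
    "random_negative_affinity_min": 20000.0, # random negatives treated as weak binders
    "peptide_amino_acid_encoding": "BLOSUM62",
    "layer_sizes": [32],
    "max_epochs": 500,
}
```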

weipenegHU commented 6 years ago

Thank you! Your work is so fascinating and meaningful!