openvax / mhcflurry

Peptide-MHC I binding affinity prediction
http://openvax.github.io/mhcflurry/
Apache License 2.0

Training models #166

Closed amomin-pact closed 4 years ago

amomin-pact commented 4 years ago

Hi, I am trying to train allele-specific Class 1 prediction models using mhcflurry-class1-train-allele-specific-models. Here are some of the questions I have:

1) What is the format of the input data for training the models? Which columns are essential and which are optional?

2) If the data doesn't have binding affinity measurements, what is an appropriate way to assign values? I remember that for a previous version (1.4) there were instructions to set default values < 500 for MS peptides. How can I find the instructions from the older version? Are those instructions still valid for v1.6?

3) What is the best background or control data to use for MS data?

Thanks, Amin

timodonnell commented 4 years ago

Hi there,

The best source for this information is to take a look at the scripts used to generate these models:

https://github.com/openvax/mhcflurry/blob/master/downloads-generation/models_class1_unselected/GENERATE.sh

https://github.com/openvax/mhcflurry/blob/master/downloads-generation/models_class1/GENERATE.sh

To see the results of these commands, run mhcflurry-downloads fetch models_class1_unselected and then inspect the files in the directory given by mhcflurry-downloads path models_class1_unselected.

To try to answer your specific questions -

1 - To see an example of the training data, run:

$ mhcflurry-downloads fetch data_curated
...
$ bzcat "$(mhcflurry-downloads path data_curated)/curated_training_data.csv.bz2" | head
allele,peptide,measurement_value,measurement_inequality,measurement_type,measurement_kind,measurement_source,original_allele
BoLA-1*21:01,AENDTLVVSV,7817.0,=,quantitative,affinity,Barlow - purified MHC/competitive/fluorescence,BoLA-1*02101
BoLA-1*21:01,NQFNGGCLLV,1086.0,=,quantitative,affinity,Barlow - purified MHC/direct/fluorescence,BoLA-1*02101
BoLA-2*08:01,AAHCIHAEW,21.0,=,quantitative,affinity,Barlow - purified MHC/direct/fluorescence,BoLA-2*00801
BoLA-2*08:01,AAKHMSNTY,1299.0,=,quantitative,affinity,Barlow - purified MHC/direct/fluorescence,BoLA-2*00801
BoLA-2*08:01,DSYAYMRNGW,2.0,=,quantitative,affinity,Barlow - purified MHC/direct/fluorescence,BoLA-2*00801
BoLA-2*08:01,HTTNTQNNDW,40.0,=,quantitative,affinity,Barlow - purified MHC/direct/fluorescence,BoLA-2*00801
BoLA-2*08:01,KVYANIAPTY,10000.0,>,quantitative,affinity,Barlow - purified MHC/competitive/fluorescence,BoLA-2*00801
BoLA-2*08:01,KVYNPPRTNY,393.0,=,quantitative,affinity,Barlow - purified MHC/direct/fluorescence,BoLA-2*00801
BoLA-2*08:01,LAAKHMSNT,1380.0,=,quantitative,affinity,Barlow - purified MHC/direct/fluorescence,BoLA-2*00801

The required columns are allele, peptide, and measurement_value. The measurement_inequality column is optional; if it is omitted, all measurements are interpreted as having an "=" inequality.
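For a programmatic look at the same file, a minimal pandas sketch (assuming the data_curated download has already been fetched as above; the get_path helper from mhcflurry.downloads resolves the same directory that mhcflurry-downloads path prints, and the column check is just illustrative):

# Sketch: inspect the curated training data with pandas.
import pandas as pd
from mhcflurry.downloads import get_path

path = get_path("data_curated", "curated_training_data.csv.bz2")
df = pd.read_csv(path)  # pandas decompresses the .bz2 file automatically

required = ["allele", "peptide", "measurement_value"]
missing = [col for col in required if col not in df.columns]
print("rows:", len(df))
print("missing required columns:", missing or "none")
print(df[required + ["measurement_inequality"]].head())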

2 - Yes, all versions since at least 1.2.0 accept a "<" measurement inequality, and this is the recommended way to incorporate mass spec data: set measurement_inequality to "<" and measurement_value to 500. If you have no affinity measurements at all, you can also just set all mass spec hits to 0 nM. Your results won't be calibrated nM affinities, but without any affinity data you won't have calibrated affinities in any case. There should be no changes in how you do this across versions.
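As a concrete sketch of preparing such rows (the alleles and peptides here are placeholders, not real data; only the 500 nM value and "<" inequality follow the recommendation above):

# Sketch: convert mass spec hits into mhcflurry training rows.
# The hits list is a placeholder; real data would come from your MS pipeline.
import pandas as pd

hits = [
    ("HLA-A*02:01", "SLYNTVATL"),
    ("HLA-A*02:01", "GILGFVFTL"),
]

training_df = pd.DataFrame(hits, columns=["allele", "peptide"])
training_df["measurement_value"] = 500.0     # treated as an upper bound on affinity
training_df["measurement_inequality"] = "<"  # i.e. "tighter than 500 nM"

# Concatenate with any quantitative affinity rows you have, then write out.
training_df.to_csv("mass_spec_training_rows.csv", index=False)
print(training_df)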

3 - What we do in the production models is just use the random negative mechanism to provide the background. To do this, set the "random_negative_rate" hyperparameter to something greater than 0 (e.g. 0.5 will add n * 0.5 random negatives, where n is the number of hits). Then your training data can consist of positive examples only. Another alternative is to add unobserved peptides to the training data and give them e.g. a measurement_value of 20000 with a ">" inequality.
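As an illustration, a hyperparameters file with random_negative_rate set could be written as in the sketch below. The JSON is a list of hyperparameter dicts, as in the GENERATE.sh scripts; the settings other than random_negative_rate are illustrative rather than the production configuration, and the flag names in the comment should be checked against --help for your version:

# Sketch: write a hyperparameters JSON for mhcflurry-class1-train-allele-specific-models.
import json

hyperparameters = [{
    "random_negative_rate": 0.5,  # add 0.5 * (number of hits) random negative peptides
    "max_epochs": 500,
    "layer_sizes": [64],
}]

with open("hyperparameters.json", "w") as fd:
    json.dump(hyperparameters, fd, indent=2)

# Then, roughly:
#   mhcflurry-class1-train-allele-specific-models \
#       --data training_data.csv \
#       --hyperparameters hyperparameters.json \
#       --out-models-dir models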

If you still hit issues let us know a bit more about your use case and we can try to help more.

Tim

amomin-pact commented 4 years ago

@timodonnell I have looked at the GENERATE.sh files for models_class1_unselected and models_class1 for training an allele-specific model. models_class1_unselected/GENERATE.sh only runs mhcflurry-class1-train-allele-specific-models. On the other hand, models_class1/GENERATE.sh runs write_validation_data.py, mhcflurry-class1-select-allele-specific-models, and mhcflurry-calibrate-percentile-ranks.

Does models_class1/GENERATE.sh perform model selection on previously trained models (models_class1_unselected)? Can you provide the chronological order of steps needed to train and select a model?

I get an error when I run models_class1/GENERATE.sh: FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/.local/share/mhcflurry/4/2.0.0/data_curated//curated_training_data.with_mass_spec.csv.bz2'. The curated_training_data.with_mass_spec.csv.bz2 file is missing from my data_curated download. I also tried a manual download using the wget instructions.

Additionally, does the output file (test.csv) need to include the mass spec or other source data in order to optimize the model for that information? How can one add the additional information when building the test.csv file?

timodonnell commented 4 years ago

Does the models_class1/GENERATE.sh perform model selection on a previously made model (models_class1_unselected)?

Yes - that's correct. The order is: run models_class1_unselected/GENERATE.sh to train the full set of models, then run models_class1/GENERATE.sh, which writes out validation data, performs model selection on the unselected models, and calibrates percentile ranks for the selected ensembles.

There is some more information in this book chapter, which may be helpful to you: https://link.springer.com/protocol/10.1007/978-1-0716-0327-7_8

I would recommend using MHCflurry version 1.2.4 to train these models, since after that release everything switched to pan-allele models. I think that switch is what is causing the missing-file error you are seeing. After moving to that version, re-download the data_curated data so you get the version that was used in that release. That should fix the missing-file issue, which was an incompatibility I accidentally introduced while adding the pan-allele predictors.

Alternatively - you could train pan-allele predictors, which are the focus of current mhcflurry development.

Hope that helps.

Tim

amomin-pact commented 4 years ago

Thanks for your valuable feedback. I looked into training a pan-allele model and tried running downloads-generation/models_class1_pan/GENERATE.sh to test one of the pan models. However, it is slow on a non-GPU instance. Also, what are the key differences between models_class1_pan and the others, such as models_class1_pan_variants and models_class1_presentation? Here are a few questions based on my initial run:

1) Are there any recommendations regarding the configuration of a GPU instance to use for testing and training?

2) Which of these pan models is most appropriate for training?

On the other hand, while trying to install mhcflurry 1.2.4 I have been encountering a TensorFlow error: AttributeError: module 'tensorflow' has no attribute 'ConfigProto'. Does mhcflurry 1.2.4 require a specific TensorFlow version? I have tried the older TensorFlow 1.15 as well as the newer TensorFlow 2.2.0. I guess since the code is older it may have incompatibilities with newer Python libraries.
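For reference, a minimal check of where ConfigProto lives, independent of mhcflurry (this only assumes a standard TensorFlow install; ConfigProto is a top-level attribute in TensorFlow 1.x but moved under tf.compat.v1 in 2.x):

# Sketch: report which ConfigProto locations exist in the installed TensorFlow.
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("has tf.ConfigProto:", hasattr(tf, "ConfigProto"))
compat = getattr(tf, "compat", None)
print("has tf.compat.v1.ConfigProto:",
      hasattr(compat, "v1") and hasattr(compat.v1, "ConfigProto"))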