openvax / mhcflurry

Peptide-MHC I binding affinity prediction
http://openvax.github.io/mhcflurry/
Apache License 2.0

Huge performance difference between v0.9.2 and v1.2.1 #124

Closed weipenegHU closed 6 years ago

weipenegHU commented 6 years ago

Hi, I found a huge performance difference between v0.9.2 and v1.2.1 when calculating the 0.1% PPV on an A2402 test dataset. (I sorted the results by predicted affinity and took the top 0.1% to calculate the true positive rate, e.g. selecting 591 records out of 591,000.) PPV with 0.9.2: 0.32. PPV with 1.2.1: 0. I also found that I couldn't use the mass models in v0.9.2. I have attached the test data here so the problem can be reproduced: mchflurry_test.zip

The models I used in v0.9.2 were downloaded from: https://github.com/openvax/mhcflurry/releases/download/0.9.2/models_class1.tar.bz2

The models I used in v1.2.1 were downloaded from: https://github.com/openvax/mhcflurry/releases/download/pre-1.2/models_class1.20180225.tar.bz2

timodonnell commented 6 years ago

Thanks for posting this @weipenegHU . I ran the v1.2.1 models on your data and I get a PPV of 0.279:

import mhcflurry
import pandas

# Load the v1.2.1 models and predict an affinity for every (peptide, allele) row.
predictor = mhcflurry.Class1AffinityPredictor.load("/Users/tim/Downloads/models_class1.20180225 (2)/models")
df = pandas.read_csv("/Users/tim/Downloads/mchflurry_test.txt")
df["prediction"] = predictor.predict(peptides=df.peptide.values, alleles=df.allele.values)

# Rank by predicted affinity (lowest first), then measure how many hits fall in the top 0.1%.
sorted_df = df.sort_values("prediction").reset_index(drop=True)
print("PPV: ", (sorted_df.loc[sorted_df.hit > 0].index < len(sorted_df) * 0.001).mean())

Gives:

PPV:  0.279187817259

Mind posting your code so we can try to reconcile the difference? Also, I'm curious where the dataset you're testing on came from.
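A note on the expression above: it measures the fraction of hits that land in the top 0.1% of the ranking. That equals the PPV of the top 0.1% (the fraction of top-ranked rows that are hits) only when the number of hits matches the number of selected rows, for example with 999 decoys per hit. The reported 0.279187817259 is consistent with 55 of 197 hits (55/197 ≈ 0.2792), although the thread does not state the hit count. Below is a minimal, pandas-free sketch of both quantities; the function names and the `one_in` parameter are hypothetical and not part of mhcflurry.

```python
def _top_indices(scores, one_in=1000):
    """Indices of the top 1/one_in rows, ranked by ascending score.

    Lower predicted affinity means a stronger binder, so low scores rank first.
    The cutoff uses integer division (rounds down, at least one row).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    k = max(1, len(scores) // one_in)
    return order[:k]


def ppv_at_top(scores, hits, one_in=1000):
    """Fraction of the top-ranked rows that are hits."""
    top = _top_indices(scores, one_in)
    return sum(1 for i in top if hits[i] > 0) / len(top)


def hit_fraction_in_top(scores, hits, one_in=1000):
    """Fraction of all hits that rank inside the top rows."""
    total_hits = sum(1 for h in hits if h > 0)
    if total_hits == 0:
        raise ValueError("no hits in the dataset")
    top = _top_indices(scores, one_in)
    return sum(1 for i in top if hits[i] > 0) / total_hits
```

Using integer division for the cutoff avoids floating-point rounding at the boundary, at the cost of rounding down when the row count is not a multiple of `one_in`.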

weipenegHU commented 6 years ago

Code:

Shell script:

mhcflurry-predict ../test_data/HLA-A2402/mhcflurry_input.txt --out ../test_data/HLA-A2402/mhcflurry_affinity_predictions.csv

Python script:

import pandas as pd

prediction = pd.read_csv("mhcflurry_affinity_predictions.csv")
hit = pd.read_csv("mchflurry_test.txt")
merge = pd.merge(hit[['peptide','hit']], prediction[['peptide','mhcflurry_prediction']], on='peptide')
sorted_merge = merge.sort_values('mhcflurry_prediction').reset_index(drop=True)
print("PPV: ", (sorted_merge.loc[sorted_merge.hit > 0].index < len(sorted_merge) * 0.001).mean())

Here is my affinity prediction. Is there any chance that we used different models? (Due to the size restriction, I only uploaded a sample of the prediction results.) mhcflurry_affinity_predictions_sample.txt

The positive data (mass spec data) are from "MHC class I–associated peptides derive from selective regions of the human genome", which is the independent test data used by netMHCpan4.0. The negative data are decoys selected at random from the proteome.

timodonnell commented 6 years ago

Yeah, it seems like we may be using different models somehow; my predictions don't match yours. Looking at the top row in the prediction file you sent, you have a prediction of 55.9547162806:

HLA-A2402,ALPSKLPTF,55.9547162806,24.4582190211,191.28437783599998

but I'm seeing 196.299:

$ mhcflurry-predict --alleles HLA-A2402 --peptides ALPSKLPTF
allele,peptide,mhcflurry_prediction,mhcflurry_prediction_low,mhcflurry_prediction_high,mhcflurry_prediction_percentile
HLA-A2402,ALPSKLPTF,196.29918048677726,80.74734568613658,572.4939508616925,1.2248749999999995

Could you send me the output from the predict command above as well as your output for the following commands?

mhcflurry-downloads info
mhcflurry-predict --version
bzcat "$(mhcflurry-downloads path models_class1)/LOG.txt.bz2" | head

Thanks

Tim
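For anyone without `bzcat` available, the log check above can be approximated in Python with the standard `bz2` module. This is a minimal sketch: `head_of_bz2_log` is a hypothetical helper, and the path passed to it is assumed to be the `LOG.txt.bz2` file inside whatever directory `mhcflurry-downloads path models_class1` prints.

```python
import bz2
from itertools import islice


def head_of_bz2_log(path, n_lines=10):
    """Return the first n_lines of a bzip2-compressed text file.

    Only the beginning of the file is decompressed, so this stays cheap
    even for a large log.
    """
    with bz2.open(path, mode="rt", encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, n_lines)]
```

The date line near the top of the returned list is what distinguishes one model release from another in this thread.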

weipenegHU commented 6 years ago

Predict command:

$ ./bin/mhcflurry-predict --alleles HLA-A2402 --peptides ALPSKLPTF --models ../mhcflurry-master/downloads-generation/models_class1/models

allele,peptide,mhcflurry_prediction,mhcflurry_prediction_low,mhcflurry_prediction_high
HLA-A2402,ALPSKLPTF,55.9547102663,24.4582103458,191.284377836

$ ./bin/mhcflurry-downloads info

(FYI, I downloaded the necessary files manually because of the firewall.)

Environment variables
  MHCFLURRY_DATA_DIR                  [unset or empty]
  MHCFLURRY_DOWNLOADS_CURRENT_RELEASE [unset or empty]
  MHCFLURRY_DOWNLOADS_DIR             [unset or empty]
  MHCFLURRY_DEFAULT_CLASS1_MODELS     [unset or empty]

Configuration
  current release                     = 1.2.0                
  downloads dir                       = /home/huweipeng/.local/share/mhcflurry/4/1.2.0 [does not exist]

DOWNLOAD NAME                             DOWNLOADED?   URL                  
models_class1                             NO            https://github.com/openvax/mhcflurry/releases/download/pre-1.2/models_class1.20180225.tar.bz2 
models_class1_selected_no_mass_spec       NO            https://github.com/openvax/mhcflurry/releases/download/pre-1.2/models_class1_selected_no_mass_spec.20180225.tar.bz2 
models_class1_unselected                  NO            https://github.com/openvax/mhcflurry/releases/download/pre-1.2/models_class1_unselected.20180221.tar.bz2 
models_class1_minimal                     NO            https://github.com/openvax/mhcflurry/releases/download/pre-1.2/models_class1_minimal.20180226.tar.bz2 
data_iedb                                 NO            https://github.com/openvax/mhcflurry/releases/download/pre-1.0/data_iedb.tar.bz2 
data_published                            NO            http://github.com/openvax/mhcflurry/releases/download/pre-1.1/data_published.tar.bz2 
data_systemhcatlas                        NO            http://github.com/openvax/mhcflurry/releases/download/pre-1.1/data_systemhcatlas.tar.bz2 
data_curated                              NO            https://github.com/openvax/mhcflurry/releases/download/pre-1.2/data_curated.20180219.tar.bz2 

mhcflurry-predict --version

mhcflurry 1.2.0

bzcat "$(mhcflurry-downloads path models_class1)/LOG.txt.bz2" | head

Thu Aug  3 14:49:28 EDT 2017
+ date
+ pip freeze
appdirs==1.4.0
backports.weakref==1.0rc1
biopython==1.68
bioseq==0.1.0
bitmath==1.3.1.2
bleach==1.5.0
bottle==0.12.9

timodonnell commented 6 years ago

It looks like you're using old models.

This is the output you want to see for the latest models (note the Sat Feb 24 date):

$ bzcat "$(mhcflurry-downloads path models_class1)/LOG.txt.bz2" | head
+ date
Sat Feb 24 15:02:08 EST 2018
+ pip freeze
absl-py==0.1.10
alabaster==0.7.10
anaconda-client==1.6.5
anaconda-navigator==1.6.9
anaconda-project==0.8.0
appdirs==1.4.3
asn1crypto==0.22.0

Can you try downloading https://github.com/openvax/mhcflurry/releases/download/pre-1.2/models_class1.20180225.tar.bz2 , unpacking it, and running predict using those models?

weipenegHU commented 6 years ago

Yeah, I got the correct result this time. However, I don't know what you mean by "old models"; the v0.9.2 models seem to work fine for me. Another problem is that I failed to run the mass models with the 0.9.2 predict command. Here is the output:

Traceback (most recent call last):
  File "/hwfssz1/BIGDATA_COMPUTING/software/tools/anaconda/bin/mhcflurry-predict", line 11, in <module>
    sys.exit(run())
  File "/hwfssz1/BIGDATA_COMPUTING/software/tools/anaconda/lib/python3.6/site-packages/mhcflurry/predict_command.py", line 151, in run
    predictor = Class1AffinityPredictor.load(models_dir)
  File "/hwfssz1/BIGDATA_COMPUTING/software/tools/anaconda/lib/python3.6/site-packages/mhcflurry/class1_affinity_prediction/class1_affinity_predictor.py", line 199, in load
    model = Class1NeuralNetwork.from_config(config, weights=weights)
  File "/hwfssz1/BIGDATA_COMPUTING/software/tools/anaconda/lib/python3.6/site-packages/mhcflurry/class1_affinity_prediction/class1_neural_network.py", line 212, in from_config
    instance = cls(**config.pop('hyperparameters'))
  File "/hwfssz1/BIGDATA_COMPUTING/software/tools/anaconda/lib/python3.6/site-packages/mhcflurry/class1_affinity_prediction/class1_neural_network.py", line 101, in __init__
    hyperparameters)
  File "/hwfssz1/BIGDATA_COMPUTING/software/tools/anaconda/lib/python3.6/site-packages/mhcflurry/hyperparameters.py", line 58, in with_defaults
    self.check_valid_keys(obj)
  File "/hwfssz1/BIGDATA_COMPUTING/software/tools/anaconda/lib/python3.6/site-packages/mhcflurry/hyperparameters.py", line 87, in check_valid_keys
    " ".join(self.defaults)))
ValueError: No such model parameters: minibatch_size peptide_amino_acid_encoding train_data allele_dense_layer_sizes peptide_dense_layer_sizes peptide_allele_merge_method peptide_allele_merge_activation learning_rate. Valid parameters are: kmer_size use_embedding embedding_input_dim embedding_output_dim pseudosequence_use_embedding layer_sizes dense_layer_l1_regularization dense_layer_l2_regularization activation init output_activation dropout_probability batch_normalization embedding_init_method locally_connected_layers loss optimizer left_edge right_edge max_epochs take_best_epoch validation_split early_stopping random_negative_rate random_negative_constant random_negative_affinity_min random_negative_affinity_max random_negative_match_distribution random_negative_distribution_smoothing patience monitor min_delta verbose mode

timodonnell commented 6 years ago

That's correct, in general you can use newer versions of the mhcflurry software with older models, but not the other way around. New models tend to use architecture features not available in older versions of mhcflurry. I would recommend using the latest MHCflurry version regardless of what models you decide to go with.

If you notice any accuracy differences between different versions of MHCflurry using the same models, please let me know - that would likely be a bug.

I'm seeing that having models versioned separately from the codebase (e.g. you can use the 0.9.2 models with mhcflurry 1.2.0) is pretty confusing. I'll think of a way to clarify this situation for other users.
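The ValueError in the traceback above comes from a validity check on hyperparameter names: the older code rejects any key in a model's config that it does not recognize. The sketch below illustrates that kind of check. It is an illustration under stated assumptions, not mhcflurry's actual implementation; `find_unknown_keys` and `check_compatible` are hypothetical names, and the key lists in the usage note are abbreviated from the error message.

```python
def find_unknown_keys(model_hyperparameters, supported_keys):
    """Return the hyperparameter names the installed code does not know about."""
    supported = set(supported_keys)
    return [key for key in model_hyperparameters if key not in supported]


def check_compatible(model_hyperparameters, supported_keys):
    """Raise ValueError if the model config uses unsupported hyperparameters."""
    unknown = find_unknown_keys(model_hyperparameters, supported_keys)
    if unknown:
        raise ValueError(
            "No such model parameters: %s. Valid parameters are: %s"
            % (" ".join(unknown), " ".join(supported_keys))
        )
```

For example, a config containing `minibatch_size` or `learning_rate` fails against a supported list of only `layer_sizes`, `activation` and `dropout_probability`, while a config that uses just `layer_sizes` passes. This mirrors why newer models fail to load in older software while older models still load in newer software.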

weipenegHU commented 6 years ago

problem solved, thanks :)