Thanks for posting this, @weipenegHU. I ran the v1.2.1 models on your data and got a PPV of 0.279:
import mhcflurry
import pandas

# Load the models from a local directory, predict, and rank by predicted affinity (lowest nM first).
predictor = mhcflurry.Class1AffinityPredictor.load("/Users/tim/Downloads/models_class1.20180225 (2)/models")
df = pandas.read_csv("/Users/tim/Downloads/mchflurry_test.txt")
df["prediction"] = predictor.predict(peptides=df.peptide.values, alleles=df.allele.values)
sorted_df = df.sort_values("prediction").reset_index(drop=True)
print("PPV: ", (sorted_df.loc[sorted_df.hit > 0].index < len(sorted_df) * 0.001).mean())
Gives:
PPV: 0.279187817259
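For reference, here is a minimal sketch of the metric on synthetic data (not the dataset from this issue). The helper names `ppv_at_top_fraction` and `fraction_of_hits_in_top` are made up for illustration. The first counts hits among the top 0.1% of rows. The second mirrors the one-liner above, which measures the share of hits that land in the top 0.1%. The two agree when the number of hits equals the size of the top slice, and can differ otherwise.

```python
import numpy as np
import pandas as pd


def ppv_at_top_fraction(df, score_col="prediction", hit_col="hit", fraction=0.001):
    """Share of the top-ranked rows (lowest predicted nM first) that are hits."""
    n_top = max(1, int(round(len(df) * fraction)))
    top = df.sort_values(score_col, kind="stable").head(n_top)
    return float((top[hit_col] > 0).mean())


def fraction_of_hits_in_top(df, score_col="prediction", hit_col="hit", fraction=0.001):
    """Share of all hits whose rank falls inside the top fraction."""
    ranked = df.sort_values(score_col, kind="stable").reset_index(drop=True)
    hit_ranks = ranked.loc[ranked[hit_col] > 0].index
    return float((hit_ranks < len(ranked) * fraction).mean())


# Synthetic toy data: 2000 rows already in rank order, 4 hits (values are made up).
toy = pd.DataFrame({"prediction": np.arange(1, 2001, dtype=float), "hit": 0})
toy.loc[[0, 1, 500, 600], "hit"] = 1

ppv = ppv_at_top_fraction(toy)            # top 0.1% is 2 rows, both are hits
hit_share = fraction_of_hits_in_top(toy)  # 2 of the 4 hits are in the top 2 rows
print("PPV:", ppv, "share of hits in top:", hit_share)
```

On a dataset built with one hit per 1000 rows the two numbers coincide, so either form works there; on other hit ratios the explicit top-slice version is the one that matches the usual definition of PPV.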
Mind posting your code so we can try to reconcile the difference? Also, I'm curious where the dataset you're testing on came from.
Code:
shell script:
mhcflurry-predict ../test_data/HLA-A2402/mhcflurry_input.txt --out ../test_data/HLA-A2402/mhcflurry_affinity_predictions.csv
python script:
import pandas as pd

prediction = pd.read_csv("mhcflurry_affinity_predictions.csv")
hit = pd.read_csv("mchflurry_test.txt")
merge = pd.merge(hit[['peptide', 'hit']], prediction[['peptide', 'mhcflurry_prediction']], on='peptide')
sorted_merge = merge.sort_values('mhcflurry_prediction').reset_index(drop=True)
print("PPV: ", (sorted_merge.loc[sorted_merge.hit > 0].index < len(sorted_merge) * 0.001).mean())
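A side note on the merge step, as a hedged sketch: joining on `peptide` alone can silently duplicate rows if a peptide appears more than once, which would change the ranking. One way to guard against that is to join on both `allele` and `peptide` and let pandas validate the join. The peptides and values below are made up for illustration; only the column names come from the files in this thread.

```python
import pandas as pd

# Toy stand-ins for the hit file and the mhcflurry-predict output (made-up values).
hits = pd.DataFrame({
    "allele": ["HLA-A2402", "HLA-A2402"],
    "peptide": ["AAAAAAAAA", "CCCCCCCCC"],
    "hit": [1, 0],
})
predictions = pd.DataFrame({
    "allele": ["HLA-A2402", "HLA-A2402"],
    "peptide": ["CCCCCCCCC", "AAAAAAAAA"],
    "mhcflurry_prediction": [30000.0, 50.0],
})

# validate="one_to_one" raises pandas.errors.MergeError if either side
# has duplicate (allele, peptide) keys.
merged = pd.merge(
    hits,
    predictions[["allele", "peptide", "mhcflurry_prediction"]],
    on=["allele", "peptide"],
    how="left",
    validate="one_to_one",
)

# Every input row should survive the join and have a prediction.
row_count_preserved = len(merged) == len(hits)
all_predicted = bool(merged["mhcflurry_prediction"].notna().all())
print(row_count_preserved, all_predicted)
```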
Here is my affinity prediction. Any chance that we used different models? (Due to the size restriction, I uploaded only a sample of the prediction results.) mhcflurry_affinity_predictions_sample.txt
The positive data (mass spec data) are from "MHC class I–associated peptides derive from selective regions of the human genome", which is the independent test set used by netMHCpan4.0. The negative data are decoys randomly selected from the proteome.
Yeah, it seems like we may be using different models somehow; my predictions don't match yours. Looking at the top row in the prediction file you sent, you have a prediction of 55.9547162806:
HLA-A2402,ALPSKLPTF,55.9547162806,24.4582190211,191.28437783599998
but I'm seeing 196.299:
$ mhcflurry-predict --alleles HLA-A2402 --peptides ALPSKLPTF
allele,peptide,mhcflurry_prediction,mhcflurry_prediction_low,mhcflurry_prediction_high,mhcflurry_prediction_percentile
HLA-A2402,ALPSKLPTF,196.29918048677726,80.74734568613658,572.4939508616925,1.2248749999999995
Could you send me the output from the predict command above as well as your output for the following commands?
mhcflurry-downloads info
mhcflurry-predict --version
bzcat "$(mhcflurry-downloads path models_class1)/LOG.txt.bz2" | head
Thanks
Tim
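To check for this kind of mismatch across a whole prediction file rather than one row, something like the following sketch could help. The helper name `mismatched_predictions` and the relative tolerance of 1e-3 are assumptions chosen for illustration; the example values are the ones quoted in this thread for HLA-A2402 / ALPSKLPTF.

```python
import numpy as np
import pandas as pd


def mismatched_predictions(a, b, rtol=1e-3):
    """Return rows where two prediction tables disagree beyond a relative tolerance."""
    merged = pd.merge(a, b, on=["allele", "peptide"], suffixes=("_a", "_b"))
    close = np.isclose(
        merged["mhcflurry_prediction_a"].to_numpy(),
        merged["mhcflurry_prediction_b"].to_numpy(),
        rtol=rtol,
    )
    return merged.loc[~close]


def one_row(value):
    """Build a single-row prediction table for the peptide discussed above."""
    return pd.DataFrame({
        "allele": ["HLA-A2402"],
        "peptide": ["ALPSKLPTF"],
        "mhcflurry_prediction": [value],
    })


from_uploaded_file = one_row(55.9547162806)
from_newer_models = one_row(196.29918048677726)

different = mismatched_predictions(from_uploaded_file, from_newer_models)
print(len(different))  # the two predictions differ far beyond the tolerance
```

Small floating-point differences between runs of the same models stay under the tolerance, so only genuine model mismatches are reported.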
predict command:
$ ./bin/mhcflurry-predict --alleles HLA-A2402 --peptides ALPSKLPTF --models ../mhcflurry-master/downloads-generation/models_class1/models
allele,peptide,mhcflurry_prediction,mhcflurry_prediction_low,mhcflurry_prediction_high
HLA-A2402,ALPSKLPTF,55.9547102663,24.4582103458,191.284377836
$ ./bin/mhcflurry-downloads info
(FYI, I downloaded the necessary files manually because of the firewall.)
Environment variables
MHCFLURRY_DATA_DIR [unset or empty]
MHCFLURRY_DOWNLOADS_CURRENT_RELEASE [unset or empty]
MHCFLURRY_DOWNLOADS_DIR [unset or empty]
MHCFLURRY_DEFAULT_CLASS1_MODELS [unset or empty]
Configuration
current release = 1.2.0
downloads dir = /home/huweipeng/.local/share/mhcflurry/4/1.2.0 [does not exist]
DOWNLOAD NAME DOWNLOADED? URL
models_class1 NO https://github.com/openvax/mhcflurry/releases/download/pre-1.2/models_class1.20180225.tar.bz2
models_class1_selected_no_mass_spec NO https://github.com/openvax/mhcflurry/releases/download/pre-1.2/models_class1_selected_no_mass_spec.20180225.tar.bz2
models_class1_unselected NO https://github.com/openvax/mhcflurry/releases/download/pre-1.2/models_class1_unselected.20180221.tar.bz2
models_class1_minimal NO https://github.com/openvax/mhcflurry/releases/download/pre-1.2/models_class1_minimal.20180226.tar.bz2
data_iedb NO https://github.com/openvax/mhcflurry/releases/download/pre-1.0/data_iedb.tar.bz2
data_published NO http://github.com/openvax/mhcflurry/releases/download/pre-1.1/data_published.tar.bz2
data_systemhcatlas NO http://github.com/openvax/mhcflurry/releases/download/pre-1.1/data_systemhcatlas.tar.bz2
data_curated NO https://github.com/openvax/mhcflurry/releases/download/pre-1.2/data_curated.20180219.tar.bz2
mhcflurry-predict --version
mhcflurry 1.2.0
bzcat "$(mhcflurry-downloads path models_class1)/LOG.txt.bz2" | head
Thu Aug 3 14:49:28 EDT 2017
+ date
+ pip freeze
appdirs==1.4.0
backports.weakref==1.0rc1
biopython==1.68
bioseq==0.1.0
bitmath==1.3.1.2
bleach==1.5.0
bottle==0.12.9
It looks like you're using old models.
This is the output you want to see for the latest models (note the Sat Feb 24 date):
$ bzcat "$(mhcflurry-downloads path models_class1)/LOG.txt.bz2" | head
+ date
Sat Feb 24 15:02:08 EST 2018
+ pip freeze
absl-py==0.1.10
alabaster==0.7.10
anaconda-client==1.6.5
anaconda-navigator==1.6.9
anaconda-project==0.8.0
appdirs==1.4.3
asn1crypto==0.22.0
Can you try downloading https://github.com/openvax/mhcflurry/releases/download/pre-1.2/models_class1.20180225.tar.bz2 , unpacking it, and running predict using those models?
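If `bzcat` is not handy, the same check can be sketched in Python with the standard library. The helper name `log_head` is made up for illustration, and the demo writes a tiny stand-in `LOG.txt.bz2` into a temporary directory instead of touching a real models directory.

```python
import bz2
import os
import tempfile
from itertools import islice


def log_head(models_dir, n_lines=10):
    """Return the first n_lines of LOG.txt.bz2 found in models_dir."""
    path = os.path.join(models_dir, "LOG.txt.bz2")
    with bz2.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n_lines)]


# Demo with a stand-in log file; the content copies the first lines shown above.
with tempfile.TemporaryDirectory() as demo_dir:
    with bz2.open(os.path.join(demo_dir, "LOG.txt.bz2"), "wt") as handle:
        handle.write("+ date\nSat Feb 24 15:02:08 EST 2018\n+ pip freeze\n")
    head = log_head(demo_dir)

print("\n".join(head))
```

Pointing `log_head` at the directory printed by `mhcflurry-downloads path models_class1`, or at a manually unpacked models directory, shows which training run the models came from.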
Yeah, I got the correct result this time. However, I don't know what you mean by old models; the v0.9.2 models seem to work fine for me. Another problem is that I failed to run the mass spec models through the 0.9.2 predict command. Here is the output:
Traceback (most recent call last):
File "/hwfssz1/BIGDATA_COMPUTING/software/tools/anaconda/bin/mhcflurry-predict", line 11, in <module>
sys.exit(run())
File "/hwfssz1/BIGDATA_COMPUTING/software/tools/anaconda/lib/python3.6/site-packages/mhcflurry/predict_command.py", line 151, in run
predictor = Class1AffinityPredictor.load(models_dir)
File "/hwfssz1/BIGDATA_COMPUTING/software/tools/anaconda/lib/python3.6/site-packages/mhcflurry/class1_affinity_prediction/class1_affinity_predictor.py", line 199, in load
model = Class1NeuralNetwork.from_config(config, weights=weights)
File "/hwfssz1/BIGDATA_COMPUTING/software/tools/anaconda/lib/python3.6/site-packages/mhcflurry/class1_affinity_prediction/class1_neural_network.py", line 212, in from_config
instance = cls(**config.pop('hyperparameters'))
File "/hwfssz1/BIGDATA_COMPUTING/software/tools/anaconda/lib/python3.6/site-packages/mhcflurry/class1_affinity_prediction/class1_neural_network.py", line 101, in __init__
hyperparameters)
File "/hwfssz1/BIGDATA_COMPUTING/software/tools/anaconda/lib/python3.6/site-packages/mhcflurry/hyperparameters.py", line 58, in with_defaults
self.check_valid_keys(obj)
File "/hwfssz1/BIGDATA_COMPUTING/software/tools/anaconda/lib/python3.6/site-packages/mhcflurry/hyperparameters.py", line 87, in check_valid_keys
" ".join(self.defaults)))
ValueError: No such model parameters: minibatch_size peptide_amino_acid_encoding train_data allele_dense_layer_sizes peptide_dense_layer_sizes peptide_allele_merge_method peptide_allele_merge_activation learning_rate. Valid parameters are: kmer_size use_embedding embedding_input_dim embedding_output_dim pseudosequence_use_embedding layer_sizes dense_layer_l1_regularization dense_layer_l2_regularization activation init output_activation dropout_probability batch_normalization embedding_init_method locally_connected_layers loss optimizer left_edge right_edge max_epochs take_best_epoch validation_split early_stopping random_negative_rate random_negative_constant random_negative_affinity_min random_negative_affinity_max random_negative_match_distribution random_negative_distribution_smoothing patience monitor min_delta verbose mode
That's correct. In general you can use newer versions of the mhcflurry software with older models, but not the other way around: new models tend to use architecture features that are not available in older versions of mhcflurry. I would recommend using the latest MHCflurry version regardless of which models you decide to go with.
If you notice any accuracy differences between different versions of MHCflurry using the same models, please let me know. That would likely be a bug.
I'm seeing that having models versioned separately from the codebase (e.g. you can use the 0.9.2 models with mhcflurry 1.2.0) is pretty confusing. I'll think of a way to clarify this situation for other users.
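To illustrate why an older release rejects newer models, here is a simplified sketch of the kind of key check that produces the ValueError above. This is not mhcflurry's actual implementation; the function name `unknown_hyperparameters` is hypothetical, the parameter names are a small subset taken from the error message, and the values are made up.

```python
def unknown_hyperparameters(model_hyperparameters, valid_keys):
    """Return the hyperparameter names that this software version does not know."""
    return sorted(set(model_hyperparameters) - set(valid_keys))


# Subset of the "Valid parameters" the older release listed in the error message.
known_to_old_release = {"kmer_size", "layer_sizes", "activation", "dropout_probability"}

# Hyperparameters as they might appear in a newer model's config (values are made up).
newer_model_config = {
    "layer_sizes": [32],
    "minibatch_size": 128,
    "learning_rate": 0.001,
    "peptide_amino_acid_encoding": "BLOSUM62",
}

rejected = unknown_hyperparameters(newer_model_config, known_to_old_release)
if rejected:
    print("No such model parameters:", " ".join(rejected))
```

The check is asymmetric: a newer release that knows a superset of the keys accepts an older model's config, while an older release fails on any key it has never seen.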
problem solved, thanks :)
Hi, I found a huge performance difference between v0.9.2 and v1.2.1 when calculating the 0.1% PPV on an A2402 test dataset (I sorted the results by predicted affinity and took the top 0.1% to calculate the fraction of true positives, e.g. selecting 591 records from 591000). PPV 0.9.2: 0.32. PPV 1.2.1: 0. I also found that I couldn't use the mass spec models in v0.9.2. I have attached the test data here so the problem can be reproduced: mchflurry_test.zip
The models I used in v0.9.2 were downloaded from: https://github.com/openvax/mhcflurry/releases/download/0.9.2/models_class1.tar.bz2
The models I used in v1.2.1 were downloaded from: https://github.com/openvax/mhcflurry/releases/download/pre-1.2/models_class1.20180225.tar.bz2