openvax / mhcflurry

Peptide-MHC I binding affinity prediction
http://openvax.github.io/mhcflurry/
Apache License 2.0

docker #173

Closed amomin-pact closed 4 years ago

amomin-pact commented 4 years ago

Hi, I just saw the new Docker container for mhcflurry. Does one exist for MHCflurry 1.6.0? I didn't see tags for other versions in the Docker Hub builds.

Additionally, are the various models downloaded when the image is created from the Dockerfile?

Thanks Amin

timodonnell commented 4 years ago

We don't have it for MHCflurry 1.6 unfortunately. It was added as part of the 2.0 release. The models do get downloaded when the image is created and are part of the image. Let me know if you have any issues using it.
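For reference, a usage sketch of the 2.0 image (the image name, tag, and example allele/peptide below are assumptions, not confirmed details):

```shell
# Sketch only: image name/tag are assumptions.
docker pull openvax/mhcflurry:latest

# The models are baked into the image at build time, so prediction
# should work without a separate mhcflurry-downloads fetch step:
docker run --rm openvax/mhcflurry:latest \
    mhcflurry-predict --alleles HLA-A0201 --peptides SIINFEKL
```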

amomin-pact commented 4 years ago

Hello Tim, Looking at the release notes, it doesn't seem that you made major changes since v1.6, apart from updating the model training criteria and porting the code to 2.0 (that's a big one). If one downloads the models for v1.6, does one still get the original v1.6 models? I assume the model files are saved by individual version.

Thanks Amin

timodonnell commented 4 years ago

That's right. If you pip install mhcflurry 1.6 you'll get the models for 1.6.

It should also be possible to use the models from 1.6 with the mhcflurry 2.0 codebase by downloading the models separately (see the URLs in downloads.yml) and then passing --models when you call predict, see here for an example. But I think that may only work for affinity prediction and not processing prediction due to model serialization changes in tensorflow 2.
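A sketch of that workflow (the archive filename and paths below are placeholders; the real URL would come from downloads.yml as noted above):

```shell
# Sketch only: filename and directory are placeholders, not the real URL.
# 1. Download and unpack a 1.6 models archive (URL from downloads.yml):
mkdir -p /tmp/models_class1_v1.6
tar -xjf models_class1.tar.bz2 -C /tmp/models_class1_v1.6

# 2. Point the mhcflurry 2.0 predict command at the unpacked directory:
mhcflurry-predict input.csv \
    --models /tmp/models_class1_v1.6 \
    --out predictions.csv
```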

amomin-pact commented 4 years ago

Hello Tim, Thanks for the update. I also see that there is a notebook section with the new release. I would appreciate it if you could add a notebook depicting the steps of model training. It would be great to see how the data is prepared and how the model training scripts are executed.

Thanks Amin

timodonnell commented 4 years ago

Good idea - I can look into adding an example of model training as a notebook.

In terms of training the production models that are available for download, you need a cluster with GPUs to do this in a reasonable amount of time, but the scripts used are here:

Affinity predictor: https://github.com/openvax/mhcflurry/blob/master/downloads-generation/models_class1_pan/GENERATE.sh

AP predictor: https://github.com/openvax/mhcflurry/blob/master/downloads-generation/models_class1_processing/GENERATE.sh

PS predictor: https://github.com/openvax/mhcflurry/blob/master/downloads-generation/models_class1_presentation/GENERATE.sh

amomin-pact commented 4 years ago

Hello Tim,

Thanks for your response. I was looking at the code used to generate the models: https://github.com/openvax/mhcflurry/blob/master/downloads-generation/models_class1/GENERATE.sh

Does write_validation_data.py generate the data for building the model with mhcflurry-class1-select-allele-specific-models ?

Another question: how does one compare a model's accuracy to older/other models? I see AUC and PPV metrics in your manuscript (Cell Systems 2020). Is that code available in your repo? I would appreciate it if you could point me to its location.

Thanks

timodonnell commented 4 years ago

Those are actually the old allele-specific models. The new pan-allele models are generated in:

https://github.com/openvax/mhcflurry/tree/master/downloads-generation/models_class1_pan

If you do want to fit allele-specific models, have a look at the models_class1_unselected download, which fits a large number of possible models for each allele. The models_class1 download that you mentioned is doing the model selection, based on validation data that is written out using the write_validation_data.py script.

For your second question, I compute AUC using the scikit learn routine: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
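As a toy illustration of that AUC computation (the labels and scores below are made up, with higher scores meaning a stronger predicted binder):

```python
from sklearn.metrics import roc_auc_score

# Made-up data: 1 = binder, 0 = non-binder.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # predicted scores, higher = more likely binder

# AUC is the probability a random binder outranks a random non-binder.
print(roc_auc_score(y_true, scores))  # 0.75
```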

Here is code to compute PPV:

import pandas

def ppv(y_true, predictions):
    # Positive predictive value among the top-n predictions,
    # where n is the number of true positives in the data.
    df = pandas.DataFrame({"prediction": predictions, "y_true": y_true})
    return df.sort_values("prediction", ascending=False)[:int(df.y_true.sum())].y_true.mean()
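For completeness, here is that helper applied to a tiny made-up example (the data below is illustrative only):

```python
import pandas

def ppv(y_true, predictions):
    # Positive predictive value among the top-n predictions,
    # where n is the number of true positives in the data.
    df = pandas.DataFrame({"prediction": predictions, "y_true": y_true})
    return df.sort_values("prediction", ascending=False)[:int(df.y_true.sum())].y_true.mean()

# Made-up data: 3 true binders out of 5 peptides.
y_true = [0, 1, 0, 1, 1]
preds = [0.1, 0.9, 0.3, 0.8, 0.2]

# 2 of the top-3-scored peptides are true binders -> PPV = 2/3.
print(ppv(y_true, preds))
```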

Hope that helps.