HLA Pseudosequence generation

Hugh-OBrien commented 2 years ago

Thanks for making the test code for the tool available.

I have a query regarding how the HLA pseudosequences are generated.

Here there are hard-coded indexes for generating the pseudosequences; however, my understanding was that an alignment is needed before using these indexes, since the HLAs in the FASTAs you've used are of varying length. After alignment, the indexes from the original netMHCpan paper describing the method wouldn't necessarily be correct for your HLA sequences.

If you look at the HLA analysis in netMHCpan (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0000796), the pseudosequences have an expected pattern which I don't think holds using your method, indicating you're not using the same pseudosequences they are (at least at test time).

MHCflurry used what looks like a similar set of HLA FASTAs to yours, and after their alignment they start at index 31, not 7.
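To illustrate the concern, here is a minimal sketch of why fixed indexes only make sense relative to an alignment. It is not the pMTnet, netMHCpan, or MHCflurry code: the function names, the toy sequences, and the positions `[1, 5, 9]` are all hypothetical, and the aligned strings are assumed to come from whatever multiple-sequence aligner is used upstream.

```python
# Hypothetical helpers, for illustration only (not taken from any of the tools above).

def reference_to_column_map(aligned_ref, gap="-"):
    """Map 1-based residue positions of the ungapped reference to 0-based alignment columns."""
    mapping = {}
    position = 0
    for column, residue in enumerate(aligned_ref):
        if residue != gap:
            position += 1
            mapping[position] = column
    return mapping


def pseudosequence(aligned_seq, aligned_ref, ref_positions, gap="-", fill="X"):
    """Pick the residues of aligned_seq that sit in the same columns as ref_positions.

    ref_positions are 1-based positions in the ungapped reference sequence.
    A gap in aligned_seq at a selected column is reported as `fill`.
    """
    if len(aligned_seq) != len(aligned_ref):
        raise ValueError("aligned sequences must have the same length")
    column_of = reference_to_column_map(aligned_ref, gap)
    residues = []
    for position in ref_positions:
        residue = aligned_seq[column_of[position]]
        residues.append(fill if residue == gap else residue)
    return "".join(residues)


def naive_pseudosequence(raw_seq, positions):
    """Apply the same 1-based positions directly to an unaligned sequence."""
    return "".join(raw_seq[p - 1] for p in positions)


# Toy data: the query carries two extra leading residues relative to the reference.
aligned_ref = "--ACDEFGHIK"
aligned_query = "MMACDEYGHIK"
positions = [1, 5, 9]  # made-up "contact" positions in the reference

aligned_result = pseudosequence(aligned_query, aligned_ref, positions)
naive_result = naive_pseudosequence(aligned_query.replace("-", ""), positions)
print(aligned_result, naive_result)
```

With the toy data, the alignment-aware version selects the residues that line up with the reference positions, while the naive version is shifted by the two extra leading residues and selects different ones. The same kind of offset would explain why the starting index differs between tools that use different sequence sets.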

Did you use a different method for training? If not, it is possible that the network is mainly performing an accurate peptide-TCR match. The HLAs are still being encoded, but not in a way that preserves the likely contact points.

Apologies if I've missed part of the implementation which addresses this!

tianshilu commented 2 years ago

Hi @Hugh-OBrien,

Thank you for carefully looking into the code and pointing out the problem! After reviewing the pseudo-sequence method in both netMHCpan and MHCflurry, I think you are right that we need to align the HLA sequences before taking pseudo-sequences. Even though our pseudo-sequences are not the same as netMHCpan's, the performance is still good compared to netMHCpan. I will comment on this caveat in the code. We are also working on pMTnet version 2.0 and will take this caveat into consideration. Thank you again for looking into this and letting us know!

Tianshi

Hugh-OBrien commented 2 years ago

Thank you for the reply. Yes, in my quick retraining test I found there was a modest performance hit on the MHC-peptide binding affinity; however, as this is only the encoder branch, it may not affect the performance of the overall model in a significant way. It would be good to add a caveat in a comment, as it may increase the performance of future work.

tianshilu commented 2 years ago

I totally agree with you. Thanks again for your input!

tianshilu / pMTnet

HLA Pseudosequence generation #8