griffithlab / pVACtools

http://www.pvactools.org
BSD 3-Clause Clear License
129 stars 58 forks source link

Method for decomposing MT-WT aa sequences #1112

Closed nttvy closed 16 hours ago

nttvy commented 1 month ago

Dear pVACtools authors,

Thank you so much for the impressive work. We have used the tools in many of our analyses. We are now working on a specific analysis in which we need to decompose the long MT and WT sequences into short MT and WT sequences. May we know in detail about how this was done? An indication of the specific functions/methods would be great.

Best regards, Vy

susannasiebert commented 1 month ago

I'm not sure if I'm understanding your question correctly but retrieving short sequence windows from a longer sequence can be done via a python loop:

    def determine_neoepitopes(self, sequence, length):
        epitopes = []
        for i in range(0, len(sequence)-length+1):
            epitopes.append(sequence[i:i+length])
        return epitopes

This example takes advantage of the fact that Python strings are treated like arrays of character so you can extract any substring the same way you can get a subset from an array: long_string[start:end]

nttvy commented 1 month ago

Dear @susannasiebert

Thanks for your response. I was asking about another thing. I realized that not all possible short MT sequences were reported and paired with short WT, specifically in the case of insertions and deletions. For example, for the MT sequence AAAAAAANILSSSCGAPSPTKPKTKPTWRCE and the WT sequence AAAAAAANILSSSTKPKTKPTWRCE, the sequences CGAPSPTKPK, SSSCGAPSPT are not paired. I would like to know more details about these cut and pairing rules.

Best regards, Vy

susannasiebert commented 1 month ago

A lot of this logic can be found in pvactools/lib/output_parser.py but this logic is dependent on the format of the output tmp files from the prediction calls.

To preface this, pVACseq will always return all possible wildtype subsequences.

For indels, we first determine the best match since there can be two candidates when the epitope fully overlaps the indel. In that scenario the two candidates are the ones that left anchor and right anchor to the mutant epitope. In your example those would be: candidate 1: SSSCGAPSPT MT SSSTKPKTKP WT1 candidate 2: SSSCGAPSPT MT AAANILSSST WT2 Between these two candidates SSSTKPKTKP and AAANILSSST pVACseq selects the one with the longer stretch of matching amino acids, SSSTKPKTKP.

We then check how many amino acids are different between the mutant epitope and the selected candidate wildtype. If more than half of the amino acids of the mutant epitope are different we do not select a matched wildtype (NA). In your example your epitope length is 10 so this would be 5 amino acids or more. In your case there is a 7 amino acid difference between SSSCGAPSPT and SSSTKPKTKP: the 6 amino acid insertion plus the last position so there is no matched wildtype.

susannasiebert commented 16 hours ago

Closing this issue due to inactivity but please do reopen if you have additionally questions about the algorithm.