amino acids sometimes doesn't match the reference or variant

zktuong commented 11 months ago

Hi,

We (@ali-harasty and myself) are getting confused by some of the results in the final neoantigen results table e.g.MHC_Class_I_all_epitopes_ccf_ref file.

Some of the mutations makes sense but some don't.

For example in this following screenshot, the first row is a C->G nucleotide mutation, resulting in a Q->E mutation at the amino acid level: Q can be CAA or CAG and E can be GAA or GAG. so the change from the first C to the first G makes perfect sense. The reference sequence matching the mutated codon sequence can also be found (GRCh38.d1.vd1.fa from the nextNEOpi resource bundle).

However, if you look at the second one, it's a G->A for a R->C mutation

However, there's no way this can happen as there's no A in the codon encoding Cysteine. R: CGT, CGC, CGA, CGG, AGA, AGG C: TGT, TGC

You see this occuring a couple more times where it's just not possible.

Sometimes the reference is also not possible e.g. the 4th row's mutation (C->G/R->T). The variant here is C and on the reference the possible flanking codon sequences are 'CTT', 'TCT', 'GTC' (Stop column position is the correct position) but none of these match to R. The variant here is also not possible.

R: CGT, CGC, CGA, CGG, AGA, AGG T: ACT, ACC, ACA, ACG

So i went through the whole table and it's just not possible for a lot of the results:

import pandas as pd
from pyfaidx import Fasta

# import reference and data
ref = Fasta(
    filename="/scratch/project/tcr_neoantigen/resources/references/hg38/gdc/GRCh38.d1.vd1/fasta/GRCh38.d1.vd1.fa"
)
df = pd.read_csv(
    "/scratch/project/tcr_neoantigen/results/blood_results_171123/neoantigens/bloodXTumour/Class_I/bloodXTumour_MHC_Class_I_all_epitopes_ccf_ref_match.tsv",
    sep="\t",
)
# drop to unique Stop, Reference, Variant rows
df = df.drop_duplicates(["Stop", "Reference", "Variant"])
df.to_csv(
    "/scratch/project/tcr_neoantigen/results/blood_results_171123/neoantigens/bloodXTumour/Class_I/bloodXTumour_MHC_Class_I_all_epitopes_ccf_ref_match_small.tsv",
    sep="\t",
    index=False,
)
codon_dict = {
    "S": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "C": ["TGT", "TGC"],
    "W": ["TGG"],
    "E": ["GAA", "GAG"],
    "D": ["GAT", "GAC"],
    "P": ["CCT", "CCC", "CCA", "CCG"],
    "V": ["GTT", "GTC", "GTA", "GTG"],
    "N": ["AAT", "AAC"],
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "Y": ["TAT", "TAC"],
    "I": ["ATT", "ATC", "ATA"],
    "Q": ["CAA", "CAG"],
    "F": ["TTT", "TTC"],
    "R": ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "T": ["ACT", "ACC", "ACA", "ACG"],
    "*": ["TAA", "TAG", "TGA"],
    "A": ["GCT", "GCC", "GCA", "GCG"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
    "H": ["CAT", "CAC"],
}

mut_list = list(df["Stop"])
chrom_list = (list(df["Chromosome"]),)
ref_list = list(df["Reference"])
var_list = list(df["Variant"])
aa_change_list = list(df["Mutation"])

seq_possible_result, mut_possible_result = [], []
for i in range(len(mut_list)):
    var = var_list[i]
    aa_change = aa_change_list[i]
    aa_ref = aa_change.split("/")[0]
    aa_mut = aa_change.split("/")[1]
    seq_possible, mut_possible = [], []
    seqx, mutx = [],[]
    if (len(aa_ref) == 1) & (len(aa_mut) == 1):  # focus on SNV for now
        if aa_mut != "X":  # unknown change?
            codon_ref = codon_dict[aa_ref]
            codon_mut = codon_dict[aa_mut]
            # get the 3 possible codons:
            # x - -
            # - X -
            # - - X
            start_pos_shift_dict = {0: 0, 1: -1, 2: -2}
            start_pos_shift_dict_for_mut = {0: 0, 1: 1, 2: 2}
            end_pos_shift_dict = {0: 2, 1: 1, 2: 0}
            end_pos_shift_dict_for_mut = {0: 1, 1: 2, 2: 3}
            for pos_shift in [0, 1, 2]:
                seq = str(
                    ref.get_seq(
                        str(chrom_list[0][i]),
                        mut_list[i] + start_pos_shift_dict[pos_shift],
                        mut_list[i] + end_pos_shift_dict[pos_shift],
                    )
                )
                mut = (
                    seq[: start_pos_shift_dict_for_mut[pos_shift]]
                    + var
                    + seq[end_pos_shift_dict_for_mut[pos_shift] :]
                )
                seq_possible.append(seq in codon_ref)
                mut_possible.append(mut in codon_mut)
    seq_possible_result.append(1 if any(seq_possible) else 0)
    mut_possible_result.append(1 if any(mut_possible) else 0)

possible = pd.DataFrame([seq_possible_result, mut_possible_result]).T
possible.columns = ["ref_possible", "mut_possible"]
possible.to_csv("/scratch/project/tcr_neoantigen/results/blood_results_171123/neoantigens/bloodXTumour/Class_I/bloodXTumour_MHC_Class_I_check_possible.tsv", sep="\t", index=False,)

I've attached the results here and you can take a look Archive.zip

Are we missing something?

riederd commented 11 months ago

Yes, I think you are indeed missing something:

Please note that almost all of your mutations in questions are on transcripts that are coded on the - strand of the corresponding DNA so you need to check the reverse complement.

The remaining mutations in question (which are on the + strand) are resulting from indels/frameshifts for which your code defaults to (0, 0).

For your reference I attached a table with the strand information of the transcripts that you are suspecting incorrect variants: mart_export_GH_68.txt

HTH

zktuong commented 11 months ago

oh ok thanks! that makes a lot of sense!

icbi-lab / nextNEOpi

amino acids sometimes doesn't match the reference or variant #68