taylor-lab / neoantigen-dev

neoantigen prediction from WES/WGS

Error: not one of the known HGVSc strings: c.-1_1dupAA #7

Open evanbiederstedt opened 4 years ago

evanbiederstedt commented 4 years ago

Let's make sure this wasn't a one-off error. CC @gongyixiao @kpjonsson

If it's never seen again, we can ignore it.

kpjonsson commented 4 years ago

It's not matching this regex: https://github.com/taylor-lab/neoantigen-dev/blob/62f75a8ba911e2675dfe48e180c41b9cff760126/neoantigen.py#L691

I'm not sure whether this is a misformatted string or not. @gongyixiao, was this using a MAF annotated with vcf2maf/maf2maf?

kpjonsson commented 4 years ago

@cband Is this a type of HGVSc string that should be captured by the regex?

This is the variant:

Chromosome         20
Start_Position     18446001
End_Position       18446002
Reference_Allele   -
Tumor_Seq_Allele2  TT
Hugo_Symbol        DZANK1
HGVSc              c.-1_1dupAA
HGVSp              p.Met1?

anoronh4 commented 3 years ago

This situation has appeared again and is mentioned here: https://github.com/mskcc/tempo/issues/838

I would propose replacing the parsing of dup|ins|del|inv HGVSc strings with the following:

        elif re.match(r'^c\..*_(-?\d+).*(dup)([ATCG]+)$', hgvsc):
            position, hgvsc_type, sequence = re.match(r'^c\..*_(-?\d+).*(dup)([ATCG]+)$', hgvsc).groups()

        elif re.match(r'^c\.(-?\d+).*(dup|ins|del|inv)([ATCG]+)$', hgvsc):
            position, hgvsc_type, sequence = re.match(r'^c\.(-?\d+).*(dup|ins|del|inv)([ATCG]+)$', hgvsc).groups()

        else:
            sys.exit('Error: not one of the known HGVSc strings: ' + hgvsc)

        position = int(position) - 1
        if hgvsc_type in ('dup', 'ins'):
            alt_allele = sequence
        elif hgvsc_type == 'del':
            ref_allele = sequence
        elif hgvsc_type == 'inv':
            ref_allele = sequence
            alt_allele = self.reverse_complement(sequence)
        ref_allele = ref_allele if position > -1 else ref_allele[position * -1:]
        alt_allele = alt_allele if position > -1 else alt_allele[position * -1:]

        ## start of mutated region in CDS
        cds = re.search(self.cds_seq + '.*', self.cdna_seq).group()

        seq_5p = cds[0:position] if position > -1 else ''
        seq_3p = cds[position:len(cds)] if position > -1 else cds

        #print self.hgvsp + '\t' + self.variant_class + '\t' + self.variant_type + '\t' + self.ref_allele + '\t' + self.alt_allele + \
        #      '\t' + self.cds_position + '\nFull CDS: ' + self.cds_seq + '\nSeq_5: ' + seq_5p + '\nSeq_3' + seq_3p + '\n>mut_1--' + mut_cds_1 + '\n>mut_2--' + mut_cds_2 + '\n>mut_3--' + mut_cds_3
        self.wt_cds = seq_5p + ref_allele + seq_3p[len(ref_allele):len(seq_3p)]
        self.mt_cds = seq_5p + alt_allele + seq_3p[len(ref_allele):len(seq_3p)]

For dup variants I preferred the number after the underscore: the first position, before the underscore, refers to the start of the reference allele, whereas the second position is the start of the alt allele. If there is no underscore, the string can be processed like the others (ins|del|inv).
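
For anyone who wants to try HGVSc strings outside the pipeline, here is a minimal standalone sketch of the two-pattern parsing described above. The function name `parse_indel_hgvsc`, the precompiled pattern names, and raising `ValueError` instead of calling `sys.exit` are assumptions made for illustration and are not part of neoantigen.py. Both patterns accept a leading minus sign on the captured position.

```python
import re

# Range-style dup: capture the position AFTER the underscore.
DUP_RANGE = re.compile(r'^c\..*_(-?\d+).*(dup)([ATCG]+)$')
# Everything else (including dup without an underscore): capture the first position.
SINGLE_POS = re.compile(r'^c\.(-?\d+).*(dup|ins|del|inv)([ATCG]+)$')


def parse_indel_hgvsc(hgvsc):
    """Hypothetical helper: return (position - 1, hgvsc_type, sequence).

    Raises ValueError when neither pattern matches.
    """
    match = DUP_RANGE.match(hgvsc) or SINGLE_POS.match(hgvsc)
    if match is None:
        raise ValueError('Error: not one of the known HGVSc strings: ' + hgvsc)
    position, hgvsc_type, sequence = match.groups()
    # Same "minus one" shift that the proposal applies after parsing.
    return int(position) - 1, hgvsc_type, sequence


if __name__ == '__main__':
    for example in ('c.-1_1dupAA', 'c.5_6dupAT', 'c.10_12delATG'):
        print(example, parse_indel_hgvsc(example))
```

With this sketch, `c.-1_1dupAA` from the report above parses to position 0, type `dup`, sequence `AA` instead of hitting the error branch.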

Looking forward to hearing someone's thoughts on this solution.